Transgressive Programming: the magic of breaking abstractions
In programming as in social life, there are boundaries we try not to violate: we build software with abstractions, boundaries between the complexity beneath and the utility we want to achieve. You can transgress these boundaries; you can bypass abstractions, or rely on implementation details. The accepted wisdom, however, is that good programming sticks to abstraction boundaries.
But programming requires varying mindsets for varying situations and goals, and good habits in one situation might not suffice in another. Performance is a case in point: if you want to write fast code, sometimes you have to write code that violates abstraction boundaries.
For problems like performance and debugging, the usual good habits of programming—sticking to abstraction boundaries, not making assumptions about implementation—are often not sufficient. In these situations, you might need to deliberately violate abstraction boundaries, writing what feels like bad code.
For lack of a better term, I’m going to call this transgressive programming. The term was inspired by Siderea’s classic essay The Asshole Filter, which points out that being an asshole is about being transgressive, about violating social boundaries and rules.
I deliberately chose a term with negative connotations because there’s a reason this isn’t the default: breaking boundaries is not a thing to do lightly. But sometimes it is necessary. Some of what is called “systems programming” is transgressive programming, but so is writing a browser polyfill, or testing tricky code by monkeypatching.
Let’s see why “good programming” involves sticking to abstraction boundaries, and some of the situations where that’s not sufficient.
Why we stick to abstractions
In order to see why we usually try hard to stick to abstraction boundaries, let’s consider two kinds of transgression.
Transgression #1: Relying on implementation details
Let’s compare two numbers in Python:
>>> x = 12
>>> y = 12
>>> x == y
True
As expected.
There’s another way we can compare these numbers: using the is operator, which compares identity, i.e. whether two objects are pointing to the same location in memory.
>>> x is y
True
Does that mean we can use is instead of ==?
Let’s try another number.
>>> x = 123456
>>> y = 123456
>>> x == y
True
>>> x is y
False
As expected, x == y is True, but x is y behaved differently than it did for the small number.
Here’s why: whenever you create a number in Python, it allocates a whole new object.
This is slow and has memory overhead, so the default implementation of Python, known as CPython, has an optimization: small integers (currently those from -5 through 256) are cached and reused.
Every 12 in Python is the same object as every other 12, but that’s not true for large integers.
This is why you shouldn’t use is for logical equality: == compares whether the two objects have the same value, even if they’re in different places in memory.
In contrast, is checks whether two objects point to the same address in memory, which for integers at least can vary based on their value.
Using is for equality of arbitrary objects violates an abstraction boundary by relying on implementation details: whether two equal objects happen to share a memory address is up to the implementation.
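If you want to see those addresses for yourself, id() exposes them (in CPython id() is the object’s memory address; other implementations only guarantee a unique integer):
>>> x = 12
>>> y = 12
>>> id(x) == id(y)   # one shared, cached object
True
>>> x = 123456
>>> y = 123456
>>> id(x) == id(y)   # two separate objects (CPython-specific behavior)
False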
Transgression #2: Crossing boundaries
Another way we could compare two objects is by calling the Python C API that is the implementation underlying the == operator.
Here’s my first attempt:
>>> x = 123
>>> y = 123
>>> z = 456
>>> import ctypes
>>> pyapi = ctypes.PyDLL(None) # load symbols from the executable
>>> Py_EQ = 2 # copy/pasted from CPython headers
>>> pyapi.PyObject_RichCompareBool(id(x), id(y), Py_EQ)
1
>>> pyapi.PyObject_RichCompareBool(id(x), id(z), Py_EQ)
Segmentation fault (core dumped)
I could spend a bit more time figuring out how to use this API the correct way, but the failure mode is educational too: using C APIs is inherently more dangerous than Python, and pure Python x == y isn’t going to cause a segfault the vast majority of the time.
Moreover, I am again relying on implementation-specific details; in other implementations of Python (PyPy, Jython) this wouldn’t work.
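For what it’s worth, the crash most likely comes from ctypes guessing the argument types and truncating the pointer-sized id() values to a plain C int; declaring the signature explicitly is one way to make the call survive. A sketch, assuming CPython and the same pyapi handle as above:
>>> # Tell ctypes the real signature (CPython-specific, still transgressive):
>>> compare = pyapi.PyObject_RichCompareBool
>>> compare.argtypes = (ctypes.py_object, ctypes.py_object, ctypes.c_int)
>>> compare.restype = ctypes.c_int
>>> compare(x, y, Py_EQ)   # 123 == 123
1
>>> compare(x, z, Py_EQ)   # 123 == 456
0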
When transgression is necessary
As we’ve seen, abstraction boundaries are there for a reason, and in general we try not to transgress them:
- We avoid relying on implementation details.
- We try not to cross abstraction boundaries unnecessarily.
For programming that involves writing business logic, these are fine rules. But not all programming is about business logic.
Performance
In order to write fast software, you eventually need to understand not just the guarantees provided by an abstraction, but how it actually works.
Some examples:
- The memory abstraction provided by the operating system is that memory is a uniform array of bytes you can read and write. However, the way memory access is actually implemented means that linear scans of an array are much faster than random access—and even more so for large chunks of memory.
- Writing extra fast low-level code in a language like Rust requires building at least a minimal mental model of how the CPU works, e.g. that memory access is pretty slow compared to local stack variable access (because local variables often live in CPU registers).
- Understanding how Python represents objects can save you memory (see the sketch after this list).
- How Python was compiled can have a significant performance impact.
- Nelson Elhage talks about how building a completely new HTTP client using a feature almost no one uses can (in one specific situation) make everything run faster.
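To make the object-representation point concrete, here’s a minimal sketch: adding __slots__ to a class removes the per-instance __dict__, which shrinks each instance. Exact sizes vary by Python version and platform.

import sys

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

class SlimPoint:
    __slots__ = ("x", "y")   # no per-instance __dict__
    def __init__(self, x, y):
        self.x = x
        self.y = y

p, s = Point(1, 2), SlimPoint(1, 2)
# getsizeof() doesn't include the separate __dict__, so count it for Point:
print(sys.getsizeof(p) + sys.getsizeof(p.__dict__))  # larger
print(sys.getsizeof(s))                               # smaller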
Debugging and introspection
When things go wrong, sometimes understanding implementation details, or crossing abstraction boundaries, is the only way to figure out the problem. Some examples:
- You can use strace to trace system calls, for example tracing execve to see which subprocesses your process is launching. Last week I used this and discovered that certain Python standard library modules will run a subprocess when imported.
- You can use lsof to see which files and sockets your program has opened.
- You can get the current Python frame (with an implementation-specific, abstraction-violating API) and see which function is calling your code; a short sketch follows this list.
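Here’s a minimal sketch of that last trick using sys._getframe, a CPython API whose leading underscore warns you it’s an implementation detail:

import sys

def who_called_me():
    # Frame 0 is this function; frame 1 is whoever called it.
    caller = sys._getframe(1)
    return caller.f_code.co_name

def outer():
    print("called from:", who_called_me())

outer()  # prints: called from: outer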
Testing
Setting arbitrary attributes on arbitrary Python objects—modules, classes, functions, even builtins—is possible, but not something you usually want to do in real code.
Here, for example, I am overriding the builtin int() object:
>>> class myint(int):
... def __repr__(self):
... return "I am definitely not an int"
...
>>> __builtins__.int = myint
>>> int(123) + 1 == 124
True
>>> int(123)
I am definitely not an int
>>>
When you’re writing tests, however, this might be the easiest or sometimes the only way to test some code. This is useful enough, in fact, that Python provides an API for this specific use case.
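One such API is unittest.mock.patch in the standard library, which sets an attribute for the duration of a test and then restores it. A minimal sketch (the backup-tool command is made up for illustration):

from unittest.mock import patch
import subprocess

def run_backup():
    # In real code this would shell out to an expensive external command:
    subprocess.check_call(["backup-tool", "--all"])

# Temporarily replace subprocess.check_call; the original is restored
# automatically when the with-block exits:
with patch("subprocess.check_call") as mock_call:
    run_backup()
    mock_call.assert_called_once_with(["backup-tool", "--all"])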
Into the unknown
Normal programming stays within abstraction boundaries, and for good reasons: it’s safer, it’s less brittle, it’s where you write the business logic that ultimately makes software useful for people who aren’t programmers. But sometimes that’s not enough: sometimes you need to break those boundaries, sometimes you need to switch to transgressive programming. Crossing boundaries, though, can be disquieting, and sometimes intimidating.
The flip side is that learning how things work underneath is a superpower. Suddenly you can do things that would have otherwise been impossible—though of course, sometimes they weren’t possible for very good reasons. Breaking across abstraction boundaries feels like you’ve learned how to do magic.
A lovely representation of both of these feelings is the song “Into the Unknown” from Frozen 2. So give it a listen, and if you’d like to learn more about what lies underneath in Python, read some more of my articles on performance and memory usage.