Invasive procedures: Python affordances for performance measurement

by Itamar Turner-Trauring
Last updated 16 Feb 2023, originally created 17 Aug 2022

When your Python code is too slow, you need to identify the bottleneck that’s causing it: you need to understand what your code is doing. Luckily, beyond pre-existing profiling tools, there are also a variety of ways you can poke and prod Python programs to get a better understanding of what they’re doing internally.

This allows you to do one-time introspection, add profiling facilities to your program that you can turn on and off, build custom tools, and in general get a better understanding of what your program is doing.

Some of these affordances are quite awful, but that’s OK! Performance debugging is a different kind of coding than writing long-term maintainable code.

In this article we’ll cover:

Runtime object mutation (“monkey patching”).
Code patching.
Runtime mutation of C types.
Audit hooks.
sys._current_frames().
Profiling and tracing hooks.
And more!

The scope of this article

To keep this article from being too long:

The article omits operating-system specific facilities like Linux’s LD_PRELOAD, ptrace(), eBPF, /proc/<pid>/mem, and so on.
Instead, it will focus on Python-specific affordances, and specifically those supported by CPython, the default Python interpreter most people use. Some of these APIs will be supported by other Python implementations like PyPy; others will not.
Since even that list is too long, the article also omits affordances related to memory usage, as well as those that seem mostly useful for debugging rather than performance.

Future articles may cover some of the affordances that were omitted.

If I’ve left something out, please let me know and I’ll add it.

1. Runtime object mutation

Pretty much any Python object can be replaced with another object of your choice by just setting an attribute in the right place. This is sometimes known as “monkey patching”.

>>> import os
>>> os.listdir(".")
['somefile.txt']
>>> def mylistdir(path):
...     return ["LIES"]
...
>>> os.listdir = mylistdir
>>> os.listdir(".")
['LIES']

You can override a module:

>>> import sys
>>> sys.modules["os"]
<module 'os' from '/usr/lib/python3.10/os.py'>
>>> sys.modules["os"] = 123
>>> del os
>>> import os
>>> os
123

You can also replace methods on a class, and so on and so forth. The only caveat is that any references to the object that exist before you override it won’t be replaced. For example, if some other module already did from os import listdir, updating os.listdir after that will not impact the version in the module that did that import.
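To make that caveat concrete, here is a minimal sketch. The names saved_listdir and fake_listdir are made up for illustration; saved_listdir stands in for a reference that some other module grabbed with from os import listdir before the patch was applied.

```python
import os
from os import listdir as saved_listdir  # reference taken before patching

def fake_listdir(path="."):
    # Hypothetical replacement, just for the demonstration.
    return ["LIES"]

original_listdir = os.listdir
os.listdir = fake_listdir
try:
    # Looked up on the module at call time, so this sees the patch.
    patched_result = os.listdir(".")
    # The earlier reference still points at the original function.
    stale_is_original = saved_listdir is original_listdir
finally:
    os.listdir = original_listdir  # always undo the patch
```

So code that calls os.listdir(...) gets the replacement, while code holding its own reference keeps calling the original.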

Why is this useful for performance measurement? Imagine you want to know how many times a specific class instance is created. One way to do so is by overriding the class’ __init__:

from ipaddress import IPv4Address

counter = 0

original_init = IPv4Address.__init__

def override_init(*args, **kwargs):
    # Note this isn't thread-safe...
    global counter
    counter += 1
    return original_init(*args, **kwargs)

IPv4Address.__init__ = override_init

Other ideas:

Instead of a counter, you can just use a print() call combined with an ad-hoc profiling tool like counts.
You can use sys._getframe() to get the calling function if you want to see who is calling the function.
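As a rough sketch of the second idea, the override below records which function constructed each object. The parse_many function is a hypothetical caller invented for this example, and like the counter above, this isn't thread-safe.

```python
import sys
from collections import Counter
from ipaddress import IPv4Address

caller_counts = Counter()
original_init = IPv4Address.__init__

def counting_init(self, *args, **kwargs):
    # Frame 0 is counting_init itself; frame 1 is the Python function
    # that constructed the object.
    caller = sys._getframe(1)
    caller_counts[caller.f_code.co_name] += 1
    return original_init(self, *args, **kwargs)

def parse_many(addresses):  # hypothetical caller, for illustration only
    result = []
    for address in addresses:
        result.append(IPv4Address(address))
    return result

IPv4Address.__init__ = counting_init
try:
    parse_many(["127.0.0.1", "10.0.0.1"])
finally:
    IPv4Address.__init__ = original_init  # undo the patch

print(caller_counts)
```

You could also record caller.f_code.co_filename and caller.f_lineno if the function name alone is ambiguous.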

`builtins`

There are a number of built-in Python functions and types that you don’t need to import, like open() and list. They reside in the builtins module, which you can mutate too; this will change the builtins you call for all modules.

>>> open
<built-in function open>
>>> def myopen(*args):
...     print("FAKE")
...
>>> import builtins
>>> builtins.open = myopen
>>> open
<function myopen at 0x7f62a6995510>
>>> open("file.txt")
FAKE

2. Code patching

We noted before one of the limitations of monkey patching: any existing references are not changed. So if you want to change a function’s behavior with monkey patching, you need to swap out all references across all modules, which can be tricky or impossible.

Luckily, this is Python, so you can do lots of different bad things. In particular, you can change the code the function runs; all existing references, as well as future ones, will run the new code.

>>> def one():
...     return 1
...
>>> def two():
...     return 2
...
>>> one()
1
>>> two()
2
>>> one.__code__ = two.__code__
>>> one()
2
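Here is a small sketch showing that references created before the swap are affected too, and that you can undo the change by saving the original code object. All the names are hypothetical. Note that swapping __code__ only works if both functions have the same number of free (closure) variables; here neither has any.

```python
def slow_path():
    return "original"

def instrumented():
    return "patched"

# References that exist before patching, e.g. what another module would
# hold after `from mymodule import slow_path`.
alias = slow_path
callbacks = {"cb": slow_path}

original_code = slow_path.__code__
slow_path.__code__ = instrumented.__code__

# Every reference is the same function object, so all run the new code.
results = (slow_path(), alias(), callbacks["cb"]())

slow_path.__code__ = original_code  # restore the original behavior
restored = alias()
```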

`patchy`

For performance instrumentation, you probably just want to change or add a tiny bit of code and otherwise have the same behavior. One way to do so is with the patchy library, which lets you apply a diff to the function’s source code. That way you don’t have to rewrite all its code, and if the function’s source code changed, your patching will fail with a useful error message.

For example, we can patch IPv4Address.__init__ (or at least, the version in Python 3.10.4):

from ipaddress import IPv4Address
from patchy import patch

patch(IPv4Address.__init__, '''\
    @@ -14,5 +14,6 @@
             AddressValueError: If ipaddress isn't a valid IPv4 address.

         """
    +    print("IPv4Address created")
         # Efficient constructor from integer.
         if isinstance(address, int):''')

addr = IPv4Address("127.0.0.1")
print(repr(addr))

And when we run the example:

$ python example.py
IPv4Address created
IPv4Address('127.0.0.1')

Note that this will patch every call to IPv4Address.__init__, regardless of what module calls it.

3. Runtime mutation of C types with `forbiddenfruit`

Monkey patching doesn’t work on Python extension types that are implemented in C or some other low-level language:

>>> list.append = lambda self, i: print("Appending", i)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: cannot set 'append' attribute of immutable type 'list'

When Python fails us, we can just do terrible things with a memory-unsafe language like C, or a memory-unsafe API like ctypes.

In this case, the forbiddenfruit library has done all the heavy lifting for us:

>>> from forbiddenfruit import curse
>>> original_append = list.append
>>> def myappend(self, item):
...     print("Appending", item)
...     return original_append(self, item)
...
>>> curse(list, "append", myappend)
>>> l = [1, 2]
>>> l.append(3)
Appending 3
>>> l
[1, 2, 3]

4. Audit hooks

Whenever certain events happen, Python will create an audit event. You can listen to these audit events with a registered audit hook function, by using sys.addaudithook. For example, you can keep track of all file opens:

>>> def audit(event, args):
...     if event == "open":
...         print("Opened", args[0])
...
>>> import sys
>>> sys.addaudithook(audit)
>>> f = open("/etc/passwd")
Opened /etc/passwd

You can see the full list of built-in audit events in the audit events table in the Python documentation.
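If you don't know which events are relevant, one option is to count all of them by name and see what shows up. A minimal sketch; since audit hooks can't be removed once added, it uses a flag (recording, a name made up for this example) to switch the counting off:

```python
import os
import sys
from collections import Counter

event_counts = Counter()
recording = True  # hooks can't be removed, so use a flag instead

def count_events(event, args):
    if recording:
        event_counts[event] += 1

sys.addaudithook(count_events)

# Do something that raises audit events.
with open(os.devnull) as f:
    f.read()

recording = False
print(event_counts.most_common(5))
```

Keep the hook itself cheap, since it runs for every audit event in the process.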

Custom events

In addition to built-in audit events, you can also emit your own events, for example to keep track of how many times some function is called. You do so with sys.audit():

sys.audit("myevent", 1, 2)

If there are no audit hooks registered, this is quite cheap by Python standards:

$ python -m timeit -s "import sys" "sys.audit('myevent', 1, 2)"
10000000 loops, best of 5: 21.8 nsec per loop

(Thanks to David Reid for the idea.)
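Putting the two halves together, a sketch might look like the following. The event name "myapp.cache_miss" and the lookup function are hypothetical, chosen just to show the pattern of emitting an event in the code you care about and counting it in a hook.

```python
import sys
from collections import Counter

miss_counts = Counter()

def count_cache_misses(event, args):
    # Ignore every event except our custom one.
    if event == "myapp.cache_miss":  # hypothetical event name
        miss_counts[args[0]] += 1

sys.addaudithook(count_cache_misses)

def lookup(key, cache):  # hypothetical function being measured
    if key not in cache:
        sys.audit("myapp.cache_miss", key)
        cache[key] = key.upper()
    return cache[key]

cache = {}
for key in ["a", "b", "a", "c"]:
    lookup(key, cache)

print(miss_counts)
```

The sys.audit() call can stay in the code permanently; it only costs more once a hook is registered.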

C API

You can emit custom events with PySys_Audit and register a hook with PySys_AddAuditHook. As usual, using the C API should reduce the performance overhead significantly.

5. `sys._current_frames()`

You can use sys._current_frames() to get pointers to the current frame in all running threads. The frame is the object that keeps track of the current function call, so it will have an (indirect) reference to the current function, as well as to the locals in scope.

>>> import sys, time, threading
>>> def mythread():
...     time.sleep(100)
...
>>> threading.Thread(target=mythread).start()
>>> sys._current_frames()
{139693033064000: <frame at 0x7f0cd1a88cc0, file '<stdin>', line 2, code mythread>,
 139693046566912: <frame at 0x7f0cd1a88810, file '<stdin>', line 1, code <module>>}

This can, for example, be used to implement a simple sampling profiler.
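A very rough sketch of such a sampling profiler follows. It only records the innermost function of each other thread, and the busy_loop workload is a hypothetical stand-in for real code; a real profiler would walk frame.f_back to capture whole stacks.

```python
import sys
import threading
import time
from collections import Counter

def sample_threads(duration, interval=0.005):
    """Periodically record which function every other thread is running."""
    samples = Counter()
    own_id = threading.get_ident()
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        for thread_id, frame in sys._current_frames().items():
            if thread_id != own_id:  # skip the sampler's own thread
                samples[frame.f_code.co_name] += 1
        time.sleep(interval)
    return samples

def busy_loop(stop):  # hypothetical workload, for illustration only
    total = 0
    while not stop:  # `stop` is a list; empty means keep going
        total += 1
    return total

stop = []
worker = threading.Thread(target=busy_loop, args=(stop,))
worker.start()
samples = sample_threads(0.2)
stop.append(True)
worker.join()

print(samples.most_common(3))
```

Functions that show up in many samples are where the threads spend most of their time.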

6. Profiling and tracing hooks

sys.setprofile

sys.setprofile lets you register a Python function that will get called whenever a Python function is called or returns, and whenever a C function wrapped in Python is called or returns.

import sys

def profile(*args):
    print(args)

sys.setprofile(profile)

def g():
    return 2

def f():
    return g()

f()

And when we run it:

$ python example.py
...
(<frame at 0x7f3dc39bc, file 'example.py', line 11, code f>, 'call', None)
(<frame at 0x7f3dc3945, file 'example.py', line 8, code g>, 'call', None)
(<frame at 0x7f3dc3945, file 'example.py', line 9, code g>, 'return', 2)
(<frame at 0x7f3dc39bc, file 'example.py', line 12, code f>, 'return', 2)
...
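Printing every event gets noisy fast. As a sketch of something more targeted, the hook below counts calls per function, using the 'call' event for Python functions and the 'c_call' event for C functions (where the third argument is the C function object). The functions f and g are the same kind of toy example as above.

```python
import sys
from collections import Counter

call_counts = Counter()

def count_calls(frame, event, arg):
    if event == "call":
        # A Python function was called; the frame identifies it.
        call_counts[frame.f_code.co_name] += 1
    elif event == "c_call":
        # A C function was called; `arg` is the function object.
        call_counts[arg.__name__] += 1

def g():
    return len("ab")

def f():
    return g() + g()

sys.setprofile(count_calls)
try:
    f()
finally:
    sys.setprofile(None)  # stop profiling

print(call_counts)
```

This should show one call to f, two to g, and two to len. You may also see an entry for setprofile itself, from the call that turns profiling off.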

sys.settrace

sys.settrace works the same way as sys.setprofile, but it reports slightly different events: Python function calls and returns, opcodes, and lines. That gives you much more fine-grained tracing of execution, at the cost of even higher performance overhead.
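For example, here is a sketch of a tiny line-level profiler that counts how many times each line of one function runs. The collatz_steps function is a made-up workload, and lines are reported as offsets from the function's def line.

```python
import sys
from collections import Counter

line_hits = Counter()

def tracer(frame, event, arg):
    if frame.f_code.co_name != "collatz_steps":
        return None  # don't trace lines inside other functions
    if event == "line":
        # Offset of the executed line relative to the `def` line.
        line_hits[frame.f_lineno - frame.f_code.co_firstlineno] += 1
    return tracer  # keep receiving events for this frame

def collatz_steps(n):  # hypothetical workload, for illustration only
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps

sys.settrace(tracer)
try:
    result = collatz_steps(6)
finally:
    sys.settrace(None)  # stop tracing

print(result, sorted(line_hits.items()))
```

Starting from 6 the loop body runs 8 times, so the two lines inside the loop (offsets 3 and 4) each get 8 hits.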

C APIs

Both APIs have C equivalents: PyEval_SetProfile and PyEval_SetTrace. The main benefit is reduced performance overhead.

For details on using these see this helpful blog post by Ned Batchelder.

7. Custom profiling metrics with `cProfile`

The cProfile profiler built into Python uses PyEval_SetProfile to profile your code, emitting output that it can render as a table, or that you can further visualize with tools like SnakeViz.

One handy feature of cProfile is that you can use it with any metric of your choice, so long as it’s an increasing number. That means you can easily write custom cProfile-based profilers.
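As a sketch of what that can look like, the example below passes a custom timer to cProfile.Profile so that the "time" attributed to each function is actually a count of items processed. The items_processed counter and the functions around it are hypothetical; the timer and timeunit parameters are the documented way to plug in your own metric.

```python
import cProfile

items_processed = 0  # the custom metric: it must only ever increase

def process_item(item):
    global items_processed
    items_processed += 1
    return item * 2

def process_batch(items):
    results = []
    for item in items:
        results.append(process_item(item))
    return results

def run():
    process_batch(range(3))
    process_batch(range(2))

# The timer returns an integer, and timeunit=1.0 means one unit of
# "time" corresponds to one processed item.
profiler = cProfile.Profile(timer=lambda: items_processed, timeunit=1.0)
profiler.runcall(run)

metric_by_function = {
    entry.code.co_name: entry.totaltime
    for entry in profiler.getstats()
    if hasattr(entry.code, "co_name")  # built-ins have a string here
}
print(metric_by_function)
```

Here run, process_batch, and process_item are each charged a cumulative total of 5, since 5 items were processed inside each of them. The usual pstats reports work too, with the time columns reinterpreted as your metric.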

8. Custom frame evaluator

In Python 3.9 or later, or 3.7 or later if you’re willing to use extra-private APIs, you can replace the C function that wraps evaluating frames using the _PyInterpreterState_SetEvalFrameFunc API. Since the default frame function is still available, you can use this as a fast way to get notification of the start and finish of function calls. This is part of how the Sciagraph performance profiler works.

Unfortunately this also relies on internal details of CPython, especially in 3.11 where the guarantees for stability are even weaker. It’s unclear whether this API will still exist in Python 3.12.

For more details see PEP 523.

Why you should collect affordances

Many of the affordances we covered above are horrible hacks you shouldn’t use in production if at all possible. But here’s the thing: performance is a domain where the implementation details matter. It’s transgressive. So the tools you use for debugging performance problems aren’t necessarily the tools you’d use to write normal code.

That’s why it’s worth building a mental collection of all the affordances your platform gives you, including the horrible ones. Sooner or later you’ll hit a mysterious performance problem or bug. When that happens, the more tools you know about, the more ways you’ll have to try to diagnose the problem so that you can fix it.
