Not just CPU: writing custom profilers for Python

Let’s say your program is slow, and you’ve determined that it’s only partially due to CPU. How do you figure out which parts of the code are waiting for things other than CPU?

In this article you’ll learn how to write custom profilers, and in particular profilers that will help you pinpoint the places in your code where it just sits there waiting. Specifically, we’ll discuss:

  • What your program might be waiting for.
  • Profiling non-CPU time.
  • Profiling voluntary context switches.

What is your program waiting for?

During the times your program isn’t burning through CPU, it’s presumably waiting for something else. That something else could be:

  • Network communication. This might include DNS lookups, waiting for the other side to respond, waiting for all the data to download, and so on.
  • Your hard drive. Reading from disk can take time; so can writing to disk. Sometimes reads and writes go straight to a RAM cache, which is faster, but sometimes they go to disk, which is slow.
  • Locks. Your program might be waiting for a threading or process lock.
  • Sleeping. Sometimes your program will deliberately sleep, e.g. to back off on retries.

So how do you find places where these operations are occurring?

Technique #1: Non-CPU time

Python’s built-in cProfile profiler can profile using any counter that goes up, and so you can take advantage of that fact to build a profiler of non-CPU time.

As I discussed in a previous article, your operating system can tell you how many CPU seconds your process used.

Let’s assume a single-threaded program, since multi-threaded programs are harder to profile and reason about. If your program has been running for 9 seconds, and used 7.5 CPU seconds, that means it spent 1.5 seconds waiting.
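
You can see the difference yourself with a quick sketch (Unix-specific, since os.times().elapsed isn’t filled in on Windows) that compares CPU seconds to wall-clock seconds across a sleep:

import os
import time

def cpu_and_elapsed():
    # os.times() reports CPU seconds (user + system) and wall-clock
    # seconds since an arbitrary fixed point; only differences matter.
    t = os.times()
    return t.user + t.system, t.elapsed

cpu_before, elapsed_before = cpu_and_elapsed()
time.sleep(1)  # waiting, not computing
cpu_after, elapsed_after = cpu_and_elapsed()

print("CPU seconds used:", cpu_after - cpu_before)            # approximately 0
print("Wall-clock seconds:", elapsed_after - elapsed_before)  # approximately 1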

First, you construct a timer that measures non-CPU time:

import os

def not_cpu_time():
    # os.times() reports the CPU time (user + system) used by the process
    # and elapsed wall-clock time; the difference is time spent not
    # running on the CPU.
    times = os.times()
    return times.elapsed - (times.system + times.user)

And then you create a profiler that profiles using this new measure of time:

import cProfile, pstats

def profile_not_cpu_time(f, *args, **kwargs):
    # Tell cProfile to use our non-CPU-time counter instead of a clock.
    prof = cProfile.Profile(not_cpu_time)
    prof.runcall(f, *args, **kwargs)
    result = pstats.Stats(prof)
    result.sort_stats("time")
    result.print_stats()

And now you can profile various functions:

>>> from urllib.request import urlopen
>>> profile_not_cpu_time(
...     lambda: urlopen("https://pythonspeed.com").read())

ncalls  tottime  percall  filename:lineno(function)
    3    0.050    0.017   _ssl._SSLSocket.read
    1    0.040    0.040   _socket.getaddrinfo
    1    0.020    0.020   _socket.socket.connect
    1    0.010    0.010   _ssl._SSLSocket.do_handshake
  342    0.010    0.000   str.find
  192    0.010    0.000   list.append

Most of the waiting time was spent reading from the socket, but some time was spent doing a DNS lookup (getaddrinfo), doing the TCP handshake (connect), and doing the TLS/SSL handshake.

And since we were careful not to include CPU time, we know this is purely waiting time, not computation.

Note: Why is there time recorded for str.find and list.append? They don’t do any waiting, so plausibly this is a situation where the whole process wasn’t running, perhaps because some other process was scheduled to run, or because some memory had to be loaded from swap. That means some time elapsed, but no CPU time did.

I also saw a small amount of negative elapsed time, which suggests some inconsistency between the elapsed and CPU time measurements, but for less-trivial programs I wouldn’t expect it to have a significant impact.

Technique #2: Voluntary context switches

The problem with measuring elapsed time is that it can vary based on things that aren’t under the program’s control. Sometimes DNS queries will be slower, sometimes downloads will be slower. So it would be useful to also have a more consistent measure that isn’t tied to how slow the external universe is.

One way to do that is to measure how many operations the process did that required a wait: counting the number of waits, rather than the time spent waiting.

Here’s how you do it. When your process stops using the CPU, it can be for two reasons:

  1. Every time a process does an operation that isn’t going to succeed immediately—reading from a socket, sleeping, and so on—it is basically telling the operating system “wake me when I can proceed.” This is a “voluntary context switch”: the CPU can switch to running a different process until data is available on the socket, the sleep is over, and so on.
  2. An “involuntary context switch” is when the operating system stops running a process temporarily so that another process can use the CPU.

For our purposes, then, we want to profile using voluntary context switches.
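
To get a feel for what this counter looks like before wiring it into a profiler, here’s a minimal sketch using psutil’s num_ctx_switches() to watch the voluntary count go up across a sleep:

import time
import psutil

proc = psutil.Process()  # the current process

before = proc.num_ctx_switches().voluntary
time.sleep(1)  # a blocking wait: the process voluntarily gives up the CPU
after = proc.num_ctx_switches().voluntary

print("Voluntary context switches during sleep:", after - before)  # usually 1 or more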

You can write a profiler that measures voluntary context switches by using the psutil library:

import psutil

_current_process = psutil.Process()

def profile_voluntary_switches(f, *args, **kwargs):
    # The counter is now the number of voluntary context switches the
    # process has made so far, as reported by psutil.
    prof = cProfile.Profile(
        lambda: _current_process.num_ctx_switches().voluntary)
    prof.runcall(f, *args, **kwargs)
    result = pstats.Stats(prof)
    result.sort_stats("time")
    result.print_stats()

And then you can profile the network code again:

>>> profile_voluntary_switches(
...     lambda: urlopen("https://pythonspeed.com").read())

ncalls  tottime  percall  filename:lineno(function)
     3    7.000    2.333  _ssl._SSLSocket.read
     1    2.000    2.000  _ssl._SSLSocket.do_handshake
     1    2.000    2.000  _socket.getaddrinfo
     1    1.000    1.000  _ssl._SSLContext.set_default_verify_paths
     1    1.000    1.000  _socket.socket.connect

Now instead of time spent waiting, you see the number of times voluntary context switches occurred.

Note that occasionally you’ll see voluntary context switches in unexpected places; my guess is that’s due to page faults, i.e. swapped-out memory being loaded from disk.

Profile all the things

Using this profiling technique has some costs: it’ll slow down your program a lot because of the additional overhead. In most cases this shouldn’t distort the results too much, however, because you’re not profiling CPU.

In general, you can profile any number that goes up. For example:

  • The number of read operations (psutil.Process().io_counters().read_count) and write operations (psutil.Process().io_counters().write_count); see the sketch below.
  • On Linux, the total number of bytes read and written (psutil.Process().io_counters().read_chars and write_chars).
  • Memory allocation for memory profiling (this would require some work, but it should be possible by using jemalloc).

See the psutil documentation for more information about the first two.
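
As an illustration of the first of those, here’s a sketch of a profiler that counts read operations, following the same pattern as above (io_counters() is available on Linux and Windows, but not on macOS):

import cProfile, pstats
import psutil

_current_process = psutil.Process()

def profile_read_ops(f, *args, **kwargs):
    # Use the cumulative count of read operations as the "timer", so each
    # function gets charged with the reads that happened while it ran.
    prof = cProfile.Profile(
        lambda: _current_process.io_counters().read_count)
    prof.runcall(f, *args, **kwargs)
    result = pstats.Stats(prof)
    result.sort_stats("time")
    result.print_stats()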

If you have other ideas of things to profile, send me an email and I’ll add them to the list.