Beyond cProfile: Choosing the right tool for performance optimization
Note: This article is based on a talk I gave at PyGotham 2019. You can watch the video here.
Your Python program is too slow. Maybe your web application can’t keep up, or certain queries are taking a long time. Maybe you have a batch program that takes hours or even days to run.
How do you speed up?
The basic process you’re likely to follow is:
- Choose the right tool to measure speed.
- Use the tool to figure out the bottleneck.
- Fix the bottleneck.
This article will focus on the first step: choosing the right tool. And in particular, it will cover:
- cProfile: The Python standard library deterministic profiler.
- Pyinstrument: A sampling profiler.
- Eliot: A logging library.
I won’t go into a huge amount of detail about how to use these tools, because the goal is to help you choose the right one. But I will explain what these tools do, and when and why you would choose one over the other.
cProfile: A deterministic profiler
The cProfile profiler is built-in to Python, so you’ve probably heard of it, and it may be the default tool you use. It works by tracing every function call in a program. That’s why it’s a deterministic profiler: if you run it with the same inputs it’ll give the same output.
By default cProfile measures process CPU—how much CPU your process has used so far. You should always ask what your profiler is measuring, because different measures can detect different problems. Measuring CPU means you can’t detect slowness caused by other causes, like waiting for a response from a database query.
While cProfile is always available in your Python install, it also has some problems—and as you’ll see, most of the time you don’t want to use it.
Using cProfile is pretty easy. If you have a script you usually run directly like so:
$ python benchmark.py 7855 messages/sec
Then you can just prefix
python -m cProfile to run it under the profiler:
$ python -m cProfile benchmark.py 1444992 function calls (1443814 primitive calls) in 1.208 seconds Ordered by: cumulative time ncalls tottime percall cumtime percall filename:lineno(function) 1 0.000 0.000 1.208 1.208 benchmark.py:7(<module>) 1 0.048 0.048 1.151 1.151 benchmark.py:21(run) 50000 0.031 0.000 0.657 0.000 _output.py:182(write) 50000 0.045 0.000 0.618 0.000 _output.py:73(send) 50000 0.064 0.000 0.566 0.000 _output.py:452(__call__) ... a million more lines of output ...
There’s also a Python profiling API, so you can profile particular functions in a Python interpreter prompt or a Jupyter notebook.
The output format is a table, which isn’t ideal: each row is a function call that ran during the profiled time span, but you don’t know how that function call is related to other function calls. So if you have a function that can be reached from multiple code paths, it can be difficult to figure out which code path was responsible for calling the slow function.
What cProfile can tell you
If you look at the table above, you can see that:
_output.py(__call__)was called 50,000 times. It’s an even number because this is a benchmark script that runs the same code in a loop 10,000 times. If you weren’t deliberately calling a function many times, this can be useful for spotting high number of calls is useful for identifying busy inner loops.
_output.py(send)used 0.618 CPU seconds (including CPU time of functions it called), and 0.045 CPU seconds (not including functions it called).
Using this information you can spot slow functions you can optimize—slow as far as CPU goes, anyway.
How it works
cProfile measures every single function call. In particular, every function call in the run gets wrapped like so:
start = process_time() try: f() finally: elapsed = process_time() - start
The profiler records CPU time at the start and finish, and the difference is allocated to that function’s account.
The problems with cProfile
While cProfile is always available in any Python install, it also suffers from some significant problems.
Problem #1: High overhead and distorted results
As you can imagine, doing extra work for every function call has some costs:
$ python benchmark.py 7855 messages/sec $ python -m cProfile benchmark.py 5264 messages/sec ... cProfile output ...
Notice how much slower the cProfile run is. And what’s worse, the slowdown isn’t uniform across your program: because it’s tied to number of function calls, parts of your code with more function calls will be slowed down more. So this overhead can distort the results.
Problem #2: Too much information
If you’ll recall the cProfile output we saw above, it includes a row for every single function that was called during the program run. Most of those function calls are irrelevant to our performance problem: they run quickly, and only once or twice.
So when you’re reading cProfile output, you’re dealing with a lot of extra noise masking the signal.
Problem #3: Offline measure of performance
Quite often your program will only be slow when run under real-world conditions, with real-world inputs. Maybe only particular queries from users slow down your web application, and you don’t know which queries. Maybe your batch program is only slow with real data.
But cProfile as we saw slows do your program quite a bit, and so you likely don’t want to run it in your production environment. So while the slowness is only reproducible in production, cProfile only helps you in your development environment.
Problem #4: Performance is measured for functions only
cProfile can tell you “
slowfunc() is slow”, where it averages out all the inputs to that function.
And that’s fine if the function is always slow.
But sometimes you have some algorithmic code that is only slow for specific inputs. It’s quite possible that:
cProfile will not be able to tell you which inputs caused the slowness, which can make it more difficult to diagnose the problem.
cProfile: Usually insufficient
As a result of these problems, cProfile shouldn’t be your performance tool of choice. Instead, next we’ll be talking about two alternatives:
Pyinstrument: a sampling profiler
Pyinstrument solve two of the problems we covered above:
- It has lower overhead than cProfile, and it doesn’t distort the results.
- It filters out irrelevant function calls, so there’s less noise.
Pyinstrument measures elapsed wallclock time, not CPU time, so it can catch slowness caused by network requests, disk writes, lock contention, and so on.
How you use it
Using Pyinstrument is similar to cProfile; just add a prefix to your script:
$ python benchmark.py 7561 messages/sec $ python -m pyinstrument benchmark.py 6760 messages/sec ... pyinstrument output ...
Notice it has some overhead, but not as much as cProfile—and the overhead is uniform.
Pyinstrument also has a Python API, so you can use it to profile particular pieces of code in a Python interactive interpreter or a Jupyter notebook.
Pyinstrument’s output is a a tree of calls, measuring wall clock time:
1.506 <module> benchmark.py:7 ├─ 1.477 run benchmark.py:21 │ ├─ 0.604 attach myapp/_code.py:52 │ │ └─ 0.598 write myapp/_code.py:130 │ │ ├─ 0.557 final_write myapp/_foo.py:181 ...
Unlike cProfile’s row-per-function, Pyinstrument gives you a tree of function calls, so you can see the context of the slowness. A function might appear multiple times if the slowness is caused by multiple code paths.
As a result, Pyinstrument’s output is much easier to interpret, and gives you much better understanding of your program’s performance structure than cProfile’s default output.
How it works (cat edition)
Imagine you have a cat. You wish to know how that cat spends its time.
You could spy on its every moment, but that would be a lot of work. So instead you decide to takes samples: every 5 minutes you stick your head into the room where the cat is, and write down what it’s doing.
- 12:00: Sleeping 💤
- 12:05: Sleeping 💤
- 12:10: Eating 🍲
- 12:15: Using the litterbox 💩
- 12:20: Sleeping 💤
- 12:25: Sleeping 💤
- 12:30: Sleeping 💤
A few days later you can summarize your observations:
- 80%: Sleeping 💤
- 10%: Eating 🍲
- 9%: Using the litterbox 💩
- 1%: Staring longingly through the window at birds 🐦
So how accurate is this summary? Insofar as your goal is to measure where the cat spent most of its time, it’s probably accurate. And the more frequent the observations (==samples) and the more observations you make, the more accurate the summary.
If your cat spends most of its time sleeping, you would expect most sampled observations to show that it’s sleeping. And yes, you’ll miss some quick and rare activities—but for purposes of “what did the cat spend most of its time on” those quick and rare activities are irrelevant.
How it works (software edition)
Much like our cat, Pyinstrument samples a Python program’s behavior at intervals: every 1ms it checks what function is currently running. That means:
- If a function is cumulatively slow, it’ll show up often.
- If a function is cumulatively fast, we usually won’t see it at all.
That means our performance summary has less noise: functions that are barely used will mostly be skipped. But overall the summary is pretty accurate in terms of how the program spent its time, so long as we took enough samples.
Eliot: A logging library
The final tool we’ll cover in detail is Eliot, a logging library that I wrote. It solves the other two problems we saw with cProfile:
- Logging can be used in production.
- Logging can record arguments to functions.
As you’ll see, Eliot provides some unique capabilities that make it better at recording performance than normal logging libraries. That being said, with some extra work you can get the same benefits from other logging libraries as well.
Adding logging to existing code
Consider the follow sketch of a program:
def add(a, b): # ... implementation ... def multiply(a, b): # ... implementation ... def multiplysum(a, b, c): return multiply(add(a, b), c) print(multiplysum(1, 2, 4))
We can take this code and add some logging to it:
from eliot import log_call, to_file to_file(open("out.log", "w")) @log_call def add(a, b): # ... implementation ... @log_call def multiply(a, b): # ... implementation ... # etc.
Specifically, we do two things:
- Tell Eliot where to output the log messages (in this case, a file called “out.log”).
- We decorate the functions with a
@log_calldecorator. This will log the fact the function was called, its arguments, and the return value (or raised exception).
Eliot has other, more fine-grained APIs, but
@log_call suffices for demonstrating the benefits of logging.
Once we run the program, we can look at the logs using a tool called eliot-tree:
$ python calc.py 12 $ eliot-tree out.log ─── multiplysum (inputs a=1 b=2 c=4) ⧖ 10.0s ├── add (inputs a=1 b=2) ⧖ 1.0s │ └── result: 3 ├── multiply (inputs a=3 b=4) ⧖ 9.0s │ └── result: 12 └── result: 12
Notice that, a little like Pyinstrument, we’re looking at a tree of actions. I’ve simplified the output a little—originally so it could fit on a slide I used in the talk version of this article—but even in a prose article it lets us focus on the performance aspect.
In Eliot, each action has a start and a finish, and can start other actions—thus the resulting tree. Since we know when each logged action starts and finishes, we also know how long it took.
In this case, each action maps one-to-one with a function call. And there are some differences from Pyinstrument’s output:
- Instead of combining multiple function calls, you see each individual call separately.
- You can see the arguments and return result of each call.
- You can see the elapsed wallclock time of each action.
For example, you can see that
multiplysum() took 10 seconds, but the vast majority of the time was spent in
multiply(), with the inputs of 3 and 4.
So you immediately know that for performance optimization you want to focus on
multiply(), and you have some starting inputs (3 and 4) to play around with.
The limitations of logging
Logging isn’t sufficient on its own as a source of performance information.
First, you only get information from code where you’ve explicitly added logging calls. A profiler can run on any arbitrary piece of code with no advance preparation, but with logging you have to do some work in advance.
If you didn’t add logging code, you won’t get any information. Eliot makes this a bit better, since the tree-of-actions structure gives you some sense where time was spent, but it’s still not sufficient if logging is too sparse.
Second, you can’t add logging everywhere because that will slow down your program. Logging isn’t cheap—it’s higher overhead than cProfile. So you need to add it judiciously, in key points where it will maximize the information it gives without impacting performance.
Choosing the right tools
So when should you use each tool?
1. Always add logging
Any non-trivial program probably needs some logging, if only to catch bugs and errors. And if you’re already adding logging, you can take the trouble to log the information you need to do performance debugging as well.
Eliot makes it easier, since logging actions inherently gives you elapsed time, but you can with some extra work do this with any logging library.
2. Next, use a sampling profiler like Pyinstrument
Logging might help you spot the specific place your program is slow, and at the very least some inputs causing slowness, but it’s often insufficient. So the next step is to use a profiler, and in particular a sampling profiler like Pyinstrument:
- It has low overhead, and more importantly doesn’t distort the results.
- It measures wallclock time, so it doesn’t assume CPU is bottleneck.
- It only outputs the slower functions, omitting the irrelevant fast functions.
3. Use cProfile if you need a custom cost metric
If you ever need to write a custom profiler, cProfile allows you to plug in different cost functions, making it an easy tool for measuring more unusual metrics.
You can measure:
- Not-CPU, all the time spent waiting for non-CPU events.
- The number of voluntary context switches, i.e. the number of system calls that take a long time.
- Memory allocations.
- More broadly, any counter that goes up.
As a good starting point on performance optimization tools, I suggest that you:
Lack of programming skills getting in the way of building data science models? Make your data science team more productive, by improving their software engineering skills with my training classes.
How do you process large datasets with limited memory?
Get a free cheatsheet summarizing how to process large amounts of data with limited memory using Python, NumPy, and Pandas.
Plus, every week or so you’ll get new articles showing you how to process large data, and more generally improve you software engineering skills, from testing to packaging to performance: