Python gives you fast development—and slow code

When you’re trying to understand new data for the first time, Python is ideal for quick, interactive exploration. Whether you’re cleaning up messy data or prototyping different analyses, Python is easy to write and easy to experiment with.

But eventually your bottleneck shifts from your ability to come up with new ideas to how fast your code can run:

  • If it takes an hour to run an experiment, you can only run a handful of experiments every day.
  • If you need results in 30 seconds, ten minutes is far too long.
  • If you burn your whole budget on cloud computing, how will your employer pay your salary?

What do you do if your Python code is

way

too

slow?

There are many ways to speed up your code—

One common approach to speeding up Python code is to use an extension written in a compiled language, like Pandas or NumPy. But what do you do if your Pandas code is too slow?

Another approach is parallelism: use multiple threads or processes to take advantage of multiple CPU cores. But what happens when you run out of CPU cores? And if you’re operating at scale, can you afford to get faster results by ramping up your compute costs?

And if there are multiple approaches, where should you start? Which one is the best? What else can you do?

—here’s how you can organize and prioritize them

We’ve already learned one thing: switching to a compiled language and using parallelism both speed up your code, but they do so in completely different ways. Often you can use both and get a multiplicative speedup!

More broadly, there are multiple, fundamentally different ways to speed up your code, each requiring its own set of skills and knowledge. I’ll call these practices.

In which order should you implement these practices?

In simple situations, you just need to know what your options are

Once you have a mental list of potential practices, in simple cases choosing the next one to apply may be straightforward.

Trying to do large amounts of math in pure Python? You probably should switch to using a compiled extension, or maybe write a custom extension in a compiled language. That’s the Practice of Compilation.
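
As a tiny sketch of the difference (the function names here are invented for illustration), the same arithmetic can run through the Python interpreter, or inside NumPy’s compiled loops:

```python
import numpy as np

def mean_of_squares_python(values):
    # Pure Python: every multiply and add is dispatched by the interpreter.
    total = 0.0
    for v in values:
        total += v * v
    return total / len(values)

def mean_of_squares_numpy(arr):
    # NumPy: the same arithmetic runs in a compiled loop written in C.
    return float(np.mean(arr * arr))

data = list(range(1, 1001))
arr = np.array(data, dtype=np.float64)
# Both give the same answer; the NumPy version is typically
# much faster on large inputs.
```

The exact speedup depends on the workload, but for bulk numeric operations the compiled version usually wins by a wide margin.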

Already using a highly optimized compiled version of an algorithm, but it’s single-threaded? The Practice of Parallelism is the next step, using multiple threads or multiple processes.
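
A minimal sketch of that step, with a made-up per-item function: a pool of workers, each handling one input at a time. Threads suit I/O-bound work or compiled extensions that release the GIL; for pure-Python CPU-bound work you’d swap in `ProcessPoolExecutor`, which has the same `map()` interface.

```python
from concurrent.futures import ThreadPoolExecutor

def handle_item(text):
    # Stand-in for the real single-threaded work done per input.
    return len(text.strip())

def process_all(items):
    # The pool distributes inputs across workers; the algorithm
    # itself doesn't need to know parallelism is happening.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(handle_item, items))
```

Because `map()` preserves input order, the results come back as if you had processed the inputs sequentially.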

A process for complex situations

For complex situations where you need the fastest code possible, you can apply multiple practices, one by one:

Step 1: The Practice of Process: The starting point is process: setting up the activities, tools, and procedures you will need to speed up your code. You need to:

  • Ensure your optimizations don’t break your code, for example with automated tests.
  • Ensure your code doesn’t slow down again in the future, for example with comparative speed benchmarks that can catch slowdowns in CI.
  • Figure out how to repeatedly and accurately measure the speed of your code.
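
For that last point, Python’s built-in `timeit` module is one way to get repeatable numbers; the function being timed here is a made-up placeholder:

```python
import timeit

def parse_record(line):
    # Placeholder for the code whose speed you actually care about.
    fields = line.split(",")
    return {"name": fields[0], "value": int(fields[1])}

# Run the call many times, repeating the whole measurement several
# times; the minimum is the least contaminated by background noise.
timings = timeit.repeat(
    lambda: parse_record("widget,42"), number=10_000, repeat=5
)
best = min(timings) / 10_000  # seconds per call
```

Recording numbers like `best` over time, ideally in CI, is what lets you catch regressions before they ship.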

Step 2: The Practice of Efficiency: Regardless of the programming language you use, you may have code that results in wasted effort.

  • Perhaps your code calculates the same piece of information many times instead of just once.
  • Or, maybe it calculates some useless information and then throws it away.

Not only will fixing these inefficiencies speed up your code, it will also help you build an intuitive model of where your program is spending its time.
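
A tiny illustration of the first kind of waste, using invented functions: the maximum below is recomputed for every element, when computing it once is enough.

```python
def normalize_naive(values):
    # max(values) is re-evaluated on every iteration of the
    # comprehension: O(n) extra work per element, O(n^2) overall.
    return [v / max(values) for v in values]

def normalize_efficient(values):
    # Hoist the computation out of the loop: same answer, O(n) total.
    peak = max(values)
    return [v / peak for v in values]
```

Fixing this kind of inefficiency requires no new language or library, which is why it comes before the later, more invasive practices.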

Step 3: The Practice of Parallelism: While you might not want to implement multi-threaded or multi-process parallelism immediately at this point, it is at the very least worth thinking about where in your code or application parallelism might be implemented. For example:

  • If you have many inputs that can be processed independently, you can run a pool of workers, each processing a single input in a single thread; none of the algorithms need to be parallel internally.
  • On the other hand, if parallelism will need to be integrated into your algorithm, this can affect how you design and implement later steps.

Step 4: The Practice of Compilation: Switching to a compiled Python extension, or writing custom compiled code, can speed up your code compared to using plain old Python. If you’re writing your own extension, the code you end up porting to a compiled language will be the more efficient version you created in step 2.

Step 5: The Practice of Mechanical Sympathy: CPUs have a variety of performance features like instruction-level parallelism that are less impactful when writing Python. But now that you’ve switched to a compiled language, these effects become stronger.

By ensuring your code is not fighting against the hardware’s fast paths, small tweaks to your code can result in significant speedups. Some of these tweaks, like utilizing memory caches effectively, can impact some of the tuning you’ll do in the next step.
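
As one illustration of cache effects (actual speedups vary with hardware and array size), consider summing a NumPy array along its memory layout versus against it:

```python
import numpy as np

matrix = np.ones((2_000, 2_000))  # row-major (C order) by default

def sum_by_rows(m):
    # Walks memory sequentially, so the CPU caches and prefetcher
    # work in our favor.
    return sum(float(row.sum()) for row in m)

def sum_by_columns(m):
    # Each "column" strides across memory, causing many more cache
    # misses; this is usually measurably slower despite doing the
    # same arithmetic.
    return sum(float(col.sum()) for col in m.T)
```

Both functions return the same total; only the memory access pattern differs, and that alone is enough to change the running time.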

Step 6: The Practice of Parallelism, again: If you haven’t already implemented parallelism in an earlier step, next you will want to make sure your application can take advantage of multiple CPU cores. Depending on your application and libraries this might be implemented at a high level, with different processes handling different input files, or perhaps at a very low level deep inside an algorithm.

Learning the practices of performance

Modern CPUs are fast, and paradoxically that has resulted in most of the code we write being slow. By applying the practices of performance, you can utilize the full speed made available to you by your hardware.

In all, there are four practices (Efficiency, Compilation, Mechanical Sympathy, and Parallelism) that can speed up your code directly. The Practice of Process helps indirectly, by ensuring your efforts actually succeed, and that your code doesn’t degrade in the future.

Of course, you may need to learn how to apply them. You can read the appropriate generic books (they’ll usually expect you to know how to write C), cobble together a reading list of blog posts, and then figure out how to apply all this in the context of Python code.

It would be much better if you could just write faster code and get back to analyzing your data. To help you do that as quickly as possible, I’m working on a book that will teach you how to:

  • Optimize your code, whatever its language, by applying the Practice of Efficiency.
  • Speed up your code by switching to a compiled language, and work around the limits of the compiler when applying the Practice of Compilation.
  • Speed up your compiled code even more with the Practice of Mechanical Sympathy, by taking advantage of CPU features like instruction-level parallelism, branch prediction, SIMD, and memory caches. (It’s OK if you don’t yet know what those are! In the past, neither did I. Learning about them is what the book is for.)
  • Take advantage of multiple CPU cores with the Practice of Parallelism.
  • Apply the Practice of Process to ensure you are measuring the right thing, that your code doesn’t get slower, and that you don’t break your code while optimizing it.

Interested? Sign up below to get updates on the book, and a weekly article on Python performance, handling larger-than-memory datasets, Docker packaging, and more.

Get notified when the book comes out!

Plus, get weekly emails on speeding up Python data processing, processing larger-than-memory datasets, Docker packaging for Python, and more.

More about the book

What it’s going to cover:

  • The main focus will be speeding up data processing, and especially numeric computing, the kind of calculations you’d do as a data scientist, scientist, or research software engineer.
  • You won’t have to know C, Cython, or Rust to read the book—just knowing some Python is good enough.
  • However, your new knowledge about the Practices of Compilation and Mechanical Sympathy will apply to C, Cython, or Rust, or any low-level compiled language you happen to be using.

Status updates

July 2, 2025

The book draft is currently at 43,000 words. It has gone through a few rounds of early-reader review, and I am now converting it to the structure implied by the Practices of Performance, which has involved writing a number of new chapters and restructuring some old ones.

The section on process has some significant gaps, and the section on parallelism is still missing; the other three practices have their core content and just need restructuring or polish.
