Optimizing your code is not the same as parallelizing your code

You’re processing a large amount of data with Python, the processing seems easily parallelizable—and it’s sloooooooow.

The obvious next step is to switch to some sort of multiprocessing, or even to start processing data on a cluster so you can use multiple machines. Obvious, but often wrong: switching straight to multiprocessing, and even more so to a cluster, can be a very expensive choice in the long run.

In this article you’ll learn why, as we:

  1. Consider two different goals for performance: faster results and reduced hardware costs.
  2. See how different approaches achieve those goals.
  3. Suggest a better order for many situations: performance optimization first, only then trying parallelization.

Faster results vs. lower hardware costs

When it comes to speeding up your software, there are actually two different goals you might be aiming for:

  • Faster results: In general, waiting an hour is a lot worse than waiting for a minute, and in some problem domains you have specific requirements for how fast you get results.
  • Lower hardware costs: Slow results can often be solved by buying or renting more expensive hardware… but that requires more money. So you might want to speed up your software in order to reduce your hardware costs.

In an ideal world you would get fast results with very little money; in the real world, you are often forced to trade off between the two depending on the specifics of your situation. We’ll consider specific situations and resulting tradeoffs later on. For now, let’s consider how particular techniques can help you achieve these two goals.

Parallelism, whether on multiple CPUs or multiple machines, can only give you faster results. Switching from one CPU to four CPUs might give you close to a 4× speedup for embarrassingly parallel problems—but now you have to pay for 4× as many CPUs.

Optimizing your code can give you both faster results and lower costs. If you manage to speed up your code so it runs twice as fast on a single CPU as it did before, you can:

  • Get results twice as fast,
  • or you can cut your hardware costs in half when you scale up processing,
  • or you can get somewhat faster results together with somewhat lower hardware costs.

To summarize:

                  Faster results    Lower hardware costs
    Parallelizing       ✓           ❌ (but see note)
    Optimizing          ✓                  ✓

Note: This hardware cost model is a simplification. Hardware costs don’t necessarily go up linearly with number of processors, for example because processors aren’t the only hardware in a computer. You might be able to double the processors for less than double the money.

Some scenarios with corresponding goals

Depending on your particular situation, you might care about faster results, hardware costs, or perhaps both. Let’s consider two contrasting examples; throughout I am assuming that your problem is amenable to parallel processing.

One-time processing on your computer. If you’re processing data just once, you could try optimizing it for speed, but that would take some of your expensive time. Hardware-wise, your existing computer is a sunk cost; it has multiple CPUs, and they’re often idle.

Chances are then that your main goal is faster results, and using parallelism to utilize all of your computer’s CPUs is an easy way to achieve that.

Repeat processing at scale. At the other extreme, you might be running the same processing pipeline on multiple batches of data, over and over again, at a large scale. Chances are that you will have certain speed requirements, and since you’re running at scale the hardware costs—whether purchased or rented on the cloud—can be significant.

And given the need to reduce hardware costs, parallel computing is not sufficient.

When you need to scale

Again, I am assuming here that your processing is relatively easy to parallelize.

Optimize first

If you are going to have to scale, your first focus should not be on scaling, it should be on speeding up the software on a single CPU on a single machine. Yes, you should consider approaches that will scale later on, but going straight to scaling—whether via multiprocessing or a cluster—can result in paying far higher hardware costs for no reason.

In many cases, by spending some time on optimization it’s possible to achieve significantly higher performance, and therefore correspondingly lower hardware costs.

Some examples:

  • Algorithmic optimization: Let’s say you need to find multiple string keys for each item in a long list of other strings.
    • A naive regex-based solution searching for key1|key2|key3|etc. will be quite slow.
    • Using the pyahocorasick library, which implements the Aho-Corasick algorithm, can give you a 25× speed-up.
    • Switching to the heavily optimized Rust Aho-Corasick library can give you a further 2× speedup, for a total of 50× speedup over the naive implementation.
  • Switching from Python to faster languages: The Pandas documentation gives an example where rewriting a function in Cython gives a 200× speedup.
  • Environmental configuration: You can speed up database-backed tests by disabling syncing to disk; in some situations a similar approach can speed up disk-I/O-heavy data processing.
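To make the algorithmic-optimization example concrete, here is a minimal sketch of the naive starting point: one big alternation regex scanned against each string. The keys and lines are invented for illustration. In real code, this is the part you would replace with an Aho-Corasick automaton (e.g. from the pyahocorasick library), which matches all keys in a single linear pass instead of relying on the regex engine’s alternation.

```python
import re

# The keys we want to find; invented for this example.
keys = ["error", "warning", "timeout"]

# Naive approach: search for key1|key2|key3|... in every string.
# re.escape guards against keys containing regex metacharacters.
pattern = re.compile("|".join(re.escape(k) for k in keys))

lines = [
    "connection timeout after 30s",
    "all good",
    "warning: low disk space",
]

# Which lines contain at least one key?
matches = [bool(pattern.search(line)) for line in lines]
print(matches)  # [True, False, True]
```

This version is easy to write, which is exactly why it’s the common starting point—and why swapping in a purpose-built algorithm can yield the 25–50× speedups mentioned above without touching the rest of your pipeline.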

Once you’ve optimized your code, then you can start thinking about scaling—and with any luck your hardware costs will be much lower than they would have been when you started.

Multiple CPUs before a cluster

In a world where you can trivially rent a machine with 96 CPUs, switching to a full cluster is often a step backwards. On a single machine moving data around is quite cheap; in a cluster it can get expensive, or you will be limited to data distribution models that don’t necessarily match your problem.

So before you start thinking about how to scale to multiple machines, see how much you can do on a single machine. In many cases running a full end-to-end job on a single machine is much simpler than distributing a job across multiple machines.

Note: Whether or not any particular tool or technique will speed things up depends on where the bottlenecks are in your software.

Need to identify the performance and memory bottlenecks in your own Python data processing code? Try the Sciagraph profiler, with support for profiling both in development and production on macOS and Linux, and with built-in Jupyter support.

[Figure: A performance timeline created by Sciagraph, showing both CPU and I/O as bottlenecks]
[Figure: A memory profile created by Sciagraph, showing a list comprehension is responsible for most memory usage]

Speeding up your software is situation-specific

If you need to scale, and your problem is easily parallelizable, optimizing first and then scaling is usually the best approach. But of course that is just one situation, and the tradeoffs will be different in other situations.

We already discussed the example of one-off jobs running on your personal computer. Then there are problems where parallelizing is harder, in which case a fast single-CPU solution might be fundamentally different from a fast multiple-CPU solution, let alone a multiple-machine solution.

So whenever you’re thinking about speed, don’t just reach for a solution because you’ve used it before, or because you happen to have a Spark cluster on premises, or because you read an article about how great it works. Instead, consider the specifics of your situation, and what your goals are, and then decide how to approach the problem.