The Parallelism Blues: when faster code is slower
When you’re doing computationally intensive calculations with NumPy, you’ll want to use all your computer’s CPUs. Your computer has 2 or 4 or even more CPU cores, and if you can use them all then your code will run faster.
Except, of course, when parallelism makes your code run slower.
As it turns out, NumPy will transparently parallelize certain operations. And if you’re not careful, this can actually slow your code down.
Parallelism makes it faster
Consider the following program:
```python
import numpy as np

a = np.ones((4096, 4096))
a.dot(a)
```
If we run this under the `time` utility, we see:

```
$ time python dot.py

real    0m1.546s
user    0m4.171s
sys     0m0.537s
```
As I explain elsewhere in more detail, the real time is the wall-clock time, and the user time is CPU time. Since user time is higher than wall-clock time here, the operation must have used multiple CPUs in parallel, for a total of 4.17 CPU seconds.
So that’s great! If we’d only used one CPU, this operation would have taken ~4.2 seconds, but thanks to multiple CPUs it took only ~1.5 seconds.
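The wall-clock vs. CPU time distinction is worth seeing in miniature. Here’s a minimal sketch, using only the standard library rather than NumPy, showing how the two can diverge: `time.sleep()` consumes wall-clock time but almost no CPU time.

```python
import time

# time.perf_counter() measures wall-clock time;
# time.process_time() measures CPU time used by this process.
wall_start = time.perf_counter()
cpu_start = time.process_time()

time.sleep(0.3)  # waits, but burns (almost) no CPU

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start

print(f"wall: {wall_elapsed:.2f}s, cpu: {cpu_elapsed:.2f}s")
```

In the NumPy run above it’s the opposite divergence: CPU time exceeds wall-clock time, because multiple threads are burning CPU simultaneously.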
Parallelism makes it slower
Let’s verify that assumption.
We’ll tell NumPy to use only one CPU, by using the threadpoolctl library:
```python
import numpy as np
from threadpoolctl import threadpool_limits

with threadpool_limits(limits=1, user_api='blas'):
    a = np.ones((4096, 4096))
    a.dot(a)
```
And now when we run it:
```
$ time python dot_onecpu.py

real    0m3.654s
user    0m3.652s
sys     0m0.403s
```
When we used multiple CPUs it took ~4.2 CPU seconds, but with a single CPU it took ~3.7 CPU seconds. Measured by CPU time, the code is now faster!
Does this matter? Isn’t a faster wall-clock time what matters?
If you’re only running this one program on your computer, and you don’t have any other parallelism implemented for your program, then yes, this is fine. But if you are implementing some form of parallelism yourself, for example by using multiprocessing, joblib, or my personal favorite Dask, the default parallelism will make your program slower overall.
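To see why, consider a hypothetical worst case (the numbers here are illustrative, not taken from the benchmark above): if your own parallelism runs one worker per core, and each worker’s BLAS also spawns one thread per core, the threads multiply.

```python
# Illustrative oversubscription arithmetic: 8 worker processes on an
# 8-core machine, where each worker's BLAS also uses every core.
cores = 8
workers = cores                   # your own parallelism: one worker per core
blas_threads_per_worker = cores   # BLAS default: one thread per core
total_threads = workers * blas_threads_per_worker
print(total_threads)  # threads contending for just 8 cores
```

All those extra threads don’t add computation capacity; they just add scheduling and context-switching overhead on top of the per-call slowdown measured above.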
In this example, every `dot()` call will use 13% more of your overall CPU capacity.
You’ll notice in the code above that the thread pool limits referred to BLAS.
BLAS is a standard API for linear algebra libraries; NumPy uses a BLAS implementation for some of its operations, in this case the `dot()` matrix multiplication. There are different BLAS implementations available, and in the example above I was using OpenBLAS.
Another alternative is `mkl`, which is provided by Intel and therefore optimized for Intel processors; you don’t want to use it on AMD.
For this operation, at least, `mkl` seems to have the same issue: it runs faster on a single CPU than when parallelized across multiple CPUs. In general, it’s worth seeing whether switching BLAS implementations gives you a performance improvement.
If you’re using Conda-Forge, you can choose an implementation by adding either the package `blas=*=openblas` or the package `blas=*=mkl` to your dependencies.
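For example, if your dependencies live in a Conda environment file, the pin might look like this (a sketch assuming a hypothetical `environment.yml`; the `blas=*=openblas` spec selects the build variant):

```yaml
name: myproject
channels:
  - conda-forge
dependencies:
  - python=3.11
  - numpy
  - blas=*=openblas  # or: blas=*=mkl
```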
What if it’s just this benchmark?
One could argue that this is just one benchmark, and there are a variety of ways I could have screwed it up. And while that’s true, this is in the end just an example:
- The OpenBLAS issue tracker lists many cases where multi-threading slows things down; in some cases disabling multi-threading was the fix.
- From a broader perspective, switching to multiple threads always has costs and overhead (and see also Amdahl’s Law).
It would be extremely surprising, then, if running with N threads actually gave N× the performance.
So, yes, you often do want parallelism, but you need to think about where and how and when you use it.
Note: Whether or not any particular tool or technique will speed things up depends on where the bottlenecks are in your software.
Reduce parallelism to get more parallelism
If you’re implementing a high-level form of parallelism with Dask or the like, you might want to disable the multi-threading in NumPy. Individual operations will use less CPU time overall, and your own parallelism will ensure utilization of multiple CPUs.
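One way to do this, besides the `threadpool_limits` context manager shown earlier, is via environment variables. This is a sketch assuming your BLAS respects the standard thread-count variables; they must be set before NumPy (and therefore its BLAS) is first imported, since the thread pool is sized at initialization.

```python
import os

# These must be set before the first `import numpy`, because BLAS
# reads them when its thread pool is initialized.
os.environ["OMP_NUM_THREADS"] = "1"        # OpenMP-based BLAS builds
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # OpenBLAS
os.environ["MKL_NUM_THREADS"] = "1"        # Intel MKL

# ...only now: import numpy as np
```

If you’re launching worker processes yourself, setting these in each worker’s environment before it imports NumPy achieves the same effect.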
In addition, be careful when profiling: if you’re using automatic parallelization, what you’re profiling might not match the behavior on a different computer with a different number of CPUs.