Who controls parallelism? A disagreement that leads to slower code

If you’re using NumPy, Polars, Zarr, or many other libraries, setting a single environment variable or calling a single API function might make your code run 20%-80% faster. Or, more accurately, it may be that your code is running that much more slowly than it ought to.

The problem? A conflict over who controls parallelism: your application, or the libraries it uses.

Let’s see an example, and how you can solve it.

The mystery of the speedy single-thread implementation

We’re going to be using the following example to measure time spent in code using a single Python thread, a Python thread pool, and a Python process pool. All three variations will calculate the dot product of two arrays.

from multiprocessing import cpu_count, get_context
from multiprocessing.pool import ThreadPool
import numpy as np
from time import time

ARRAY = np.ones((1500, 1500))
NUM_OPS = cpu_count() * 3


def multiply(_):
    ARRAY.dot(ARRAY)

def single_thread():
    start = time()
    for i in range(NUM_OPS):
        multiply(None)
    print(
        f"Single Python thread: {time() - start:.1f} secs"
    )

def threadpool():
    with ThreadPool(cpu_count()) as pool:
        start = time()
        pool.map(multiply, [None for i in range(NUM_OPS)])
        print(f"Python thread pool: {time() - start:.1f}"
              + " secs")

def multiprocess():
    with get_context("spawn").Pool(cpu_count()) as pool:
        # Make sure all processes are up and running:
        pool.map((3).__add__, range(NUM_OPS))

        start = time()
        pool.map(multiply, [None for i in range(NUM_OPS)])
        print(f"Multiprocessing: {time() - start:.1f} secs")

# The __main__ guard is required for the spawn-based process pool:
if __name__ == "__main__":
    single_thread()
    threadpool()
    multiprocess()

When I run this on my 12 core / 20 hyperthread computer, I get the following result:

Single Python thread: 1.2 secs
Python thread pool: 1.5 secs
Multiprocessing: 2.2 secs

Now, in general we would expect a thread pool or process pool to be faster than a single-threaded version of the code. It’s true that Python suffers from the Global Interpreter Lock, which can limit parallelism when using threads, but NumPy’s dot() releases the GIL while it does the actual computation. And we’re seeing a slowdown even with multiple processes, where the GIL isn’t a factor at all.

What’s going on? Why is one Python thread running faster than a thread pool or process pool?

A parallelism control conflict: OpenBLAS vs. your code

When NumPy runs linear algebra operations, it uses libraries implementing the BLAS standard to do the actual calculations. By default, it uses OpenBLAS. And out of the box OpenBLAS runs its operations in a thread pool.
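If you’re not sure which BLAS implementation your NumPy installation is using, one way to check is to print NumPy’s build configuration (the output format varies between NumPy versions):

import numpy as np

# Shows the BLAS/LAPACK libraries NumPy was built against,
# e.g. OpenBLAS or MKL:
np.show_config()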

If we add a little code to record how many threads are running in the process (or, for the process pool, summed across the worker processes), we can see how many threads each implementation actually uses:

Single Python thread: 1.1 secs
Actual threads: 20

Python thread pool: 1.5 secs
Actual threads: 43

Multiprocessing: 1.9 secs
Total threads in pool processes: 401

Now we see that the single-threaded version wasn’t single-threaded: it only had one Python thread, but OpenBLAS started a whole pool.
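One simple way to count OS-level threads on Linux (not necessarily how the numbers above were gathered) is to list the entries in /proc/<pid>/task:

import os

def count_os_threads(pid=None):
    # Each OS-level thread of a process shows up as a directory
    # under /proc/<pid>/task on Linux.
    if pid is None:
        pid = os.getpid()
    return len(os.listdir(f"/proc/{pid}/task"))

For the process pool, you would sum this count over the pool’s worker process IDs.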

To get a better understanding of what’s going on, let’s look at some visualizations. To simplify the diagrams, I am going to ignore hyperthreading, and just think through what would happen if I ran this code on a machine with 2 CPU cores.

A single Python thread

Here’s what processing looks like if we only have a single Python thread running a single dot() operation at a time:

[Diagram: the main Python thread submits work to OpenBLAS’s internal queue, which is consumed by OpenBLAS thread 1 and OpenBLAS thread 2.]

NumPy calls an OpenBLAS routine, which breaks the work down into smaller tasks, adds them to a queue, and then 2 worker threads process the tasks. Then eventually these get assembled back into the final result. There are 2 cores and 2 threads doing work, so we saturate all available compute resources.

Python thread pool

If we’re using a Python thread pool, we actually try to process 2 different dot() operations at once. That means the OpenBLAS queue gets more operations in it, and the results get handed back to the Python threads in a hard-to-predict order. I don’t know for certain why this case is slower, but I’d hypothesize it has something to do with memory cache locality getting worse, since more threads are being scheduled than in the previous case.

[Diagram: the Python main thread feeds a queue consumed by Python threads 1 and 2; both of those submit work to a single shared OpenBLAS queue, which is consumed by OpenBLAS threads 1 and 2.]

Python process pool

Once we have a process pool, each process gets its own OpenBLAS thread pool. So again we try to do 2 dot() operations at once, but this time we have 4 OpenBLAS threads competing over 2 CPUs, so the Linux scheduler keeps preempting them and the CPU caches likely keep getting churned.

[Diagram: the Python main thread in the main process feeds a queue consumed by a Python worker thread in each of worker processes 1 and 2; each worker process has its own OpenBLAS queue and its own pair of OpenBLAS threads, for four OpenBLAS threads in total.]

The problem: dueling thread/process pools

There are two places you can start thread pools or process pools: you can start them in your application code, or the library can start a thread pool of its own.

  • If you only create a single pool, whether in your application or in the library, and you’re starting one thread or process per CPU core, you can hopefully saturate all the cores while preserving cache locality.
  • But if both the application and the library start thread or process pools, you have a conflict. The number of threads and chunks of work being processed in parallel is larger than the number of cores, so you start seeing CPU cache evictions as the different threads and processes fight over resources.

To deal with this situation, OpenBLAS allows you to disable the thread pools by setting an environment variable called OPENBLAS_NUM_THREADS to 1. With this setting, OpenBLAS work is done in the thread that ran the operation, rather than being dispatched to a thread pool.
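For example, you can set it at the very top of your script, before NumPy (and therefore OpenBLAS) gets imported, since OpenBLAS typically reads the variable when it initializes its thread pool:

import os

# Must run before NumPy/OpenBLAS is imported for the setting to
# take effect:
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np

Alternatively, set the environment variable in the shell before launching Python.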

When we set that, we get the following results:

Single Python thread: 7.7 secs
Actual threads: 1

Python thread pool: 1.1 secs
Actual threads: 24

Multiprocessing: 1.1 secs
Total threads in pool processes: 21

This time around we can see the single Python thread is actually the only thread—and the lack of parallelism makes everything slower. For the thread pool and process pool, we’re back to having about one thread or process per CPU.

Going back to our visualizations using a simplified 2-core CPU, we can see the impact of setting OPENBLAS_NUM_THREADS=1 on the threading model.

A single Python thread with OPENBLAS_NUM_THREADS=1

We’re only using 1 core, so things take longer:

[Diagram: just the main thread, doing all the work itself.]

Python thread pool with OPENBLAS_NUM_THREADS=1

We have 2 worker threads for the 2 cores:

[Diagram: the Python main thread feeds a queue consumed by Python threads 1 and 2; there are no separate OpenBLAS threads.]

Python process pool with OPENBLAS_NUM_THREADS=1

We have 2 worker processes for the 2 cores:

[Diagram: the Python main thread in the main process feeds a queue; worker processes 1 and 2 each run a single Python worker thread that consumes tasks from it, with no extra OpenBLAS threads.]

It’s not just OpenBLAS

The same issue we see here with OpenBLAS can also occur with other BLAS libraries; NumPy can use Intel’s MKL, for example. More broadly, many other Python libraries have built-in thread pools:

  • Polars,
  • the Zarr data format library (via Blosc, specifically for compression and decompression),
  • Numexpr,
  • and others.
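Each of these has its own knob for limiting its internal thread pool. As a rough sketch (the exact APIs and environment variables can change between versions, so check each library’s documentation):

import os

# Polars sizes its thread pool when it is first imported, so this
# has to be set before `import polars`:
os.environ["POLARS_MAX_THREADS"] = "1"

import numexpr

# Numexpr lets you change its thread count at runtime:
numexpr.set_num_threads(1)

For Zarr, the Blosc compressor it relies on honors a BLOSC_NTHREADS environment variable, but check the Zarr and numcodecs documentation for the mechanism your version supports.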

The solution: just one pool

Instead of having 2N, or 3N, or (even worse!) N×N threads, so long as you are dealing purely with computation you probably want just one pool, of threads or of processes, with one worker per CPU core.

Note: Whether or not any particular tool or technique will speed things up depends on where the bottlenecks are in your software.


Waiting on reads from disk or the network changes the calculation, since you’re no longer limited only by your N CPU cores; some of your threads may be blocked on I/O rather than computing. But focusing just on computation, that leaves us with two options:

Option 1: Single-threaded Python, libraries control the thread pools

In this option you never start any threads or processes in your Python application code. Instead, you only rely on the thread pools provided by the libraries you’re using.

For example, if you’re loading data with Zarr and then processing it with NumPy’s OpenBLAS-backed routines, you’ll initially get parallelism from the Zarr thread pool, and afterwards from the OpenBLAS thread pool. The two thread pools won’t compete for resources, because all the Zarr processing will be finished by the time OpenBLAS is called.

If your library is multi-threaded by default, like Polars, this may be a fine option. But consider that NumPy is single-threaded for everything other than linear algebra operations that use BLAS. For libraries like NumPy that are single-threaded, or mostly so, this model won’t give you parallelism.

Option 2: Thread pool or process pool for Python started by your application

In this option, your application is responsible for starting a thread pool or process pool, directly or via something like Dask. This gives you parallelism even if the libraries you rely on are single-threaded by default, so long as they either release the GIL or you’re using a process pool.

If you go this route, you want to make sure to disable parallelism in any libraries you’re using that expect to start their own thread pools. How to do so varies:

  • We saw you can control OpenBLAS with the OPENBLAS_NUM_THREADS=1 environment variable.
  • threadpoolctl works for BLAS and OpenMP-based libraries; see the sketch at the end of this post for an example.
  • For other libraries you will need to check their documentation.
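To illustrate the threadpoolctl route, here’s a minimal sketch of Option 2 applied to the earlier example: the application owns the process pool, and each worker process limits any BLAS library it loads to a single thread. This is only a sketch; it assumes threadpoolctl is installed, and it repeats the ARRAY/multiply setup from the example at the start of this post.

from multiprocessing import Pool, cpu_count
import numpy as np
from threadpoolctl import threadpool_limits

ARRAY = np.ones((1500, 1500))

def limit_blas_threads():
    # Runs once in each worker process: restrict any BLAS library
    # (OpenBLAS, MKL, ...) loaded there to a single thread.
    threadpool_limits(limits=1, user_api="blas")

def multiply(_):
    ARRAY.dot(ARRAY)

if __name__ == "__main__":
    # The application owns the only pool; the libraries don't get
    # thread pools of their own.
    with Pool(cpu_count(), initializer=limit_blas_threads) as pool:
        pool.map(multiply, [None] * (cpu_count() * 3))

Setting OPENBLAS_NUM_THREADS=1 before launching the script would accomplish the same thing for OpenBLAS specifically; threadpoolctl just handles other BLAS implementations as well.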