Python’s multiprocessing performance problem

Because Python has limited parallelism when using threads, using worker processes is a common way to take advantage of multiple CPU cores. The multiprocessing module is built into the standard library, so it’s frequently used for this purpose.

But while multiple processes let you take advantage of multiple CPUs, moving data between processes can be very slow. And that can reduce some of the performance benefits of using worker processes.

Let’s see:

  • Why processes can have performance problems that threads don’t.
  • A number of ways to work around or deal with this performance overhead.
  • A bad solution you don’t want to use.

Threads vs. processes

Multiple threads let you run code in parallel, potentially on multiple CPUs. In Python, however, the global interpreter lock makes this parallelism harder to achieve.

Multiple processes also let you run code in parallel—so what’s the difference between threads and processes?

All the threads inside a single process share the same memory address space. If thread 1 in a process stores some memory at address 0x7f0cd1a88810, thread 2 can access the same memory at the same address. That means passing objects between threads is cheap: you just need to get the pointer to the memory address from one thread to the other. A memory address is 8 bytes: this is not a lot of data to move around.
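A minimal sketch of what that means in practice: two threads holding the very same object, so a mutation in one thread is immediately visible to the other, with no copying involved.

```python
import threading

# A mutable object created in the main thread:
shared = {"count": 0}

def worker():
    # The worker thread sees the exact same object, at the same
    # memory address, so mutating it is visible to the main thread:
    shared["count"] += 1

thread = threading.Thread(target=worker)
thread.start()
thread.join()

assert shared["count"] == 1  # the worker's mutation is visible here
```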

In contrast, processes do not share a memory address space. Operating systems typically provide some shared-memory facilities, and we’ll get to those later; by default, though, no memory is shared. That means you can’t just share the address of your data across processes: you have to copy the data.

If you’re passing a little bit of data between processes, that’s fine; if you’re passing a 1GB DataFrame… that may start getting expensive.

Multiprocessing in Python

So far we’ve been talking about processes at the operating system level, where the available facilities essentially involve copying bytes: from a file, from shared memory, or via a fancy hybrid of the two like mmap(). When you’re writing Python, though, you want to share Python objects between processes.
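To make the OS-level picture concrete, here’s a small sketch using Python’s mmap module: the building blocks the OS gives you operate on raw bytes, not on objects.

```python
import mmap

# An anonymous memory map: a region of raw bytes managed by the OS.
# OS-level sharing mechanisms work at this level--bytes, not Python
# objects--so anything more structured has to be encoded into bytes.
buf = mmap.mmap(-1, 1024)  # request 1 KiB from the OS
buf[0:5] = b"hello"        # write raw bytes into the region
data = bytes(buf[0:5])     # read raw bytes back out
buf.close()
```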

To enable this, when you pass Python objects between processes using Python’s multiprocessing library:

  1. On the sender side, the arguments get serialized to bytes with the pickle module.
  2. On the receiver side, the bytes are deserialized using pickle.
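In miniature, that round trip looks like this:

```python
import pickle

data = {"values": list(range(5))}

# Sender side: serialize the object to bytes...
blob = pickle.dumps(data)
assert isinstance(blob, bytes)

# ...the bytes get copied to the other process, and the receiver
# side then deserializes them back into an equal object:
restored = pickle.loads(blob)
assert restored == data
assert restored is not data  # it's a copy, not the same object
```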

This serialization and deserialization process involves computation, which can potentially be slow. Let’s try an example, comparing a thread pool to a process pool:

from time import time
import multiprocessing as mp
from multiprocessing.pool import ThreadPool
import numpy as np
import pickle

def main():
    arr = np.ones((1024, 1024, 1024), dtype=np.uint8)
    expected_sum = np.sum(arr)

    with ThreadPool(1) as threadpool:
        start = time()
        assert (
            threadpool.apply(np.sum, (arr,)) == expected_sum
        )
        print("Thread pool:", time() - start)

    with mp.get_context("spawn").Pool(1) as processpool:
        start = time()
        assert (
            processpool.apply(np.sum, (arr,))
            == expected_sum
        )
        print("Process pool:", time() - start)

if __name__ == "__main__":
    main()

If we run this we get the following:

$ python threads_vs_processes.py
Thread pool: 0.3097844123840332
Process pool: 1.8011224269866943

Running the code in a subprocess is much slower than running a thread, not because the computation is slower, but because of the overhead of copying and (de)serializing the data. So how do you avoid this overhead?

Reducing the performance hit of copying data between processes

Option #1: Just use threads

Processes have this overhead, threads do not. And while it’s true that generic Python code won’t parallelize well when using multiple threads, that’s not necessarily true for your Python code. For example, NumPy releases the GIL for many of its operations, which means you can use multiple CPU cores even with threads.

For example:

import numpy as np
from time import time
from multiprocessing.pool import ThreadPool

arr = np.ones((1024, 1024, 1024))

start = time()
for i in range(10):
    arr.sum()
print("Sequential:", time() - start)

expected = arr.sum()
start = time()
with ThreadPool(4) as pool:
    result = pool.map(np.sum, [arr] * 10)
    assert result == [expected] * 10
print("4 threads:", time() - start)

When run, we see that NumPy uses multiple cores just fine when using threads, at least for this operation:

$ python numpy_gil.py
Sequential: 4.253053188323975
4 threads: 1.3854241371154785

In situations where parallelism is possible with Python threads, like using much of NumPy’s API, there’s much less motivation to use processes. For more details, read this introduction to the GIL.

Pandas is built on NumPy, so many numeric operations will likely release the GIL as well. However, anything involving strings, or Python objects in general, will not. Another approach is to use a library like Polars, which is designed from the ground up for parallelism, to the point where you don’t have to think about it at all: it has an internal thread pool.

Note: Whether or not any particular tool or technique will speed things up depends on where the bottlenecks are in your software.


Option #2: Live with it

If you’re stuck with using processes, you might just decide to live with the overhead of pickling. In particular, if you minimize how much data gets passed back and forth between processes, and the computation in each process is significant enough, the cost of copying and serializing data might not significantly impact your program’s runtime. Spending a few seconds on pickling doesn’t really matter if your subsequent computation takes 10 minutes.
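If you want to check whether the overhead matters for your workload, you can measure the pickle round trip directly. A sketch, using a list as a hypothetical stand-in for the data you’d send to a worker:

```python
import pickle
from time import time

# Hypothetical stand-in for the data you'd pass to a worker process:
data = list(range(1_000_000))

start = time()
blob = pickle.dumps(data)        # what the sender side does
restored = pickle.loads(blob)    # what the receiver side does
round_trip = time() - start

# If round_trip is tiny compared to the per-task computation time,
# the copying overhead is probably fine to live with:
print(f"Pickle round trip took {round_trip:.3f} seconds")
assert restored == data
```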

In addition, it’s worth noting that Python has a faster version of pickling that is not enabled by default as of 3.11; it might be enabled in a future version. This will reduce the overhead of pickling somewhat, though it will still exist.
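A related facility that does exist today is pickle protocol 5’s out-of-band buffers, which let large binary payloads travel outside the main pickle stream instead of being copied into it. This may or may not be the faster pickling referred to above; it’s shown here as a related technique, sketched with NumPy (whose arrays support it):

```python
import pickle
import numpy as np

arr = np.ones((1024, 1024), dtype=np.uint8)  # 1 MiB of array data

buffers = []
# NumPy arrays support protocol 5's out-of-band buffers: the raw
# array data is handed to buffer_callback rather than being copied
# into the pickle stream itself.
blob = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)

# The pickle stream holds only metadata; the data traveled out-of-band:
assert len(blob) < arr.nbytes

restored = pickle.loads(blob, buffers=buffers)
assert (restored == arr).all()
```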

Option #3: Write the data to disk

Instead of passing data directly, you can write the data to disk, and then pass the path to this file to the subprocess (as an argument) or parent process (as the return value of the function running in the worker process). The recipient process can then parse the file.

Here’s an example comparing passing the DataFrame directly to passing it using a temporary Parquet file:

import pandas as pd
import multiprocessing as mp
from pathlib import Path
from tempfile import mkdtemp
from time import time


def noop(df: pd.DataFrame):
    # real code would process the dataframe here
    pass


def noop_from_path(path: Path):
    df = pd.read_parquet(path, engine="fastparquet")
    # real code would process the dataframe here
    pass


def main():
    df = pd.DataFrame({"column": list(range(10_000_000))})
    with mp.get_context("spawn").Pool(1) as pool:
        # Pass the DataFrame to the worker process
        # directly, via pickling:
        start = time()
        pool.apply(noop, (df,))
        print("Pickling-based:", time() - start)

        # Write the DataFrame to a file, pass the path to
        # the file to the worker process:
        start = time()
        path = Path(mkdtemp()) / "temp.parquet"
        df.to_parquet(
            path,
            engine="fastparquet",
            # Run faster by skipping compression:
            compression="uncompressed",
        )
        pool.apply(noop_from_path, (path,))
        print("Parquet-based:", time() - start)


if __name__ == "__main__":
    main()

If we run it, we can see the Parquet version is indeed faster:

$ python tofile.py
Pickling-based: 0.24182868003845215
Parquet-based: 0.17243456840515137

Parquet may or may not be faster in all situations, of course, and pickle will likely run faster in future versions of Python, but this sort of approach might be helpful in some situations.

Option #4: multiprocessing.shared_memory

Because processes sometimes do want to share memory, operating systems typically provide facilities for explicitly creating shared memory between processes. Python wraps these facilities in the multiprocessing.shared_memory module.

However, unlike threads, where the same memory address space allows trivially sharing Python objects, in this case you’re mostly limited to sharing arrays. And as we’ve seen, NumPy releases the GIL for expensive operations, which means you can just use threads, which is much simpler. Still, in case you ever need it, it’s worth knowing this module exists.

Note: The module also includes ShareableList, which is a bit like a Python list but limited to int, float, bool, small str and bytes, and None. But this doesn’t help you cheaply share an arbitrary Python object.
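Here’s a minimal sketch of the module in action. In real code, shm.name would be passed to another process, which would attach to the block by name; to keep the example self-contained, a second handle is attached within the same process to show that no copying is involved:

```python
from multiprocessing import shared_memory
import numpy as np

# Create a named block of shared memory, and build a NumPy array
# on top of it:
shm = shared_memory.SharedMemory(create=True, size=1024)
arr = np.ndarray((1024,), dtype=np.uint8, buffer=shm.buf)
arr[:] = 42

# A second handle, attached by name--in real code this would happen
# in a different process, using the name you passed to it:
shm2 = shared_memory.SharedMemory(name=shm.name)
arr2 = np.ndarray((1024,), dtype=np.uint8, buffer=shm2.buf)
assert (arr2 == 42).all()

arr2[0] = 7
ok = bool(arr[0] == 7)  # same underlying memory, so the write is visible

# Clean up: release the views into the buffers, close both handles,
# then free the shared memory:
del arr, arr2
shm2.close()
shm.close()
shm.unlink()
```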

A bad option for Linux: the "fork" context

You may have noticed we did multiprocessing.get_context("spawn").Pool() to create a process pool. This is because Python has multiple implementations of multiprocessing on some OSes. "spawn" is the only option on Windows, the only non-broken option on macOS, and available on Linux. When using "spawn", a completely new process is created, so you always have to copy data across.

On Linux, the default is "fork": the new child process has a complete copy of the memory of the parent process at the time of the child process’ creation. This means any objects in the parent (arrays, giant dicts, whatever) that were created before the child process was created, and were stored somewhere helpful like a module, are accessible to the child. Which means you don’t need to pickle/unpickle to access them.

Sounds useful, right? There’s only one problem: the "fork" context is super-broken, which is why it will stop being the default in Python 3.14.

Consider the following program:

import threading
import sys
from multiprocessing import Process

def thread1():
    for i in range(1000):
        print("hello", file=sys.stderr)

threading.Thread(target=thread1).start()

def foo():
    pass

Process(target=foo).start()

On my computer, this program consistently deadlocks: it freezes and never exits. Any time you have threads in the parent process, the "fork" context can cause deadlocks, or even corrupted memory, in the child process.

You might think that you’re fine because you don’t start any threads. But many Python libraries start a thread pool on import, for example NumPy. If you’re using NumPy, Pandas, or any other library that depends on NumPy, you are running a threaded program, and therefore at risk of deadlocks, segfaults, or data corruption when using the "fork" multiprocessing context. For more details see this article on why multiprocessing’s default is broken on Linux.

So in theory this is an option on Linux, but in practice you really don’t want to use it. Thus I won’t bother showing you how to pass data across processes this way. If you really want to know, there are articles elsewhere demonstrating this, but you’re just shooting yourself in the foot if you take this approach.

You need to measure

Before you spend too much time trying to fix this particular performance issue, you really should measure your software’s performance and figure out where its actual bottlenecks are. It’s quite possible that threading actually works just fine (option #1), or that the extra overhead from communicating across processes doesn’t matter (option #2).

You will only know if you profile your software and figure out what the actual bottlenecks are.