When Python can’t thread: a deep-dive into the GIL’s impact

Most computers these days come with multiple cores, allowing multiple threads to run computations in parallel. And even without multiple cores, you can have concurrency, for example one thread waiting on disk while another runs code on the CPU. The ability to use parallelism can be critical to scaling your application—or making your data processing finish faster.

Unfortunately, in many cases Python can only run one thread at a time, due to what’s known as the Global Interpreter Lock (“GIL”). Other times it can run multiple threads just fine; it all depends on the specific usage patterns.

But which usage patterns allow parallelism, and which don’t? Naive mental models will give you inaccurate answers. So in this article you’ll build a practical mental model of how the GIL works:

  • We’ll start by going through a series of increasingly accurate mental models of how the GIL works.
  • Then, we’ll see how our new, more accurate mental model can help you predict where and whether parallelism bottlenecks will occur.

When does a Python thread need to hold the GIL?

The GIL is an implementation detail of the CPython interpreter, a threading lock: only one thread can acquire the lock at a given time. So to understand how the GIL impacts Python’s ability to run code in parallel in threads, we need to answer a key question: when does a Python thread need to hold the GIL?

To understand that, we’ll build a series of increasingly accurate mental models of how the GIL works.

Model #1: Only one thread can run Python code at a time

Consider the following code; it runs the function go() in two threads:

import threading
import time

def go():
    start = time.time()
    while time.time() < start + 0.5:
        sum(range(10000))

def main():
    threading.Thread(target=go).start()
    time.sleep(0.1)
    go()

main()

When we run it using the Sciagraph performance profiler, here’s what the execution timeline looks like:

Notice how the threads switch back and forth between waiting and running on the CPU: the running code holds the GIL, the waiting thread is waiting for the GIL.

If the GIL hasn’t been released for 5ms (or some other configurable interval), Python tells the currently running thread to release the GIL. Then, the next thread can run. With two threads we see it switching back and forth; the intervals shown are longer than 5ms because this is a sampling profiler, taking samples only every 47ms or so.

So that’s our initial mental model:

  1. A thread must hold the GIL to run Python code.
  2. Other threads can’t acquire the GIL, and therefore can’t run, until the currently running thread releases it, which happens every 5ms automatically.
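One way to check this model on your own machine is to time the same CPU-bound work run twice in a row, and then in two threads at once. This is only a rough sketch: busy(), timed(), sequential() and threaded() are illustrative names that don’t appear in the code above, and the exact numbers depend on your hardware and Python version. On a standard CPython build with the GIL, the two timings should come out roughly the same.

```python
import threading
import time

def busy(iterations=2000):
    # Pure-Python CPU work; the thread running it needs the GIL.
    total = 0
    for _ in range(iterations):
        total += sum(range(10000))
    return total

def timed(func):
    # Return how many seconds func() takes to run.
    start = time.perf_counter()
    func()
    return time.perf_counter() - start

def sequential():
    busy()
    busy()

def threaded():
    threads = [threading.Thread(target=busy) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    print(f"sequential:  {timed(sequential):.2f}s")
    print(f"two threads: {timed(threaded):.2f}s")
```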

Model #2: Releasing the GIL every 5ms is not guaranteed

The GIL is released every 5ms by default in Python 3.7 to 3.10 (this may change in the future), allowing other threads to run:

>>> import sys
>>> sys.getswitchinterval()
0.005
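The interval can also be changed at runtime with sys.setswitchinterval(). The sketch below shortens it to 1ms and then restores the original value; the 1ms figure is an arbitrary choice for illustration, not a recommendation.

```python
import sys

original = sys.getswitchinterval()
try:
    # Ask the interpreter to request a GIL switch every 1ms instead.
    sys.setswitchinterval(0.001)
    changed = sys.getswitchinterval()
finally:
    # Restore the previous value so the rest of the program is unaffected.
    sys.setswitchinterval(original)

print("original:", original)
print("changed: ", changed)
```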

But this GIL release is best-effort; it is not guaranteed. Consider a pseudo-code sketch of a naive interpreter loop running in a thread; Python’s implementation is rather different, but the same principle applies:

while True:
    if time_to_release_gil():
        temporarily_release_gil()
    run_next_python_instruction()

So long as run_next_python_instruction() isn’t finished, temporarily_release_gil() won’t get called. Most of the time this doesn’t happen because individual operations (adding two integers, appending to a list, and so on) finish quickly. The interpreter can therefore frequently check if it’s time to release the GIL.
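To make the consequence concrete, here is a toy simulation of that naive loop. It is not how CPython works internally: simulate_release_points() is a made-up helper, and each “instruction” is just a duration in microseconds. It shows that release checks only happen between instructions, so a single long instruction delays the release until it finishes.

```python
def simulate_release_points(durations_us, switch_interval_us=5000):
    """Return the simulated times (in microseconds) at which the GIL is released."""
    now = 0
    last_release = 0
    releases = []
    for duration in durations_us:
        # time_to_release_gil(): only checked between instructions.
        if now - last_release >= switch_interval_us:
            releases.append(now)  # temporarily_release_gil()
            last_release = now
        now += duration  # run_next_python_instruction()
    return releases

# Twenty quick 1ms instructions: a release every 5ms, as intended.
quick = simulate_release_points([1000] * 20)
# One 2-second instruction: no release is possible until it returns.
blocked = simulate_release_points([1000, 2_000_000, 1000, 1000])

print(quick)    # [5000, 10000, 15000]
print(blocked)  # [2001000]
```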

That being said, long-running operations can prevent the GIL from being released automatically. Let’s write a small extension in Cython, a Python-like language that gets compiled down to C. It calls the sleep() function in the standard C library:

cdef extern from "unistd.h":
    unsigned int sleep(unsigned int seconds)

def c_sleep(unsigned int seconds):
    sleep(seconds)

We can compile it to an importable Python extension with the cythonize tool included with Cython:

$ cythonize -i c_sleep.pyx
...
$ ls c_sleep*.so
c_sleep.cpython-39-x86_64-linux-gnu.so

We’ll call it from a Python program that tries to call c_sleep() in a thread in parallel to a CPU computation on the main thread:

import threading
import time
from c_sleep import c_sleep

def thread():
    c_sleep(2)

threading.Thread(target=thread).start()

start = time.time()
while time.time() < start + 2:
    sum(range(10000))

Here’s the Sciagraph performance profiler output from running it:

The main thread was unable to run until the sleeping thread finished; it seems the GIL was never released at all by the sleeping thread. This is because the call to c_sleep(2) doesn’t return for 2 seconds. Until those 2 seconds are up, the Python interpreter loop isn’t running, and therefore won’t check to see if it should automatically release the GIL.

Here’s our updated mental model:

  1. A Python thread must hold the GIL to run code.
  2. Other Python threads can’t acquire the GIL, and therefore can’t run, until the currently running thread releases it, which happens every 5ms automatically.
  3. Long-running (“blocking”) extension code prevents the automatic switching.
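If you don’t have Cython installed, you can see a similar effect using only the standard library. The sketch below assumes a Unix-like system where loading the current process with ctypes exposes the C library’s usleep(), and a standard CPython build with the GIL. ctypes.PyDLL keeps holding the GIL during the foreign call, while ctypes.CDLL releases it. main_thread_delay() is an illustrative helper: it measures how long the main thread needs to start the worker and finish a 10ms sleep of its own.

```python
import ctypes
import threading
import time

# Assumption: Unix-like system; passing None loads the current process,
# which makes libc's usleep() available.
holds_gil = ctypes.PyDLL(None)     # the GIL stays held during the C call
releases_gil = ctypes.CDLL(None)   # the GIL is released during the C call

def main_thread_delay(lib, sleep_us=300_000):
    """Run lib.usleep() in a thread; return how long the main thread took
    to start that thread and complete a 10ms sleep."""
    worker = threading.Thread(target=lib.usleep, args=(sleep_us,))
    start = time.perf_counter()
    worker.start()
    time.sleep(0.01)  # the main thread must reacquire the GIL to return
    elapsed = time.perf_counter() - start
    worker.join()
    return elapsed

if __name__ == "__main__":
    print(f"GIL held in C code:     {main_thread_delay(holds_gil):.3f}s")
    print(f"GIL released in C code: {main_thread_delay(releases_gil):.3f}s")
```

With the GIL held, the main thread should be stuck for roughly the full 0.3 seconds; with the GIL released, it should finish in about 10ms.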

Model #3: Non-Python code can explicitly release the GIL

If we run time.sleep(3), that will do nothing for 3 seconds. We saw above that long-running extension code can prevent the GIL from being automatically switched between threads. So does that mean other threads can’t run at the same time as time.sleep()?

Let’s try with the following code, which tries to run a 3-second sleep in parallel to a 5-second computation in the main thread:

import threading
from time import time, sleep

program_start = time()

def thread():
    sleep(3)
    print("Sleep thread done, elapsed:", time() - program_start)

threading.Thread(target=thread).start()

# Do 5-second calculation in main thread:
calc_start = time()
while time() < calc_start + 5:
    sum(range(10000))
print("Main thread done, elapsed:", time() - program_start)

If we run it, we see:

$ time python gil2.py 
Sleep thread done, elapsed: 3.0081260204315186
Main thread done, elapsed: 5.000330924987793
real    0m5.068s
user    0m4.977s
sys     0m0.011s

If only one thread were running at a time, we’d expect the program to take 8 seconds: 3 for the sleep and 5 for the calculation. Instead it finished in about 5 seconds, which means the sleeping thread and the main thread ran in parallel!

Here’s the Sciagraph performance profiler output:

If we dig into the implementation of time.sleep, we can see what’s going on:

        int ret;
        Py_BEGIN_ALLOW_THREADS
#ifdef HAVE_CLOCK_NANOSLEEP
        ret = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &timeout_abs, NULL);
        err = ret;
#elif defined(HAVE_NANOSLEEP)
        ret = nanosleep(&timeout_ts, NULL);
        err = errno;
#else
        ret = select(0, (fd_set *)0, (fd_set *)0, (fd_set *)0, &timeout_tv);
        err = errno;
#endif
        Py_END_ALLOW_THREADS

If we look at the documentation for Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS, we’ll see that this is one way to release the GIL. The C implementation is explicitly releasing the GIL while calling the underlying operating system sleep function. This is another way the GIL gets released that is separate from the automated every-5ms switch we’ve seen so far.

Any code that has released the GIL and isn’t trying to acquire it—in this case for the duration of the sleep—does not block other threads that do want the GIL. As a result, we can run as many threads as we want in parallel, so long as they explicitly release the GIL.
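As a quick check of that claim, the sketch below starts several threads that each call time.sleep() and measures the total wall-clock time. run_sleepers() is an illustrative helper name. If the sleeps really overlap, the total should be close to a single sleep rather than the sum of all of them.

```python
import threading
import time

def run_sleepers(count=8, seconds=0.25):
    """Sleep in `count` threads at once and return the elapsed wall-clock time."""
    threads = [
        threading.Thread(target=time.sleep, args=(seconds,))
        for _ in range(count)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    elapsed = run_sleepers()
    print(f"8 threads sleeping 0.25s each took {elapsed:.2f}s in total")
```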

So here’s our new mental model:

  1. A thread must hold the GIL to run Python code.
  2. Other threads can’t acquire the GIL, and therefore can’t run, until the currently running thread releases it, which happens every 5ms automatically.
  3. Long-running (“blocking”) extension code prevents the automatic switching.
  4. Python extensions written in C (or other low-level languages) can however explicitly release the GIL, allowing one or more threads to run in parallel to the GIL-holding thread.
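Sleeping is not the only case. The standard library’s hashlib documents that it releases the GIL while hashing inputs larger than a couple of kilobytes, so hashing large buffers is CPU-bound work that can overlap across threads. The sketch below is a rough way to observe that; hash_once() and hash_in_threads() are illustrative names, and the speedup you see depends on how many cores you have.

```python
import hashlib
import threading
import time

# 16 MiB of input, far above the small-input size below which
# hashlib keeps holding the GIL.
DATA = b"x" * (16 * 1024 * 1024)

def hash_once(results, index):
    # sha256() releases the GIL while it processes a large buffer.
    results[index] = hashlib.sha256(DATA).hexdigest()

def hash_in_threads(count=4):
    """Hash DATA in `count` threads; return the digests and the elapsed time."""
    results = [None] * count
    threads = [
        threading.Thread(target=hash_once, args=(results, i))
        for i in range(count)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.perf_counter() - start

if __name__ == "__main__":
    start = time.perf_counter()
    hashlib.sha256(DATA).hexdigest()
    single = time.perf_counter() - start
    _, elapsed = hash_in_threads(4)
    print(f"one hash: {single:.3f}s, four hashes in threads: {elapsed:.3f}s")
```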

Model #4: Calls into the Python C API require the GIL

So far we’ve said the C code is able to release the GIL under some circumstances, but we haven’t said when. Because the GIL is needed to protect access to internal implementation details of the CPython interpreter, the answer is more-or-less: you must hold the GIL whenever calling a CPython C API.

For example, let’s say you want to construct a Python dict object using C (or Rust, or Cython). If you look through the CPython C API for creating dictionaries, you can see you’d be using functions like PyDict_New and PyDict_SetItem. If you’re writing C, you’d be calling them directly; in the case of Rust and Cython, the PyO3 library or generated code respectively would end up calling these APIs.

Almost every CPython C API requires the calling thread to hold the GIL, with some rare exceptions.
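You can get a feel for these functions without writing any C. On CPython, ctypes.pythonapi exposes the C API, and it is a ctypes.PyDLL, meaning calls through it are made with the GIL held, which is exactly what these functions require. This is a sketch for illustration only; real extensions call these functions from compiled code.

```python
import ctypes

# ctypes.pythonapi is a PyDLL: the GIL is held while these functions run.
api = ctypes.pythonapi

# Declare signatures so ctypes passes and returns Python objects correctly.
api.PyDict_New.restype = ctypes.py_object
api.PyDict_New.argtypes = []
api.PyDict_SetItem.restype = ctypes.c_int
api.PyDict_SetItem.argtypes = [
    ctypes.py_object, ctypes.py_object, ctypes.py_object,
]

d = api.PyDict_New()                           # like d = {}
status = api.PyDict_SetItem(d, "answer", 42)   # like d["answer"] = 42

print(d, status)  # {'answer': 42} 0
```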

So here’s our final mental model:

  1. A thread must hold the GIL to call CPython C APIs.
  2. Python code running in the interpreter, like x = f(1, 2), uses those APIs. Every == comparison, every integer addition, every list.append: it’s all calls into Python C APIs. Thus threads running Python code must hold on to the lock when running.
  3. Other threads can’t acquire the GIL, and therefore can’t run, until the currently running thread releases it, which happens every 5ms automatically.
  4. Long-running (“blocking”) extension code prevents the automatic switching.
  5. Python extensions written in C (or other low-level languages) can however explicitly release the GIL, allowing one or more threads to run in parallel to the GIL-holding thread.

The parallelism implications of the GIL

So how does all this impact Python’s ability to run code in multiple threads at once? It varies quite a lot, depending on what code you’re considering.

The good scenario: Long-running C APIs that release the GIL

The best case scenario for parallelism is long-running C/C++/Rust/etc. code that spends most of its time not using CPython C APIs. It can then release the GIL and allow other threads to run. But even here, there are limits to be aware of.

For example, let’s consider an extension module written in C or Rust that lets you talk to a PostgreSQL database server.

Conceptually, handling a SQL query with this library will go through three steps:

  1. Deserialize from Python to the internal library representation. Since this will be reading Python objects, it needs to hold the GIL.
  2. Send the query to the database server, and wait for a response. This doesn’t need the GIL.
  3. Convert the response into Python objects. This needs the GIL again.

As you can see, how much parallelism you can get depends on how much time is spent in each step. If the bulk of time is spent in step 2, you’ll get parallelism there. But if, for example, you run a SELECT and get a large number of rows back, the library will need to create many Python objects, and step 3 will have to hold the GIL for a while.
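The three steps can be mimicked in pure Python to see how the balance plays out. Everything here is hypothetical: serialize_query(), wait_for_server() and build_rows() are stand-ins invented for this sketch, not the API of any real database library, and time.sleep() plays the role of waiting on the network because it releases the GIL.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def serialize_query(query):
    # Step 1: reads Python objects, so it needs the GIL.
    return query.encode("utf-8")

def wait_for_server(payload, latency):
    # Step 2: stand-in for the network round trip; sleeping releases the GIL.
    time.sleep(latency)
    return 1000  # hypothetical number of rows in the response

def build_rows(row_count):
    # Step 3: creates many Python objects, so it needs the GIL again.
    return [(i, f"name-{i}") for i in range(row_count)]

def run_query(query, latency=0.2):
    payload = serialize_query(query)
    row_count = wait_for_server(payload, latency)
    return build_rows(row_count)

def run_concurrently(queries):
    """Run every query in its own thread; return results and elapsed time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        results = list(pool.map(run_query, queries))
    return results, time.perf_counter() - start

if __name__ == "__main__":
    results, elapsed = run_concurrently(["SELECT * FROM users"] * 4)
    print(f"4 queries, 0.2s server wait each: {elapsed:.2f}s total")
```

With a small response, step 2 dominates and the four queries overlap almost completely. Raising the row count in wait_for_server() shifts more time into step 3, where the threads have to take turns.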

Bad scenario #1: “pure” Python code

“Pure” Python code, interacting with built-in Python objects like dicts, integers, lists, etc., and lacking calls into blocking low-level code, will be constantly using the Python C API:

l = []
for i in range(1000):
    l.append(i * i)

As such, you’re not going to get any parallelism. But you will get threads switching back and forth every 5ms or so, so at least you’ll make progress on all threads.

Bad scenario #2: Long-running C/Rust APIs, but author forgot to release GIL

Even if it’s possible to release the GIL, the author of a low-level C/Rust/Cython/etc. library may have forgotten to do so. In that case you won’t get any parallelism. Even worse, if the code takes more than 5ms to run, you also won’t get automatic switching, so other threads won’t make any progress.

If you suspect this is a problem, for example based on profiling output, you can identify this sort of problem by:

  • Using the gil_load utility to figure out if the GIL really is a bottleneck.
  • Reading the source code of the extension and looking for a lack of GIL release, or by using the techniques described in the article Tracing the Python GIL.

Bad scenario #3: Low-level code with pervasive Python C API usage

Another case where you won’t get much parallelism is C/Rust extensions where usage of the Python C API is pervasive. Consider, for example, a JSON parser that is reading the following string:

[1, 2, 3]

The parser will:

  1. Read a few bytes, and then create a Python list.
  2. Then it will read a couple more bytes, and then create a Python integer and append it to the list.
  3. This continues until it runs out of data.

Creating all these Python objects requires using the CPython C APIs, and therefore requires holding the GIL. Releasing and reacquiring the GIL repeatedly, or on some sort of schedule, has a performance cost, and most JSON documents can be parsed very quickly. Given the above, the natural choice for the author of a JSON parser is to never release the GIL.

Let’s validate this hypothesis by observing how Python’s built-in JSON parser impacts parallelism when we read two large documents in two threads. Here’s the code:

import json
import threading

def load_json():
    with open("large.json") as f:
        return json.load(f)

threading.Thread(target=load_json).start()

load_json()

And here’s the execution timeline:

Sadly, there is no parallelism: the two JSON loads essentially don’t (and can’t) run in parallel.
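To try this yourself you need a large.json file. The contents of the file used for the timeline aren’t shown here, so the helper below is just one assumed way to generate a stand-in: write_large_json() is a made-up function, and the record shape and count are arbitrary.

```python
import json

def write_large_json(path="large.json", records=500_000):
    """Write a JSON list of small objects; return the number of records."""
    data = [
        {"id": i, "name": f"item-{i}", "values": [i, i + 1, i + 2]}
        for i in range(records)
    ]
    with open(path, "w") as f:
        json.dump(data, f)
    return len(data)
```

Calling write_large_json() once in the working directory creates a file of a few tens of megabytes, large enough for the two loads to take a noticeable amount of time.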

Avoiding the GIL

As you can see in the example scenarios above, even in situations where the GIL can be released, you will still hit limits in parallelism when you need to interact with Python objects. If you’re converting a SQL response into Python objects, a large response means a lot of time without parallelism. If this becomes a bottleneck, one option is to switch to multiple processes.
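A minimal sketch of the multiple-process option, using the standard library’s ProcessPoolExecutor: each worker process has its own interpreter and its own GIL, so the parsing can run in parallel. count_records() and count_in_processes() are illustrative names. Keep in mind that arguments and results are copied between processes, which has its own cost.

```python
import json
from concurrent.futures import ProcessPoolExecutor

def count_records(document):
    # Runs in a worker process, which has its own GIL.
    return len(json.loads(document))

def count_in_processes(documents, workers=2):
    """Parse each JSON document in a separate process; return record counts."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_records, documents))

if __name__ == "__main__":
    docs = [json.dumps(list(range(100_000)))] * 2
    print(count_in_processes(docs))  # [100000, 100000]
```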

Or, if you’re writing your own low-level extensions to Python, you can adopt a design pattern you will see in libraries like NumPy and Pandas. As much as possible, use an internal data representation that is not composed of Python objects.

A NumPy array of integers is very much not a Python list containing Python integers; it’s a more efficient representation that requires no Python C API usage, not to mention using much less memory. By minimizing interaction with Python objects, and only having one Python object exposed (for example, the NumPy array), the GIL can be released for as long as possible.
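The memory difference is easy to see even without NumPy. The sketch below uses the standard library’s array module as a stand-in for a NumPy array: like NumPy, it stores raw 64-bit integers inline instead of pointers to Python int objects. The sizes are approximate, since sys.getsizeof() only reports each object’s own footprint.

```python
import sys
from array import array

n = 100_000
as_list = list(range(n))
as_array = array("q", range(n))  # "q": signed 64-bit integers stored inline

# A list holds pointers, and each element is a separate Python int object.
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
# The array is one Python object wrapping a contiguous buffer of raw integers.
array_bytes = sys.getsizeof(as_array)

print(f"list of ints: ~{list_bytes:,} bytes")
print(f"array('q'):   ~{array_bytes:,} bytes")
```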