Understanding CPUs can help speed up Numba and NumPy code

When you need to speed up your NumPy processing—or just reduce your memory usage—the Numba just-in-time compiler is a great tool. It lets you write Python code that gets compiled at runtime to machine code, allowing you to get the kind of speed improvements you’d get from languages like C, Fortran, or Rust.

Or at least, that’s the theory. In practice, your initial Numba code may be no faster than the NumPy equivalent.

But you can do better, once you have a better understanding of how CPUs work. And this knowledge will help you more broadly with any compiled language.

In this article we’ll:

  • Consider a simple image-processing problem.
  • Try, and initially fail, to speed it up with Numba.
  • Review just a little of what makes modern CPUs so fast, and where compilers hit their limits.
  • Based on that understanding, tweak our code to run 25× faster than the original version.

The problem: removing noise from an image

Imagine you’re getting 16-bit images from a digital microscope, and you only care about the bright parts of the image. Dark areas contain no information, but plenty of noise. If we want to clean those areas up, a simplistic algorithm is to set all values below a certain threshold to complete black, i.e. 0.

Here’s how we’d do it with NumPy:

image[image < 1000] = 0
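
Under the hood, this one-liner makes two full passes over the data and allocates a temporary boolean mask; that’s part of the overhead we’ll try to beat:

mask = image < 1000  # temporary boolean array, one byte per pixel
image[mask] = 0      # second pass: write zeros wherever the mask is True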

To see how fast this runs, I generated a simulated image:

import numpy as np
rng = np.random.default_rng(12345)
noise = rng.integers(0, high=1000, size=(4096, 4096), dtype=np.uint16)
signal = rng.integers(0, high=5000, size=(4096, 4096), dtype=np.uint16)
image = noise | signal

And then timed it using the %timeit magic available in Jupyter and IPython, with Turbo Boost disabled on my CPU. The code mutates the image, so we want to operate on a copy; since copying adds overhead, we also measure that overhead on its own:

>>> %timeit image2 = image.copy()
5.84 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit image2 = image.copy(); image2[image2 < 1000] = 0
54.2 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

It takes around 48ms (54ms - 6ms) to run this simplistic algorithm using NumPy. Can we do better?

Attempt #0: The original algorithm, translated to Numba

Numba allows us to write code in a subset of Python and have it compile to native code. To implement our algorithm, as a first pass we can do the following:

from numba import njit

@njit
def remove_noise_numba_0(arr, noise_level):
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            if arr[i, j] < noise_level:
                arr[i, j] = 0

remove_noise_numba_0(image.copy(), 1000)

We make sure to run the function once so that timing runs later on don’t include the one-off time to compile the code.
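
As an aside, Numba can also compile eagerly: if you pass @njit an explicit signature, compilation happens at decoration time rather than on the first call. A minimal sketch (remove_noise_eager is just an illustrative name):

@njit("void(uint16[:, :], uint16)")  # compile immediately for this signature
def remove_noise_eager(arr, noise_level):
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            if arr[i, j] < noise_level:
                arr[i, j] = 0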

If we measure this with %timeit, it takes 46ms, not much better than our NumPy code.

Attempt #1: Choosing good types

While Numba code visually looks like Python, it’s best to think of it as C or another statically-typed compiled language: Numba generates machine code specialized for the types of the data passed in.

For the array, we know the type: a 2D array of unsigned 16-bit integers. But the type of noise_level is something Numba has to guess: we passed in a Python integer, and Numba will want to convert it to a primitive type, the equivalent of some NumPy dtype.

We know however that it should also be an unsigned 16-bit integer (the noise level can’t be negative, or larger than the brightest spot!), so we can tweak the code to enforce that:

@njit
def remove_noise_numba_1(arr, noise_level):
    noise_level = arr.dtype.type(noise_level)
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            if arr[i, j] < noise_level:
                arr[i, j] = 0

remove_noise_numba_1(image.copy(), 1000)
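
If you want to double-check what Numba actually inferred, the compiled function object offers some introspection; these are real Numba APIs, though the exact output depends on your Numba version:

print(remove_noise_numba_1.signatures)  # list of compiled type signatures
remove_noise_numba_1.inspect_types()    # source annotated with inferred types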

While not intended to speed things up—the goal was correctness—this change happens to improve the speed to 33ms. Can we do better?

To answer that, let’s consider how CPUs work.

A better mental model for modern CPUs

First, everything we’re talking about here involves a single CPU core. We’re completely ignoring the possibility of parallelizing across multiple cores, and assuming a single thread running on a single core.

A first-pass mental model of CPUs is that they read machine instructions and then execute them one by one.

While easy to understand, and sufficient for many purposes, this model isn’t accurate enough if you want to write faster code. The one-instruction-at-a-time model implies that adding instructions slows the code down in proportion to how many you add. That is not always the case: different instructions can have very different performance impacts.

Modern CPUs can run instructions in parallel

In fact, modern CPUs you’ll encounter on desktops, laptops, and servers can run multiple instructions at the same time, so long as the instructions don’t interfere with each other. Consider the following code; pretend it’s in C or some other compiled language, so that expressions map in a straightforward manner to CPU instructions:

a = a + 1;
b = b + 1;

The two lines of code are unrelated to each other. If translated to machine code, the CPU may well be able to run both more or less at the same time. Again, this has nothing to do with multiple CPU cores or multiple threads: this is all a single thread on a single core.

Now consider what happens in the following code:

a = a + 1;
if (a > 100) {
    b = b + 1;
}

In this example, incrementing b depends on the value of a. That means the third line can’t be executed until the first line has finished and the comparison has been made. No more parallelism, and our code is suddenly much slower!

Modern CPUs do speculative execution and branch prediction

In order to get back some of that missing parallelism, modern CPUs will execute later code—and predict which branch will be taken in a condition—based on the assumption that most code is actually quite predictable. Consider:

int a = 0;
int b = 0;
while (true) {
    a = a + 1;
    if (a > 100) {
        b = b + 1;
    }
}

For the first hundred iterations, b will never be incremented. After that, b will always be incremented. Other than during the transition period, the CPU can likely predict what will happen next and speculatively choose which code to execute.

What happens if the CPU guesses wrong? It has to undo the work, which is expensive; as a result, failed branch predictions can be a significant hit to performance.
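
We can see this with the code we already have: the branch in our Numba loop is trivial to predict on all-zero data, and close to a coin flip on our noisy image. A quick experiment you can try; I’m deliberately not quoting numbers here, since they’ll depend on your CPU:

predictable = np.zeros((4096, 4096), dtype=np.uint16)  # branch always taken

>>> %timeit remove_noise_numba_1(predictable.copy(), 1000)  # predictor nearly always right
>>> %timeit remove_noise_numba_1(image.copy(), 1000)        # frequent mispredictions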

Modern CPUs have special instructions for doing parallel work: SIMD

In addition to transparently parallelizing instructions, modern CPUs have SIMD instructions: Single Instruction Multiple Data. The idea is that you can do the same operation to multiple entries in an array with a single CPU instruction.

For example, here we manually add 100 to four array entries, in a loop:

int array[4] = {1, 2, 3, 4};
for (int i=0; i < 4; i++) {
    array[i] += 100;
}

Instead, with SIMD we can use a special CPU instruction to add 100 to all four array items at once. Even if that instruction is slightly slower than a regular addition, swapping four instructions (and maybe part of the loop overhead) for a single instruction means we can expect the code to run much faster overall.

These special instructions are possible because we’re doing the same operation to all four values, in this case adding a constant. Adding two arrays to each other element by element would again be the same operation happening to all items. The key is in the name: single instruction, multiple data.
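
This is also why NumPy’s whole-array operations are a natural fit for SIMD: by construction, they apply one operation to every element. For example:

import numpy as np

array = np.array([1, 2, 3, 4], dtype=np.int32)
array += 100  # the same operation applied to all items, like the C loop above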

You can use SIMD instructions explicitly, but Numba doesn’t support that, and doing so usually makes your code CPU-specific. ARM CPUs, as used in newer Macs, have different SIMD instructions than x86-64 CPUs, and even different x86-64 models support different sets of SIMD instructions.

Alternatively, you can hope your compiler is smart enough to use SIMD automatically in relevant places. In Numba’s case, because it compiles on the fly for your computer, it can use whatever SIMD instructions your CPU happens to support.
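
If you want to check what Numba actually generated, you can dump the compiled assembly with its inspect_asm() API and look for vector registers (xmm, ymm, or zmm on x86-64); it’s a crude heuristic, but a workable one:

for signature, asm in remove_noise_numba_1.inspect_asm().items():
    print(signature)
    print(asm)  # search the output for xmm/ymm/zmm registers as a hint of SIMD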

Helping the compiler: linear code is fast code

Whether it’s automatic CPU instruction parallelism or SIMD-based parallelism, if statements and other conditionals are a problem.

  • For instruction parallelism, conditional code means the CPU will need to guess what happens next, and it might guess wrong.
  • For SIMD, these instructions typically want to be doing the same thing to all items in a small array. A conditional makes it a lot harder to do the same thing.

If we want to make our code faster, we want to try to make it as linear as possible, without if statements or other branching conditions. This will both make it more likely the compiler can identify opportunities for using SIMD, and make it easier for the CPU to run the resulting instructions in parallel.

Inspecting our code with perf

We’ve learned that conditional code makes it harder for the compiler to write fast code, and makes it harder for the CPU to run the resulting code quickly. Here’s our current Numba code:

@njit
def remove_noise_numba_1(arr, noise_level):
    noise_level = arr.dtype.type(noise_level)
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            if arr[i, j] < noise_level:
                arr[i, j] = 0

There’s definitely a conditional there! The code is sometimes writing to the array, and sometimes not, depending on the comparison.

We can measure some of this impact using the perf tool on Linux. The following context manager attaches perf to the current process and collects CPU statistics for the interval it’s attached:

from subprocess import Popen
from contextlib import contextmanager
from os import getpid
from time import sleep
from signal import SIGINT

@contextmanager
def perf_stat():
    # Attach "perf stat" to the current process:
    p = Popen(["perf", "stat", "-p", str(getpid())])
    # Give perf a moment to attach before the measured code runs:
    sleep(0.5)
    yield
    # SIGINT tells perf to detach and print its statistics:
    p.send_signal(SIGINT)
    p.wait()

We can run this on our code:

image2 = image.copy()
with perf_stat():
    remove_noise_numba_1(image2, 1000)

The result is a lot of information, but for our purposes we’ll look at two numbers: percentage of bad speculation and percentage of branch misprediction.

...
52.8% Bad Speculation
52.8% Branch Mispredict
...

That’s a lot of misprediction! And it’s not surprising once we think about what our algorithm is doing: comparing a bunch of noisy numbers against a value and seeing if they are higher or lower than a threshold. That’s very hard to predict for messy images.

Attempt #2: Getting rid of conditionals

We want to get rid of our conditional. One way to do so is to come up with an equivalent calculation that always runs the same code, without branching or conditionals. A useful keyword here is “branchless” code, and a little searching brought me (via Wikipedia) to a useful page on branchless saturating arithmetic. That is, we want to subtract a value, and if the result would go below zero, clamp it to zero instead of wrapping around, which is the default behavior for unsigned integers.
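
As a reminder of the wraparound behavior we’re fighting, here’s what unsigned 16-bit subtraction does in NumPy when the true result would be negative:

a = np.array([5], dtype=np.uint16)
print(a - np.uint16(10))  # prints [65531]: the value wraps instead of going negative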

After a few annoying minutes of debugging, I came up with the following code. It gives the same results, without any branching. Unfortunately, it’s much harder to understand: it relies on a bunch of tricks related to the way integers are represented in CPU memory.

@njit
def remove_noise_numba_2(arr, noise_level):
    noise_level = arr.dtype.type(noise_level)
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            # Unsigned subtraction wraps around when arr[i, j] < noise_level:
            res = arr[i, j] - noise_level
            # All-ones mask if we did not wrap, all-zeros mask if we did:
            mask_to_zero_if_wrapped = -(res <= arr[i, j])
            # No wrap: add noise_level back, restoring the original value.
            # Wrap: both terms are masked to zero, so the pixel becomes 0.
            res = (res & mask_to_zero_if_wrapped) + (
                   noise_level & mask_to_zero_if_wrapped)
            arr[i, j] = res

remove_noise_numba_2(image.copy(), 1000)
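
Branchless rewrites like this are easy to get subtly wrong, so it’s worth verifying that the new version matches the old one; a quick sanity check, reusing the rng from earlier:

a = rng.integers(0, high=2**16, size=(512, 512), dtype=np.uint16)
b = a.copy()
remove_noise_numba_1(a, 1000)
remove_noise_numba_2(b, 1000)
assert np.array_equal(a, b)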

If CPUs really ran one instruction at a time, this would likely be slower than our previous attempt: we’re doing writes on all entries, for example. But that is not how CPUs work these days, and so this version is much faster.

In fact, we’re down to just 4ms.

And if we measure it with perf, we can see that we’re no longer suffering from branch mispredictions:

...
2.4% Bad Speculation
2.4% Branch Mispredict
...

Attempt #3: Helping the compiler help us

We’ve made our code much faster by getting rid of conditionals—at the cost of obscure, ugly, and hard-to-maintain code. Can we do better?

Mathematically, our two versions are the same, and compilers are certainly very good at turning inefficient math expressions into efficient ones. We can also assume compilers will try to avoid conditionals when it’s obvious they will slow down the CPU.

So why couldn’t the compiler do so in this case? Let’s look at the inefficient version again:

@njit
def remove_noise_numba_1(arr, noise_level):
    noise_level = arr.dtype.type(noise_level)
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            if arr[i, j] < noise_level:
                arr[i, j] = 0

The problem is that we’re only doing conditional writes: for numbers higher than the noise level, we don’t do anything. My mental model of compiler authors is that while they’re happy to remove unnecessary memory reads, they will conservatively refuse to add additional memory writes that you didn’t ask for.

So what if we wrote to memory in both cases, but otherwise just kept a conditional?

@njit
def remove_noise_numba_3(arr, noise_level):
    noise_level = arr.dtype.type(noise_level)
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            arr[i, j] = (
                arr[i, j] if arr[i, j] >= noise_level
                else 0
            )

remove_noise_numba_3(image.copy(), 1000)

Yes, there’s still a conditional, but it’s a conditional inside a fairly simple calculation, and it no longer controls whether we write to memory: we write to the same address on both sides of the condition. A smart compiler can notice that there’s an equivalent mathematical expression that doesn’t require a conditional, and switch to using it if it thinks that will speed up the code.

When we run this version the runtime is 2ms, the fastest version yet! Whatever expression the compiler found is even more efficient than the hacky manual bit twiddling we did above.

What we’ve learned, and further discussion

Here’s the performance of all the versions we’ve considered:

Version                           Runtime (lower is better)
NumPy                             48 ms
Numba #0, original translation    46 ms
Numba #1, conditional writes      33 ms
Numba #2, ugly bit twiddling       4 ms
Numba #3, unconditional writes     2 ms

We’ve managed to get code that is 25× faster, but still quite readable! We did so by:

  1. Switching to Numba, so we have more control over the execution.
  2. Making sure all the types are explicitly specified and the same (np.uint16 in this case).
  3. Removing conditional writes; this allowed the compiler to switch to a linear implementation without branches.

There was no threading or multiprocessing involved: this is all single-threaded code on a single CPU core.

Results may vary based on compiler version and CPU

On the one hand, all compilers try to generate code that will run efficiently on the target CPU, so similar approaches will likely work in other compiled languages, not just Numba. In fact, this article was partially inspired by an article by Alex “matklad” Kladov, which uses Rust.

That being said, different compiler versions and implementations optimize differently. Code that gets automatically optimized by one version might regress with another; if speed is critical, it’s important to have benchmarks so you can catch regressions before they’re deployed.

For ahead-of-time compilation, as used by C, Rust, Cython, and so on, the compiler typically targets a subset of available CPU features so the compiled code stays portable across CPU models. If you know which hardware you’ll be running on, you can instead target your current CPU, or a specific set of features, to maximize performance. Numba compiles just-in-time, on your computer, so it always targets the specific capabilities of that computer’s CPU. That means different computers can give different results.

Case in point: reader Nadav Horesh tested the code from this article on a Xeon CPU that supports AVX-512 SIMD instructions, unlike my i7-12700K. On his particular setup Numba was able to get a massive speedup on the initial version, remove_noise_numba_0(), compared to NumPy, with no further changes needed. This speedup appeared to be the result of Numba using AVX-512 SIMD instructions, instead of the branch-based machine code generated on my computer.

Which of the Numba-based variations ended up using SIMD?

For floating-point SIMD, my computer’s Intel CPU can report how many SIMD instructions were run (accessible e.g. via perf stat), but the equivalent counter isn’t available for the integer SIMD we’re using in this example. However, you can get SIMD vectorization diagnostics out of Numba’s just-in-time compilation by running this before compiling your functions:

import llvmlite.binding as llvm
llvm.set_option('', '--debug-only=loop-vectorize')

As far as I can tell, all the Numba-based attempts used some level of SIMD. But SIMD is just one way to speed up your code; there are other bottlenecks too, and version 1 (on my computer, at least) involved lots of branching.

Learning more