Faster Python calculations with Numba: 2 lines of code, 13× speed-up

Python is a slow language, so computation is best delegated to code written in something faster. You can do this with existing libraries like NumPy and SciPy, but what happens when you need to implement a new algorithm, and you don’t want to write code in a lower-level language?

For certain types of computation, in particular array-focused code, the Numba library can significantly speed up your code. Sometimes you’ll need to tweak it a bit, sometimes it’ll just work with no changes. And when it works, it’s a very transparent speed fix.

In this article we’ll cover:

  • Why using NumPy on its own is sometimes not enough.
  • The basics of using Numba.
  • How Numba works, at a high level, and the difference that makes to how your code runs.

When NumPy doesn’t help

Let’s say you have a very large array, and you want to calculate the monotonically increasing version: values can go up, but never down. For example:

[1, 2, 1, 3, 3, 5, 4, 6] → [1, 2, 2, 3, 3, 5, 5, 6]

Here’s a straightforward in-place implementation:

def monotonically_increasing(a):
    max_value = 0
    for i in range(len(a)):
        if a[i] > max_value:
            max_value = a[i]
        a[i] = max_value

There’s a problem, though. NumPy is fast because it can do all its calculations without calling back into Python. Since this function involves looping in Python, we lose all the performance benefits of using NumPy.

For a 10,000,000-entry NumPy array, this function takes 2.5 seconds to run on my computer. Can we do better?
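A measurement like the one above can be reproduced with a sketch along these lines (times will vary by machine; a smaller 1,000,000-entry array is used here to keep the run short):

```python
import time

import numpy as np

# The same in-place implementation from above:
def monotonically_increasing(a):
    max_value = 0
    for i in range(len(a)):
        if a[i] > max_value:
            max_value = a[i]
        a[i] = max_value

a = np.random.randint(0, 1000, size=1_000_000)
start = time.perf_counter()
monotonically_increasing(a)
print(f"Elapsed: {time.perf_counter() - start:.2f}s")
```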

Numba can speed things up

Numba is a just-in-time compiler for Python specifically focused on code that runs in loops over NumPy arrays. Exactly what we need!

All we have to do is add two lines of code:

from numba import njit

@njit
def monotonically_increasing(a):
    max_value = 0
    for i in range(len(a)):
        if a[i] > max_value:
            max_value = a[i]
        a[i] = max_value

This runs in 0.19 seconds, about 13× faster; not bad for just reusing the same code!

Of course, it turns out that NumPy has a function that will do this already, numpy.maximum.accumulate. Using that, running only takes 0.03 seconds.
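Applied to the small array from earlier, it produces the same result in a single call:

```python
import numpy as np

a = np.array([1, 2, 1, 3, 3, 5, 4, 6])
# Running maximum: each element becomes the largest value seen so far.
result = np.maximum.accumulate(a)
print(result)  # [1 2 2 3 3 5 5 6]
```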

                        Runtime
Python for loop         2560ms
Numba for loop           190ms
np.maximum.accumulate     30ms

Introducing Numba

When you can find a NumPy or SciPy function that does what you want, problem solved. But what if numpy.maximum.accumulate hadn’t existed? At that point the other option if you wanted a fast result would be to write some low-level code, but that means switching programming languages, a more complex build system, and more complexity in general.

With Numba you can:

  • Run the same code both in normal Python, and in a faster compiled version, from inside the normal interpreter runtime.
  • Easily and quickly iterate on algorithms.

Numba parses the code, and then compiles it in a just-in-time manner depending on the inputs. You will get different versions of the code depending on whether the input is an array of u64 vs an array of floats, for example.

Numba can also target runtimes other than CPUs; you can run the code on a GPU, for example. Additionally, the above example is just the most minimal usage pattern for Numba; it has many more features covered in the documentation.

Some limitations of Numba

The one-time cost of just-in-time compilation

The first time you call a function decorated with Numba, it will need to generate the appropriate machine code, which can take some time. For example, we can use IPython’s %time command to measure how long it takes to run a Numba-decorated function:

In [1]: from numba import njit

In [2]: @njit
   ...: def add(a, b): return a + b

In [3]: %time add(1, 2)
CPU times: user 320 ms, sys: 117 ms, total: 437 ms
Wall time: 207 ms

In [4]: %time add(1, 2)
CPU times: user 17 µs, sys: 0 ns, total: 17 µs
Wall time: 24.3 µs

In [5]: %time add(1, 2)
CPU times: user 8 µs, sys: 2 µs, total: 10 µs
Wall time: 13.6 µs

The first call is extremely slow (milliseconds rather than microseconds) because it needs to compile the code; after that it runs quickly.

This is a one-time cost, per type of input. For example, we’ll have to pay it again if we pass in floats:

In [8]: %time add(1.5, 2.5)
CPU times: user 40.3 ms, sys: 1.14 ms, total: 41.5 ms
Wall time: 41 ms

In [9]: %time add(1.5, 2.5)
CPU times: user 16 µs, sys: 3 µs, total: 19 µs
Wall time: 26 µs

There’s no reason to use Numba just to add two numbers, but precisely because the function does so little work, it’s a good demonstration of the unavoidable one-time compilation overhead.

A different implementation of Python and NumPy

Numba re-implements a subset of Python, and a subset of the NumPy APIs. This leads to potential issues:

  • Some features aren’t supported, both in the Python language and in NumPy APIs.
  • Because Numba re-implements NumPy’s APIs from scratch, you may get:
    • Different performance behavior, due to using a different algorithm.
    • Potentially, different results due to bugs.

In addition, when Numba fails to compile some code, my experience is that the error messages can often be difficult to understand.

Numba vs. alternatives

How does Numba compare to other options?

  • If you can use fast NumPy or SciPy APIs exclusively, you can write your code in Python and still get the speed of a low-level compiled language. Sometimes, though, you’ll need to fall back to slow Python for loops, which lose much of that performance benefit.
  • You can write code in a low-level language directly; this means you can optimize all your code paths, but you are now leaving Python for a different language.
  • With Numba, you can get fast code from regular Python for loops, but you’re limited in which language features and NumPy APIs you can use.

The nicest thing about Numba is how easy it is to try out. So whenever you have a slow for loop doing some math, give Numba a spin; with any luck it’ll speed things up with just two lines of code.