Faster Python calculations with Numba: 2 lines of code, 13× speed-up
Python is a slow language, so computation is best delegated to code written in something faster. You can do this with existing libraries like NumPy and SciPy, but what happens when you need to implement a new algorithm, and you don’t want to write code in a lower-level language?
For certain types of computation, in particular array-focused code, the Numba library can significantly speed up your code. Sometimes you’ll need to tweak it a bit, sometimes it’ll just work with no changes. And when it works, it’s a very transparent speed fix.
In this article we’ll cover:
- Why using NumPy on its own is sometimes not enough.
- The basics of using Numba.
- How Numba works, at a high-level, and the difference that makes to how your code runs.
When NumPy doesn’t help
Let’s say you have a very large array, and you want to calculate the monotonically increasing version: values can go up, but never down. For example:
[1, 2, 1, 3, 3, 5, 4, 6] → [1, 2, 2, 3, 3, 5, 5, 6]
Here’s a straightforward in-place implementation:
```python
def monotonically_increasing(a):
    max_value = 0
    for i in range(len(a)):
        if a[i] > max_value:
            max_value = a[i]
        a[i] = max_value
```
There’s a problem, though. NumPy is fast because it can do all its calculations without calling back into Python. Since this function involves looping in Python, we lose all the performance benefits of using NumPy.
For a 10,000,000-entry NumPy array, this function takes 2.5 seconds to run on my computer. Can we do better?
Numba can speed things up
Numba is a just-in-time compiler for Python specifically focused on code that runs in loops over NumPy arrays. Exactly what we need!
All we have to do is add two lines of code:
```python
from numba import njit

@njit
def monotonically_increasing(a):
    max_value = 0
    for i in range(len(a)):
        if a[i] > max_value:
            max_value = a[i]
        a[i] = max_value
```
This runs in 0.19 seconds, about 13× faster; not bad for just reusing the same code!
Of course, it turns out that NumPy already has a function that does this: numpy.maximum.accumulate. Using that, the calculation takes only 0.03 seconds.
When you can find a NumPy or SciPy function that does what you want, problem solved.
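For reference, here is numpy.maximum.accumulate, which computes a running maximum, applied to the earlier example:

```python
import numpy as np

a = np.array([1, 2, 1, 3, 3, 5, 4, 6])
result = np.maximum.accumulate(a)
print(result)  # → [1 2 2 3 3 5 5 6]
```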
But what if numpy.maximum.accumulate hadn’t existed?
At that point the other option if you wanted a fast result would be to write some low-level code, but that means switching programming languages, a more complex build system, and more complexity in general.
With Numba you can:
- Run the same code both in normal Python, and in a faster compiled version, from inside the normal interpreter runtime.
- Easily and quickly iterate on algorithms.
Numba parses the code, and then compiles it in a just-in-time manner depending on the inputs.
You will get different versions of the code depending on whether the input is, for example, an array of u64 or an array of floats.
Numba can also target runtimes other than CPUs; you can run the code on a GPU, for example. Additionally, the above example is just the most minimal usage pattern for Numba; it has many more features covered in the documentation.
Note: Whether or not any particular tool or technique will speed things up depends on where the bottlenecks are in your software.
Need to identify the performance and memory bottlenecks in your own Python data processing code? Try the Sciagraph profiler, with support for profiling both in development and production.
Some limitations of Numba
The one-time cost of just-in-time compilation
The first time you call a function decorated with Numba, it will need to generate the appropriate machine code, which can take some time.
For example, we can use IPython’s %time command to measure how long it takes to run a Numba-decorated function:
```
In [1]: from numba import njit

In [2]: @njit
   ...: def add(a, b):
   ...:     a + b
   ...:

In [3]: %time add(1, 2)
CPU times: user 320 ms, sys: 117 ms, total: 437 ms
Wall time: 207 ms

In [4]: %time add(1, 2)
CPU times: user 17 µs, sys: 0 ns, total: 17 µs
Wall time: 24.3 µs

In [5]: %time add(1, 2)
CPU times: user 8 µs, sys: 2 µs, total: 10 µs
Wall time: 13.6 µs
```
The first call is extremely slow (milliseconds rather than microseconds) because it needs to compile the code; after that it runs quickly.
This is a one-time cost, per type of input. For example, we’ll have to pay it again if we pass in floats:
```
In [6]: %time add(1.5, 2.5)
CPU times: user 40.3 ms, sys: 1.14 ms, total: 41.5 ms
Wall time: 41 ms

In [7]: %time add(1.5, 2.5)
CPU times: user 16 µs, sys: 3 µs, total: 19 µs
Wall time: 26 µs
```
There’s no reason to use Numba to add two numbers, but given the function does so little, this is a good demonstration of the unavoidable one-time compilation overhead.
A different implementation of Python and NumPy
Numba re-implements a subset of Python, and a subset of the NumPy APIs. This leads to potential issues:
- Some features aren’t supported, both in the Python language and in NumPy APIs.
- Because Numba re-implements NumPy’s APIs from scratch, you may get:
  - Different performance behavior, due to using a different algorithm.
  - Potentially, different results due to bugs.
In addition, when Numba fails to compile some code, my experience is that the error messages can often be difficult to understand.
Numba vs. alternatives
How does Numba compare to other options?
- If you can use fast NumPy or SciPy APIs exclusively, you can write your code in Python and still get the speed of a low-level compiled language. Sometimes, though, you will need to fall back to slow Python for loops, which lose much of that performance benefit.
- You can write code in a low-level language directly; this means you can optimize all your code paths, but you are now leaving Python for a different language.
- With Numba, you can get fast code from regular Python for loops, but you’re limited in which language features and NumPy APIs you can use.
The nicest thing about Numba is how easy it is to try out.
So whenever you have a slow for loop doing some math, give Numba a spin; with any luck it’ll speed things up with just two lines of code.