The wrong way to speed up your code with Numba
If your NumPy-based code is too slow, you can sometimes use Numba to speed it up. Numba is a compiled language that uses the same syntax as Python, and it compiles at runtime, so it’s very easy to write. And because it re-implements a large part of the NumPy APIs, it can also easily be used with existing NumPy-based code.
However, Numba’s NumPy support can be a trap: it can lead you to miss huge optimization opportunities by sticking to NumPy-style code. So in this article we’ll show an example of:
- The wrong way to use Numba, writing NumPy-style full array transforms.
- The right way to use Numba, namely `for` loops.
An example: converting color images to grayscale
Consider a color image encoded with red, green, and blue channels:
```python
from skimage import io

RGB_IMAGE = io.imread("dizzymouse.jpg")
print("Shape:", RGB_IMAGE.shape)
print("dtype:", RGB_IMAGE.dtype)
print("Memory usage (bytes):", RGB_IMAGE.nbytes)
```
Here’s the output:
```
Shape: (525, 700, 3)
dtype: uint8
Memory usage (bytes): 1102500
```
And here’s what the image looks like:
We want to convert this image to grayscale. Instead of having three channels for red, green, and blue, we’ll have just one channel that measures brightness, with 0 being black and 255 being white. Here’s one simplistic way to do this transformation:
```python
import numpy as np

def tg_numpy(color_image):
    result = np.round(
        0.299 * color_image[:, :, 0] +
        0.587 * color_image[:, :, 1] +
        0.114 * color_image[:, :, 2]
    )
    return result.astype(np.uint8)

GRAYSCALE = tg_numpy(RGB_IMAGE)
```
And here’s what the resulting image looks like:
Using Numba, the wrong way
Numba lets us compile Python code to machine code, simply by adding the `@numba.jit` decorator. For NumPy APIs used in the decorated function, the resulting machine code doesn’t use the NumPy library. Instead, Numba has reimplemented these APIs in a mostly-compatible way using the Numba language.
One way we can use Numba, then, is to take our existing NumPy code, and just add a decorator:
```python
from numba import jit

@jit
def tg_numba(color_image):
    result = np.round(
        0.299 * color_image[:, :, 0] +
        0.587 * color_image[:, :, 1] +
        0.114 * color_image[:, :, 2]
    )
    return result.astype(np.uint8)

GRAYSCALE2 = tg_numba(RGB_IMAGE)
assert np.array_equal(GRAYSCALE, GRAYSCALE2)
```
Is this any faster? Let’s see:
| Code | Elapsed microseconds | Peak allocated memory (bytes) |
|---|---|---|
| `tg_numpy(RGB_IMAGE)` | 2,712 | 6,021,410 |
| `tg_numba(RGB_IMAGE)` | 2,446 | 5,889,234 |
So it is faster, but only a little. This isn’t surprising: NumPy internally is also implemented in a compiled language, so individual operations on arrays are already quite optimized.
It’s also worth noticing the memory usage. Our original image is 1.1MB, and we’re allocating around 6MB to transform it to a grayscale image. This is because we have up to two temporary floating point arrays at any given time. Since `float64` uses 8× as much memory as `uint8`, this adds up to quite a bit of memory.
And since we’re using the same algorithm as the original NumPy code, complete with temporary arrays, we have the same problem with allocated memory being 6× the size of the input image.
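That ~6MB peak can be sanity-checked with some back-of-the-envelope arithmetic, using the image dimensions from the example above: each `float64` temporary for a 525×700 image takes 8 bytes per pixel, and the weighted-sum expression keeps roughly two such arrays alive at once:

```python
height, width = 525, 700             # dimensions of the example image
input_bytes = height * width * 3     # uint8 RGB input: 1 byte per channel
temp_bytes = height * width * 8      # one float64 temporary array
peak_estimate = 2 * temp_bytes       # ~two float64 temporaries alive at once

print(input_bytes)    # 1102500
print(temp_bytes)     # 2940000
print(peak_estimate)  # 5880000, close to the measured ~5.9–6.0 MB
```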
Using Numba, the right way
Our current code creates temporary floating point arrays, and then multiplies and adds them. But there really is no reason to have a whole temporary floating point array; that’s a result of the limits of how NumPy works. It needs to operate on whole arrays (so-called “vectorization”) so that it doesn’t use slow Python code. From an algorithm perspective, we can convert each pixel individually.
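In plain Python, that per-pixel algorithm would look like the sketch below (`tg_python_loops` is a name I’m introducing just for illustration). It computes the same result with no full-size temporary arrays, but without compilation every pixel access goes through the interpreter, which makes it far too slow for real images:

```python
import numpy as np

def tg_python_loops(color_image):
    # Per-pixel grayscale conversion: no full-size temporary arrays,
    # but each iteration pays Python interpreter overhead.
    height, width = color_image.shape[:2]
    result = np.empty((height, width), dtype=np.uint8)
    for y in range(height):
        for x in range(width):
            r, g, b = color_image[y, x]
            result[y, x] = round(0.299 * r + 0.587 * g + 0.114 * b)
    return result

# Tiny demo image: one red pixel, one green pixel.
tiny = np.array([[[255, 0, 0], [0, 255, 0]]], dtype=np.uint8)
print(tg_python_loops(tiny))  # [[ 76 150]]
```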
Numba doesn’t have the same limits as NumPy and normal Python: you can use `for` loops in Numba and your code will still run quickly. So in this case we can use a `for` loop to operate pixel by pixel, at the very least reducing the memory allocations in our function. Let’s try that out:
```python
@jit
def tg_numba_for_loop(color_image):
    result = np.empty(color_image.shape[:2], dtype=np.uint8)
    for y in range(color_image.shape[0]):
        for x in range(color_image.shape[1]):
            r, g, b = color_image[y, x, :]
            result[y, x] = np.round(
                0.299 * r + 0.587 * g + 0.114 * b
            )
    return result

GRAYSCALE3 = tg_numba_for_loop(RGB_IMAGE)
assert np.array_equal(GRAYSCALE, GRAYSCALE3)
```
And here’s the performance and memory usage:
| Code | Elapsed microseconds | Peak allocated memory (bytes) |
|---|---|---|
| `tg_numpy(RGB_IMAGE)` | 2,724 | 6,021,410 |
| `tg_numba(RGB_IMAGE)` | 2,440 | 5,889,234 |
| `tg_numba_for_loop(RGB_IMAGE)` | 536 | 376,733 |
By using Numba the right way, our code is both 5× faster and far more memory efficient.
Why is this version faster? It’s not about the number of CPU instructions: `tg_numba_for_loop` runs 9 million CPU instructions, vs. 15 million for the NumPy version, a gap nowhere near large enough to explain a 5× difference in performance. If you want a start at understanding what else is going on here, check out my upcoming book on speeding up low-level code.
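The exact numbers in the tables above came from my benchmarking setup, but you can get comparable elapsed-time measurements with a stdlib `timeit` harness. Here’s a minimal sketch for the NumPy version only, so it runs without Numba installed, using a synthetic random image as a stand-in for `dizzymouse.jpg`:

```python
import timeit
import numpy as np

def tg_numpy(color_image):
    # Same full-array implementation as above.
    result = np.round(
        0.299 * color_image[:, :, 0] +
        0.587 * color_image[:, :, 1] +
        0.114 * color_image[:, :, 2]
    )
    return result.astype(np.uint8)

# Synthetic stand-in for the real image: same shape and dtype.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(525, 700, 3), dtype=np.uint8)

# Best of several repeats, reported as microseconds per call.
runs = timeit.repeat(lambda: tg_numpy(image), number=10, repeat=3)
per_call_us = min(runs) / 10 * 1e6
print(f"{per_call_us:.0f} microseconds per call")
```

When timing a Numba-decorated function the same way, remember that the first call includes JIT compilation, so warm the function up with one call before measuring.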
Software architecture as a performance constraint
You can speed up your code at multiple levels. In this particular case, we used a particularly powerful approach: switching to a better software architecture.
In particular, NumPy’s full-array paradigm puts hard limits on how you can implement your code. By switching to a compiled language where `for` loops are fast, you have far more options for how you structure your algorithm. As you can see, this lets you reduce memory usage, enables implementing algorithms that would be impossible with just NumPy, and often lets you significantly speed up your code.