Speeding up Cython with SIMD

Cython allows you to write compiled extensions for Python, by translating Python-y code to C or C++. Often you’ll use it to speed up your software, and it’s especially useful for implementing small data science or scientific computing algorithms.

But what happens when Cython is too slow? Often there are still speed improvements you can make. In a previous article we focused on examples of optimizing your code to take advantage of things like instruction-level parallelism.

In this article, we’ll focus on another CPU feature, Single Instruction Multiple Data or SIMD, specifically in the context of Cython. As we’ll see, in some situations using SIMD requires only minimal changes to your code.

Parts of this article are excerpted from a book I’m working on that will help teach you how to optimize low-level code, the kind of code you’d write with Cython, C, or Rust. The goal is to help data scientists and scientists who normally write Python to understand how to make their compiled code faster.

What is SIMD?

Single Instruction Multiple Data, or SIMD, refers to CPU instructions that can execute the same operation on a sequence of primitive values (integers or floats) stored next to each other in memory, using a single instruction. The compiler needs to explicitly generate these special instructions, which as we’ll see can make it harder to utilize this functionality.

For example, if we want to multiply four consecutive integers in an array by the same constant, we can do that with a single SIMD CPU instruction instead of with 4 normal CPU instructions. Reducing the number of CPU instructions by a factor of 4 can lead to a significant speedup!
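To make the idea concrete, here’s a rough Python sketch of the difference. This is purely illustrative—real SIMD happens inside CPU registers, not in Python—but it shows the shape of the transformation: instead of one operation per value, each chunk of four values stands in for a single 128-bit SIMD multiply.

```python
# A scalar loop: one multiply per iteration, i.e. one "instruction" per value.
def scale_scalar(values, c):
    return [v * c for v in values]

# SIMD-style processing: each chunk of `lanes` values is handled as one
# unit, standing in for a single SIMD instruction operating on 4 lanes.
def scale_simd_style(values, c, lanes=4):
    out = []
    for i in range(0, len(values), lanes):
        chunk = values[i:i + lanes]       # "load" 4 adjacent values
        out.extend(v * c for v in chunk)  # one "instruction" for the chunk
    return out
```

Both functions compute the same result; the second just makes the 4-at-a-time grouping explicit.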

Three ways to use SIMD

If you want to use SIMD, there are different ways you can get the compiler to generate these specialized CPU instructions:

  1. Intrinsics: You directly tell the compiler to use specific CPU instructions, using special compiler features called “intrinsics”. You can do this with Cython, but it’s not portable. For example, x86-64 CPU instructions are different than ARM CPU instructions.
  2. Auto-vectorization: Compilers can generate SIMD instructions automatically, especially if you structure your code appropriately. Because this relies on compiler heuristics you have less control, and different compiler versions might give different results. (Don’t confuse this with the Python meaning of “vectorized”; in the context of low-level languages like C, “vectorized” is another word for SIMD.)
  3. SIMD libraries: C++ has libraries available providing data structures that will use SIMD for their mathematical operations. This gives you more abstraction than intrinsics, while being more reliable than auto-vectorization. Cython can probably use these, but we won’t be covering them here.

In this article we’ll focus on using auto-vectorization: getting the compiler to generate SIMD instructions for you.

Our starting code

We’ll start with the following Cython extension:

import numpy as np
cimport numpy as cnp

def average_arrays_1(
    cnp.float32_t[:] arr1,
    cnp.float32_t[:] arr2
):
    out = np.empty((len(arr1), ), dtype=np.float32)
    cdef cnp.float32_t[:] out_view = out
    for i in range(len(arr1)):
        out_view[i] = (arr1[i] + arr2[i]) / 2
    return out

The cnp.float32_t[:] declarations use Cython’s memoryview API; we’re using them on all arrays so that Cython has an easier time translating the code to C. This ensures we’re using fast low-level operations instead of generic “index some random Python object” operations that would be no faster than normal Python.

We want to measure the speed of our function in two situations:

  • Two arrays where all the memory is contiguous, linear memory.
  • Two views of arrays, where we’re grabbing only every 16th item, using a NumPy feature called “strides”. The entries being averaged will not be next to each other in memory; the code will have to skip ahead 16 items at a time.

To give an example of what I mean in the latter case, in the following snippet we grab every 2nd item. The object we create is a view, meaning we’re not copying the memory. Notice that if we mutate the view, it affects the original array:

>>> import numpy as np
>>> arr = np.array([0, 1, 2, 3, 4, 5, 6])
>>> view = arr[::2]
>>> view
array([0, 2, 4, 6])
>>> view[1] = 100
>>> arr
array([  0,   1, 100,   3,   4,   5,   6])

Here’s our benchmark script:

from time import time
import numpy as np
import sys
import simd_example

# Use the first command-line argument to choose which
# function to call:
average_arrays = getattr(simd_example, sys.argv[1])


def timeit(title, arr1, arr2):
    start = time()
    for i in range(1_000):
        out = average_arrays(arr1, arr2)
    elapsed = (time() - start) / 1_000
    print(f"Time per call, {title}: {elapsed * 1000:.2} ms")


# Arrays laid out linearly in memory:
ARR1 = np.random.random((1_000_000,)).astype(np.float32)
ARR2 = np.random.random((1_000_000,)).astype(np.float32)
timeit("contiguous", ARR1, ARR2)

# Arrays where we grab every 16th item:
ARR3 = np.random.random((16_000_000,)).astype(np.float32)
ARR4 = np.random.random((16_000_000,)).astype(np.float32)
timeit("strided", ARR3[::16], ARR4[::16])

We’re not using SIMD!

On Intel CPUs, if you’re using floating-point operations you can check whether SIMD is being used by inspecting CPU counters; on Linux these are exposed by the perf program. (For integer operations there are more difficult ways of checking, like reading the generated assembly, but we’ll do it the easy way.) My CPU doesn’t support 512-bit SIMD, so we’re only looking at 128- and 256-bit SIMD, like this:

$ perf stat -e instructions:u,fp_arith_inst_retired.128b_packed_single:u,fp_arith_inst_retired.256b_packed_single:u -- python benchmark.py average_arrays_1
Time per call, contiguous: 0.57 ms
Time per call, strided: 4.2 ms

 Performance counter stats for 'python benchmark.py average_arrays_1':

    31,773,910,107      cpu_core/instructions:u/
       459,588,037      cpu_atom/instructions:u/
                 0      cpu_core/fp_arith_inst_retired.128b_packed_single:u/
                 0      cpu_core/fp_arith_inst_retired.256b_packed_single:u/

The “_single” suffix means 32-bit floats; for 64-bit floats we’d want the “_double” suffix. It looks like there were no “packed” (i.e. SIMD) operations done at all. Let’s fix that.

Step 1. Enabling auto-vectorization in the compiler

Why no SIMD? Recall that to use SIMD the compiler has to generate special instructions. Specifically:

  1. We want to use auto-vectorization, where the compiler does this automatically.
  2. Which means the compiler has to successfully use its heuristics to generate SIMD instructions.

As it turns out, by default on Linux Python uses the -O2 compiler optimization level for extensions, which doesn’t even try to do auto-vectorization. So we’ll want to switch to -O3 optimization level to turn this and other optimizations on.

You can do this in the setup.py config for Cython, but we’ll just set an environment variable and recompile:

$ export CFLAGS="-O3"
$ pip install .
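If you’d rather set the flag in setup.py, here’s a sketch of what that could look like. The module name and source path follow the layout used elsewhere in this article; extra_compile_args is the standard setuptools mechanism for per-extension compiler flags.

```python
# Sketch of setting -O3 via setup.py instead of an environment variable.
from setuptools import Extension

extensions = [
    Extension(
        "simd_example._extension",
        sources=["src/simd_example/_extension.pyx"],
        extra_compile_args=["-O3"],  # enable auto-vectorization and more
    )
]

# In the real setup.py you would then pass this through Cython, e.g.:
# from Cython.Build import cythonize
# setup(ext_modules=cythonize(extensions))
```

Note that flags set this way are appended to whatever the platform defaults are, so this works even without touching CFLAGS.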

Unfortunately, if we rerun our benchmark we get the same results as above. This suggests the compiler heuristics are failing to figure out how to use SIMD with the code.

Step 2. Disabling bounds checks and other implicit conditionals

Recall that SIMD does the same operation on a series of values. For example, we’re hoping the compiler will decide to use SIMD instructions to add one set of float32s to another set of float32s with a single CPU instruction.

Sometimes the compiler won’t be able to use SIMD instructions. For our purposes, if you have branches—if statements and other conditional control flow—that makes it harder to do the same operation with a single instruction. If the branch decides to exit the loop on the first item, how could a single CPU instruction handle the remaining items?

Our code doesn’t have any explicit conditionals, so that’s not a problem. It does, however, have some implicit conditionals: bounds checking and index wrapping.

  1. In order to prevent reading from (or writing to!) invalid memory addresses, Cython adds a bounds check to each array lookup to make sure you haven’t gone too far. It raises an exception if the index is out of bounds—in other words, it’s a branch.
  2. In order to support indexing with negative numbers, the way Python does, Cython also adds branches.

On a good day the compiler might optimize some of these away, but in this case we probably will have to disable them. This means we have to be extra careful, because the compiler isn’t helping us catch bugs anymore!
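To see concretely what these implicit branches guard against, here’s the Python-level behavior they implement, illustrated with NumPy (this is an illustration of the semantics, not the article’s extension code):

```python
import numpy as np

arr = np.array([10.0, 20.0, 30.0], dtype=np.float32)

# Index wrapping: negative indexes count from the end, Python-style.
# This is the branch that @cython.wraparound(False) removes.
assert arr[-1] == 30.0

# Bounds checking: an out-of-range index raises IndexError instead of
# touching invalid memory. @cython.boundscheck(False) removes this check,
# so the same mistake in compiled Cython code would read garbage or crash.
try:
    arr[3]
    assert False, "expected IndexError"
except IndexError:
    pass
```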

Note: Turning off bounds checking when your indexes might be invalid is a terrible idea. It’s a recipe for crashes, security problems, and even worse: silent failures where you get wrong results without knowing it.

In Rust, people have come up with ways to keep bounds checking on while giving the compiler enough hints that it can optimize the checks out of existence. Figuring out whether this is possible in Cython is something I hope to investigate.

Here’s our new version:

import numpy as np
cimport numpy as cnp
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def average_arrays_2(
    cnp.float32_t[:] arr1,
    cnp.float32_t[:] arr2
):
    if len(arr1) != len(arr2):
        raise ValueError("Arrays must be the same size")
    out = np.empty((len(arr1), ), dtype=np.float32)
    cdef cnp.float32_t[:] out_view = out
    for i in range(len(arr1)):
        out_view[i] = (arr1[i] + arr2[i]) / 2
    return out

The new version runs about 20% faster in the contiguous case, which is nice, but it’s still not using any SIMD. (See the end of the article for the actual numbers.)

Step 3. Getting rid of stride support

It’s time to inspect the C code that Cython generates from the .pyx file. We can annotate the code with the cythonize tool, and that gives us an HTML file we can open.

$ cythonize --annotate src/simd_example/_extension.pyx
$ ls src/simd_example/*.html
src/simd_example/_extension.html

The report maps Cython code to C code, and if we look at the C code generated for out_view[i] = (arr1[i] + arr2[i]) / 2 we see:

*((__pyx_t_5numpy_float32_t *) ( /* dim=0 */ (__pyx_v_out_view.data + __pyx_t_13 * __pyx_v_out_view.strides[0]) )) = (((*((__pyx_t_5numpy_float32_t *) ( /* dim=0 */ (__pyx_v_arr1.data + __pyx_t_11 * __pyx_v_arr1.strides[0]) ))) + (*((__pyx_t_5numpy_float32_t *) ( /* dim=0 */ (__pyx_v_arr2.data + __pyx_t_12 * __pyx_v_arr2.strides[0]) )))) / 2.0);

That’s a lot more complicated than our Cython code… because of strides. Cython has to handle the case where we passed in a NumPy view that doesn’t read all of its underlying memory, but instead reads only every Nth item; it can’t assume the memory is contiguous.

If we have non-contiguous memory, from the compiler’s auto-vectorization perspective using SIMD is likely to be a lot less compelling, given different items might well be very distant from each other in memory. And even if we can’t optimize that case, contiguous memory situations ought to be faster. So as a first pass, let’s tell Cython to only support contiguous memory, indicated by switching float32_t[:] to float32_t[::1]. See the Cython docs for more details on this syntax.

@cython.boundscheck(False)
@cython.wraparound(False)
def average_arrays_3(cnp.float32_t[::1] arr1, cnp.float32_t[::1] arr2):
    if len(arr1) != len(arr2):
        raise ValueError("Arrays must be the same size")
    out = np.empty((len(arr1), ), dtype=np.float32)
    cdef cnp.float32_t[::1] out_view = out
    for i in range(len(arr1)):
        out_view[i] = (arr1[i] + arr2[i]) / 2
    return out

The generated C code no longer refers to strides:

*((__pyx_t_5numpy_float32_t *) ( /* dim=0 */ ((char *) (((__pyx_t_5numpy_float32_t *) __pyx_v_out_view.data) + __pyx_t_13)) )) = (((*((__pyx_t_5numpy_float32_t *) ( /* dim=0 */ ((char *) (((__pyx_t_5numpy_float32_t *) __pyx_v_arr1.data) + __pyx_t_11)) ))) + (*((__pyx_t_5numpy_float32_t *) ( /* dim=0 */ ((char *) (((__pyx_t_5numpy_float32_t *) __pyx_v_arr2.data) + __pyx_t_12)) )))) / 2.0);

We recompile and run our benchmark:

Time per call, contiguous: 0.24 ms
Traceback (most recent call last):
  File "/home/itamarst/devel/simd-profiles/benchmark.py", line 28, in <module>
    timeit("strided", ARR3[::16], ARR4[::16])
  File "/home/itamarst/devel/simd-profiles/benchmark.py", line 14, in timeit
    out = average_arrays(arr1, arr2)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/simd_example/_extension.pyx", line 41, in simd_example._extension.average_arrays_3
    def average_arrays_3(cnp.float32_t[::1] arr1, cnp.float32_t[::1] arr2):
  File "<stringsource>", line 663, in View.MemoryView.memoryview_cwrapper
  File "<stringsource>", line 353, in View.MemoryView.memoryview.__cinit__
ValueError: ndarray is not C-contiguous

 Performance counter stats for 'python benchmark.py average_arrays_3':

     4,940,110,700      cpu_core/instructions:u/
        66,742,512      cpu_atom/instructions:u/
       500,000,000      cpu_core/fp_arith_inst_retired.128b_packed_single:u/
                 0      cpu_core/fp_arith_inst_retired.256b_packed_single:u/
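As a sanity check, that packed-op count is exactly what 4-wide SIMD predicts (this is my back-of-the-envelope arithmetic, not output from the tooling): each element needs one addition and one division (or multiplication by 0.5), across a million elements and a thousand benchmark calls, with four float32 lanes per 128-bit instruction.

```python
elements = 1_000_000   # float32s per array
calls = 1_000          # benchmark loop iterations
ops_per_element = 2    # one add, one divide
lanes = 4              # float32 lanes in a 128-bit SIMD register

packed_ops = elements * calls * ops_per_element // lanes
assert packed_ops == 500_000_000  # matches the perf counter above
```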

The good news:

  • We’re finally using SIMD!
  • We’re now twice as fast when in the contiguous memory case.

The bad news is that we can’t handle non-contiguous memory: strided views blow up. But we can fix that.
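Before fixing this properly inside the extension, note that callers can also work around it themselves: NumPy lets you check whether an array is contiguous, and copy a strided view into fresh contiguous memory (at the cost of the copy):

```python
import numpy as np

arr = np.arange(32, dtype=np.float32)
view = arr[::16]  # every 16th item: a non-contiguous, strided view

assert arr.flags["C_CONTIGUOUS"]
assert not view.flags["C_CONTIGUOUS"]

# np.ascontiguousarray() copies strided data into contiguous memory,
# and is a no-op for arrays that are already contiguous:
copied = np.ascontiguousarray(view)
assert copied.flags["C_CONTIGUOUS"]
```

Inside our extension, though, we’d rather handle both layouts transparently.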

Step 4. Add back support for strided memory

The way we’re going to solve this is by having two versions of the function:

  • One for contiguous memory, which will use SIMD and run faster.
  • The other will fall back to the more generic version, which doesn’t use SIMD. It’ll be slower, but at least our code will run.

To reduce code duplication, we’ll do this with a Cython feature called “fused types”, which is a simplified version of C++ templates or Rust generics. Essentially Cython will end up generating two versions of our function, one for each type.

# This type will be one of these two underlying types
# at compile time:
ctypedef fused float_arr_t:
    cnp.float32_t[:]
    cnp.float32_t[::1]

@cython.boundscheck(False)
@cython.wraparound(False)
cdef average_impl(
    float_arr_t arr1,
    float_arr_t arr2
):
    out = np.empty((len(arr1), ), dtype=np.float32)
    cdef cnp.float32_t[::1] out_view = out
    for i in range(len(arr1)):
        out_view[i] = (arr1[i] + arr2[i]) / 2
    return out

def average_arrays_4(arr1, arr2):
    if len(arr1) != len(arr2):
        raise ValueError("Arrays must be the same size")
    if arr1.data.contiguous and arr2.data.contiguous:
        return average_impl[cnp.float32_t[::1]](arr1, arr2)
    else:
        return average_impl[cnp.float32_t[:]](arr1, arr2)

If we compile this and run our benchmark, we get:

Time per call, contiguous: 0.24 ms
Time per call, strided: 4.0 ms

 Performance counter stats for 'python benchmark.py average_arrays_4':

    13,953,180,922      cpu_core/instructions:u/
       412,866,119      cpu_atom/instructions:u/
       500,000,000      cpu_core/fp_arith_inst_retired.128b_packed_single:u/
                 0      cpu_core/fp_arith_inst_retired.256b_packed_single:u/

Our function is using SIMD when it can, and falling back to a slower path when it can’t. And it has minimal code duplication.

Reviewing our results: the speedup and how we got it

Here’s a comparison of our different versions:

Version                      Contiguous run time (ms)   Strided run time (ms)
Baseline                     0.57                       4.2
-O3 compiler flag            0.57                       4.2
Disabled bounds checks       0.46                       4.1
SIMD for contiguous memory   0.24                       4.0

It’s not clear if the difference in run time for strided arrays is meaningful, but for contiguous arrays our code now runs twice as fast. And making these changes wasn’t particularly invasive.

To recap, we had to:

  1. Add a -O3 compiler flag. This works with the Linux and macOS compilers; on Windows the defaults might suffice, or you might need to set a similar but different flag (I haven’t checked).
  2. Carefully disable bounds and wraparound checks. This is where a good test suite is helpful—we made sure to test invalid inputs!
  3. Make sure Cython can generate the simpler code that contiguous arrays enable.

If you’re maintaining Cython code, consider making these changes! It won’t always be enough to start using SIMD, of course. This is a simple example where auto-vectorization was very likely to work, and more complex code might fail. But in many cases it will make your code faster.

If you can’t get auto-vectorization to work, you can:

  • Enable gcc reporting of missed vectorization opportunities: export CFLAGS="-O3 -fopt-info-vec-missed", then run pip install --verbose . so you can see the compiler output.
  • Try switching to the Clang compiler instead of the default gcc (export CC=clang). To get hints about the compiler’s decision making process, add -Rpass-analysis=loop-vectorize to CFLAGS.

To learn more about SIMD: