How vectorization speeds up your Python code

Python is not the fastest programming language. So when you need to process a large amount of homogeneous data quickly, you’re told to rely on “vectorization.”

This leads to more questions:

  • What does “vectorization” actually mean?
  • When does it apply?
  • How does vectorization actually make code faster?

To answer these questions, we’ll consider some interesting performance metrics, learn some useful facts about how CPUs work, and discover that NumPy developers are working hard to make your code faster.

What “vectorization” means, and when it applies

Let’s say we have a few million numbers in a list or array, and we want to do some mathematical operations on them. Since we know they are all numbers, and if we’re doing the same operation on all of the numbers, we can “vectorize” the operation, i.e. take advantage of this homogeneity of data and operation.

This definition is still vague, however. To be more specific, there are at least three possible meanings, depending on who is talking:

  1. API design: A vectorized API is designed to work on homogeneous arrays of data at once, instead of item by item in a for loop. This is orthogonal to performance, though: in theory you might have a fast for loop, or you might have a slow batch API.
  2. A batch operation implemented in a fast language: This is a Python-specific meaning, and does have a performance implication. By doing all that work in C or Rust, you can avoid calling into slow Python.
  3. An operation that takes advantage of specific CPU features: As we’ll see later on, CPUs have specialized support for doing the same operation on multiple data elements.

For our purposes we’ll mostly be focusing on the second meaning, and later on we’ll talk about the third. In all cases, however, we’re assuming we have:

  1. Simple and homogeneous data, e.g. integers or floating point numbers.
  2. More than one item of data.
  3. We’re doing the same operation on multiple data items.
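To make the first meaning concrete, here’s a minimal sketch contrasting an item-by-item Python loop with a vectorized, batch-oriented API (using NumPy for the batch version):

```python
import numpy as np

data = [1.5, 2.5, 3.5]

# Item-by-item: a Python-level loop, one element at a time.
doubled_loop = [x * 2 for x in data]

# Vectorized API: one batch operation over the whole array.
arr = np.array(data)
doubled_batch = arr * 2

assert doubled_loop == doubled_batch.tolist()
```

Both compute the same result; the difference is that the batch API expresses the operation on all the data at once.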

Speeding up code with vectorization

How exactly does vectorization help with performance?

There are three different ways we can speed up Python code by taking advantage of data homogeneity.

We’re going to use the following context manager, which relies on the Linux perf tool to measure some performance metrics for a block of Python code. It works by attaching perf to the currently running process, and then reading various metrics from the CPU when the context manager finishes.

Feel free to skip ahead and not read this particular piece of code; it might be fairly opaque depending on your previous knowledge. The important thing is the output, which will be explained in the next section. If you’re interested you can always come back and read it later.

from contextlib import contextmanager
from subprocess import Popen
from os import getpid
from signal import SIGINT
from time import sleep, time
from resource import getrusage, RUSAGE_SELF

events = [
    "instructions",
    "cache-references",
    "cache-misses",
    "avx_insts.all",
]

@contextmanager
def perf():
    """Benchmark this process with Linux's perf util.
    
    Example usage:

        with perf():
            x = run_some_code()
            more_code(x)
            all_this_code_will_be_measured()
    """
    p = Popen([
            # Run perf stat
            "perf", "stat",
            # for the current Python process
            "-p", str(getpid())
            # record the list of events mentioned above
            "-e", ",".join(events)])
    # Ensure perf has started before running more
    # Python code. This will add ~0.1 to the elapsed
    # time reported by perf, so we also track elapsed
    # time separately.
    sleep(0.1)
    start = time()
    try:
        yield
    finally:
        print(f"Elapsed (seconds): {time() - start}")
        print("Peak memory (MiB):",
            int(getrusage(RUSAGE_SELF).ru_maxrss / 1024))
        p.send_signal(SIGINT)

Part #1: Reducing CPU instructions

The first way vectorization can help is by reducing CPU instructions.

Let’s look at an example: we’re going to normalize an array of double floats (i.e. 64-bit precision) by subtracting the mean. Then we can see how much above or below each item is from the mean.

CPython is slow

Here’s a first attempt:

from benchmark import perf
from random import random

DATA = [random() for _ in range(30_000_000)]

with perf():
    mean = sum(DATA) / len(DATA)
    result = [DATA[i] - mean for i in range(len(DATA))]

As expected, we calculate the mean, and then create a new list of numbers where we subtracted the mean. When we run this, we get the following output:

Elapsed (seconds): 3.3306941986083984
Peak memory (MiB): 2327
...
    31,426,844,839      instructions:u
        11,882,126      cache-references:u
         4,024,256      cache-misses:u
...

As we’ll see shortly, this is very very slow. And the reason is that the default Python interpreter (“CPython”) executes all this code in a completely generic way.

CPython doesn’t know it’s dealing with a list of floating point numbers; from its perspective it’s an array of PyObject* pointers. That means the interpreter can’t just use the CPU’s floating point addition instructions: it has to poke at the PyObject* and look up its addition function, and then it has to extract the floating point number from both objects, and then it has to allocate a new PyObject*, and so on and so forth.

And this also means using extra memory to represent the data; as we’ll see later, higher memory usage can also impact performance.
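As an illustration of that overhead, here’s a sketch using sys.getsizeof (the exact numbers vary by Python version and platform):

```python
import sys

# A Python float is a full heap-allocated PyObject, not just
# 8 bytes of floating point data; on a typical 64-bit CPython
# it takes 24 bytes.
x = 3.14
print(sys.getsizeof(x))

# And a list doesn't store the floats themselves: it stores
# 8-byte PyObject* pointers to them. A million floats cost
# roughly 8MB of pointers *plus* the float objects they
# point to.
pointers_only = sys.getsizeof([0.0] * 1_000_000)
print(pointers_only)
```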

PyPy is fast

The slowness we saw in CPython is an implementation choice, due to the way the default CPython interpreter works. The PyPy interpreter is smarter. PyPy notices that the list is a homogeneous list of floating point numbers, and takes advantage of that knowledge. Instead of using generic code, it uses just-in-time compilation at runtime to generate machine code specifically tuned for adding floating point numbers.

Here’s the output of running the exact same code with PyPy:

Elapsed (seconds): 0.18796753883361816
Peak memory (MiB): 535
...
     1,280,847,750      instructions:u
           862,534      cache-references:u
           515,675      cache-misses:u
...

Fewer instructions, much lower memory usage, and much faster execution.

C is fast

The NumPy library, written in C, also allows us to run the same code in a vectorized way. Unlike PyPy, which automatically figures out what to do, NumPy requires us to explicitly use specialized array objects with specialized APIs. Here’s the corresponding code:

from benchmark import perf
import numpy as np

DATA = np.random.rand(30_000_000)

with perf():
    mean = DATA.sum() / len(DATA)
    result = DATA - mean

And here’s the output when run:

Elapsed (seconds): 0.0806736946105957
Peak memory (MiB): 483
...
       162,356,695      instructions:u
         9,669,587      cache-references:u
         3,366,269      cache-misses:u
...

The memory layout used by the C code in NumPy is an array of float64s, aka double floats: floating point numbers that are stored in 8 bytes. Since NumPy knows it’s dealing with an array of float64s, it can add two numbers with a single CPU instruction. As a result, the number of instructions run by the CPU is smaller by a factor of 20 compared to CPython.

To compare all three:

             Elapsed (secs)   CPU instructions (M)   Cache misses (M)
    CPython            3.33                 31,426                4.0
    PyPy               0.18                  1,280                0.5
    NumPy              0.08                    162                3.4

Part #2: Reducing memory reads and cache misses

You may have noticed above that peak memory usage is much lower for NumPy than CPython. This is due to NumPy’s better memory layout:

  • In CPython’s case, the list of numbers stores PyObject*, which means the actual number is elsewhere in memory.
  • In contrast, NumPy stores the floating point numbers directly in the array.
  • The PyPy interpreter is smart enough to notice the contents of the data and lay it out appropriately in memory as an array of float64.
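Here’s a quick sketch of NumPy’s compact layout, using the array’s itemsize and nbytes attributes:

```python
import numpy as np

arr = np.zeros(1_000_000, dtype=np.float64)

# Each float64 is stored inline as exactly 8 bytes, one after
# the other, with no per-item object headers or pointers:
print(arr.itemsize)  # 8
print(arr.nbytes)    # 8000000
```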

Reading RAM is waaaay slower than running a CPU instruction, so the CPU has a series of caches to speed up memory access. If memory is not in the cache, that results in an expensive cache miss. Less memory used means fewer memory lookups (which are slower than CPU instructions), and hopefully fewer cache misses, which are much slower.

That’s another way that vectorization helps: as a pre-requisite to doing it, you need to store data in a more compact, cache-friendly way. And that means faster-running code.
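As a rough demonstration that layout matters, we can sum a large C-ordered 2D array row by row (contiguous reads) and then column by column (strided, cache-unfriendly reads); on most machines the strided version is noticeably slower:

```python
import time
import numpy as np

a = np.ones((5_000, 5_000))  # C order: each row is contiguous

start = time.perf_counter()
row_total = sum(row.sum() for row in a)    # sequential reads
row_time = time.perf_counter() - start

start = time.perf_counter()
col_total = sum(col.sum() for col in a.T)  # strided reads
col_time = time.perf_counter() - start

assert row_total == col_total
print(f"rows: {row_time:.3f}s  columns: {col_time:.3f}s")
```

The data and the results are identical; only the order in which memory is read changes.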

To compare all three:

             Peak memory (MiB)   Cache references (M)   Cache misses (M)
    CPython              2,327                   11.9                4.0
    PyPy                   535                    0.9                0.5
    NumPy                  483                    9.7                3.4

It’s not clear to me how PyPy manages to have so many fewer cache references and misses.

Part #3: Parallelism with SIMD

So far we’ve been talking about the Python-specific meaning of vectorization: switching to running all the code quickly in something that translates directly to machine code, bypassing Python’s flexible-but-slow default implementation.

But there’s another meaning for “vectorization” used outside of the Python world: taking advantage of specific CPU functionality designed for this sort of operation. Adding the same number to a whole array of numbers is a common operation. More broadly, computational code will often run operations that involve doing that same thing to multiple items of an array.

In order to make these operations even faster, modern CPUs provide “SIMD” instructions: Single Instruction, Multiple Data.

  • With normal CPU instructions, only one pair of numbers can be added at a time.
  • Using a SIMD instruction, the CPU hardware can add multiple pairs at the same time.
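Here’s a plain-Python sketch of the idea; this is illustration only, since real SIMD happens in hardware, with the additions within a chunk executed by a single instruction:

```python
def scalar_add(a, b):
    # One addition per "instruction":
    return [x + y for x, y in zip(a, b)]

def simd_add(a, b, lanes=4):
    # One SIMD "instruction" handles `lanes` pairs; in real
    # hardware the additions within a chunk happen in parallel.
    result = []
    for i in range(0, len(a), lanes):
        chunk_a = a[i:i + lanes]
        chunk_b = b[i:i + lanes]
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
    return result

assert scalar_add([1, 2, 3, 4], [10, 10, 10, 10]) == \
       simd_add([1, 2, 3, 4], [10, 10, 10, 10])
```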

From a Python programmer’s perspective, SIMD instructions are too low-level to access directly. Instead, you need to do Python-style vectorization: delegate all the work to faster code that bypasses Python’s normal flexibility. Whatever provides that fast code, NumPy for example, can then implement SIMD for you.

Recent versions of NumPy have started implementing SIMD, with more operations being added with every release. So let’s compare NumPy 1.18, which doesn’t have much SIMD support, with NumPy 1.22.1, which does.

In particular, I’m going to be working with 300 million float64 numbers, and measuring how many AVX instructions are being run. AVX is one family of SIMD instructions provided by modern Intel and AMD CPUs, and specifically it’s the most advanced one supported by my personal computer. Your computer might end up using different SIMD instructions than mine.

Here’s the output with NumPy 1.18, which I was using in previous parts of this article:

Elapsed (seconds): 5.30308723449707
...
     1,623,405,861      instructions:u
               160      avx_insts.all:u
...

And here’s the output with NumPy 1.22.1:

Elapsed (seconds): 3.5732483863830566
...
     1,100,665,494      instructions:u
       225,000,272      avx_insts.all:u
...

To compare:

             Elapsed (secs)   Instructions (M)   AVX instructions (M)
    No SIMD             5.3              1,623                      0
    SIMD                3.6              1,100                    225

We’re using fewer instructions, because AVX instructions allow adding four float64s at a time, instead of one at a time. And a bunch of instructions have switched from “add a number” to an AVX-based “add four numbers in parallel.”

The vectorization performance trifecta

To summarize, if you have homogeneous data that you will be processing the same way, you can achieve speed-ups using vectorization.

In the context of Python that means switching to a memory representation that low-level code can access efficiently. Your code will then run faster due to three different causes:

  1. Simple operations on homogeneous data can be translated into more efficient machine code, reducing the number of CPU instructions that need to run.
  2. A more efficient memory layout reduces cache misses.
  3. The low-level code may be able to use hardware features like SIMD. In the Python world, this is done by recent versions of NumPy, and in some cases by Numba.

Of course, nothing is perfect, so you may also be interested in reading more about the limits of and alternatives to NumPy’s style of vectorization.