Memory location matters for performance

If you’re writing high-performance code, what matters is how many CPU instructions you run. Right?

That’s what I learned long ago in school, anyway, when I took algorithms and data structures classes. The basic assumption was that reading or writing to memory took a small, constant amount of time, and so you could focus on how many operations were being done. This was so fundamental to how we were taught that we didn’t even have to write real code: pseudo-code was sufficient.

Unfortunately, this is not how a CPU works in the real world.

In the real world, access to memory is vastly slower than CPU instructions, and all those O(N) measurements that just focus on CPU operations are insufficient to tell you what real-world performance will be. And that means you need to think about not just your computation, but also how you access memory.

An updated version of this article is included in a book I’m working on that will help teach you how to optimize low-level code, the kind of code you’d write with Cython, C, or Rust. The goal is to help data scientists and scientists who normally write Python to understand how to make their compiled code faster.

Same number of operations, different performance

Let’s see the impact of memory access on performance. I’ll be using Rust for my example code because a low-level language maps more closely to what the CPU is doing. Writing x = y + 2 in Rust will often translate directly to the relevant CPU instructions of adding 2 to a value and then shoving the result somewhere. That same code in Python will involve vastly more work.

Consider the following program; all it does is increment values in a contiguous chunk of memory:

use std::env::args;

fn main() {
    // Get the first command-line argument:
    let scan_type = args().nth(1).expect(
        "Must provide scan type argument.");

    // Get the second argument, convert to an integer:
    let vector_size = args().nth(2).expect(
        "Must provide array size argument.");
    let vector_size = vector_size.parse::<usize>().expect(
        "Not an integer.");

    // Figure out an appropriate multiplier:
    let multiplier = match &*scan_type {
        "linear" => 1,
        "random" => 48271,
        _ => panic!("Unknown scan type."),
    };

    // Create 4 vectors (for our purposes, an array in
    // memory) of size vector_size, filled with zeros.
    // We use 64-bit integers, so each value is 8 bytes.
    let mut data = vec![0u64; vector_size];
    let mut data2 = vec![0u64; vector_size];
    let mut data3 = vec![0u64; vector_size];
    let mut data4 = vec![0u64; vector_size];

    // Update values in the vectors. A multiplier of 1
    // results in a linear memory scan, whereas 48271
    // will result in jumping around memory "randomly":
    let mut index = 0;
    for _ in 0..100_000_000 {
        data[index] += 1;
        data2[index] += 1;
        data3[index] += 1;
        data4[index] += 1;
        index = (multiplier * index + 1) % vector_size;
    }
}

Now, there are two ways we scan through memory with this program:

  1. Linearly, from start to finish.
  2. Randomly bouncing around (see the sketch below).
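
To get a feel for the difference, here’s a small standalone sketch of mine (not part of the benchmark above) that replays the index update from the loop and prints the first few indices each multiplier produces for a 1,000,000-entry vector:

fn main() {
    let vector_size: usize = 1_000_000;
    for (name, multiplier) in [("linear", 1usize), ("random", 48271)] {
        let mut index = 0;
        let mut first_indices = Vec::new();
        for _ in 0..8 {
            first_indices.push(index);
            // Same update formula as the benchmark's loop:
            index = (multiplier * index + 1) % vector_size;
        }
        println!("{}: {:?}", name, first_indices);
    }
}

With a multiplier of 1 the indices simply count up; with 48271 (the multiplier used by the classic “minimal standard” linear congruential generator, presumably why it was chosen) they jump all over the vector.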

In either case, we’ll be doing the same number of operations. So per the model I learned in college, both variations should take the same amount of time because they’re doing the same amount of work. Let’s try it out:

$ time ./memory-cache linear 1000000

real    0m0.936s
user    0m0.909s
sys     0m0.022s
$ time ./memory-cache random 1000000

real    0m4.943s
user    0m4.902s
sys     0m0.014s

Bouncing around memory at random is a lot slower than a linear scan of memory—but why?

Slow memory and the CPU cache hierarchy

It turns out that from a CPU’s perspective, reading from RAM is quite slow. To speed things up, CPU designers provide a series of caches to reduce the need to access RAM.

To simplify dramatically, here’s how the CPU talks to RAM:

[Diagram: the CPU talks to an L1 instruction cache and an L1 data cache; both L1 caches talk to the L2 cache, which talks to the L3 cache, which talks to RAM.]

From fastest to slowest:

  1. The L1 caches are the fastest and smallest, and might be per-CPU or per-core. There’s one for instructions (i.e. the code you’re running) and one for the data your program is using.
  2. The L2 cache is bigger but slower, and can store both instructions and data.
  3. The L3 cache is even bigger and even slower, and can store both instructions and data.
  4. Finally, there’s RAM, which is even slower to access.

My CPU, for example, has 4 cores with:

  1. 32KB L1 data cache and 32KB L1 instruction cache per core.
  2. 256KB L2 cache per core.
  3. 8MB L3 cache shared by all the cores.
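
If you want to see the equivalent numbers for your own machine, on Linux the lscpu tool will report them; I’ll just show the command, since the exact output format varies between versions:

$ lscpu | grep -i cache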

Caches and memory access performance

In both cases, linear scan and random access, we only come back to any particular value occasionally, and the four vectors together (32MB) are far too big to fit in any of the caches. So why is the linear scan so much faster?

Cache lines

Memory is loaded into the caches in chunks known as “cache lines”, typically 64 bytes at a time. So when we access the first value of the array, it gets loaded into the L1 cache, and so do the 2nd, 3rd, 4th, and following values. Specifically, each of our 64-bit integers is 8 bytes, so 8 values get loaded into the L1 cache at a time.

So in a linear scan we only need to do about 1/8th of the cache loads that the random scan does: each cache line we load contains not just the current value but also the next seven values we’re about to use. In a random scan, we typically use only 8 of the 64 bytes in the cache line we loaded before jumping off to some completely different memory address.
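
To make that 1/8th figure concrete, here’s a small sketch of mine (again, separate from the benchmark) that replays the first 1,000 index updates of each pattern and counts how many distinct 64-byte cache lines they touch, assuming 8-byte values:

use std::collections::HashSet;

fn main() {
    let vector_size: usize = 1_000_000;
    for (name, multiplier) in [("linear", 1usize), ("random", 48271)] {
        let mut index = 0;
        let mut lines_touched = HashSet::new();
        for _ in 0..1_000 {
            // 8-byte values and 64-byte cache lines means
            // 8 values per cache line:
            lines_touched.insert(index * 8 / 64);
            index = (multiplier * index + 1) % vector_size;
        }
        println!("{}: {} distinct cache lines", name, lines_touched.len());
    }
}

The linear pattern only moves on to a new cache line once every 8 accesses, while the random pattern lands on a different cache line almost every time.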

Pre-fetching

Modern CPUs will also pre-fetch memory into the cache as an optimization technique. By the time you’ve gotten to the point of needing to read the next chunk of memory, it might already be in the cache.

Linear scans are a fairly common usage pattern in software, and easy to detect, so there’s a decent chance the CPU will notice and prefetch memory into the cache. Random scans are, of course, much harder to predict!

Measuring cache misses

On Linux, you can use the perf stat tool to measure cache misses. You can do it down to the level of specific caches (L1 vs L3, say), but for our purposes just cache misses in general are sufficient to see what’s going on:

$ perf stat -e instructions,cache-misses ./memory-cache linear 1000000
     1,500,537,526      instructions:u
         1,190,670      cache-misses:u

$ perf stat -e instructions,cache-misses ./memory-cache random 1000000
     1,500,538,598      instructions:u
       262,120,756      cache-misses:u

As expected, the number of CPU instructions is almost identical, but the number of cache misses is vastly higher for random access.
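
If you do want to break the misses down by cache level, perf can count more specific events. The exact event names vary by CPU and perf version (perf list will show what your machine supports), but on many Linux systems something like this works:

$ perf stat -e L1-dcache-load-misses,LLC-load-misses ./memory-cache random 1000000

Here LLC is the last-level cache, which on my CPU means the L3 cache.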

Memory matters

While the effects we’re seeing here are in native code, where calculations run a lot faster than they would in Python, a lot of high-performance Python code does the heavy lifting in a low-level language like C or Rust. So if you’re dealing with large amounts of data, memory access may well be just as much of a bottleneck as calculation.

To speed things up, consider ways to:

  • Iterate over the data in the direction it is laid out in memory: iterating across dimensions that don’t match the memory layout will be slower (see the sketch after this list).
  • Take advantage of cache lines, for example by favoring linear scans over random reads.
  • Shrink the working set of data so it’s more likely to fit in your CPU caches.
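
For example, here’s a minimal sketch of mine (not from the benchmark above) of that first recommendation: summing a flat, row-major 2D array either in the order it is laid out in memory or across it. Both loops do exactly the same number of additions, but the second one jumps cols values ahead on every access, so it can’t take advantage of cache lines or prefetching:

fn main() {
    let (rows, cols) = (4_000, 4_000);
    // A 2D array stored as one contiguous, row-major vector:
    let data: Vec<u64> = vec![1; rows * cols];

    // Matches the memory layout: consecutive accesses are adjacent in RAM.
    let mut row_major_sum = 0u64;
    for row in 0..rows {
        for col in 0..cols {
            row_major_sum += data[row * cols + col];
        }
    }

    // Fights the memory layout: consecutive accesses are cols values apart.
    let mut column_major_sum = 0u64;
    for col in 0..cols {
        for row in 0..rows {
            column_major_sum += data[row * cols + col];
        }
    }

    println!("{} {}", row_major_sum, column_major_sum);
}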