CPUs, cloud VMs, and noisy neighbors: the limits of parallelism

Sometimes your program is slow not because of your code, but because of where it’s running. If you have other processes competing for the same limited hardware resources, your code will run more slowly.

Once you add virtualization into the mix, those competing processes might be invisible… but they’re still there.

In this article we’ll cover:

  • The hardware limits of CPUs’ cores and “hyperthreads”.
  • How operating systems deal with multiple processes that want to use a limited number of CPU cores.
  • The impact of virtualization on these various resources, including what cloud vCPUs mean.
  • CPU sharing limits available via containers.

The parallelism limits of your CPU

As a gross simplification, back in the day many computers only had a single core, which meant only one computation could happen at any given time. Modern computers typically have multiple cores, so you can run multiple computations in parallel.

For example, I have a computer with an Intel i7-6700T CPU. Per the linked specs, it has 4 cores, so it can run 4 different computations in parallel, one on each core.

If I look at /proc/cpuinfo on Linux, I can get info about these cores:

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i7-6700T CPU @ 2.80GHz
stepping        : 3
microcode       : 0xea
cpu MHz         : 2800.000
cache size      : 8192 KB
physical id     : 0
...
processor       : 1
...
processor       : 2
... etc. ...
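
If you’d rather query this from code, the Python standard library can report the same information; a minimal sketch:

import os

# How many "processors" Linux reports; as we'll see later in the
# article, this can be more than the number of physical cores:
print("CPUs:", os.cpu_count())

# On Linux, the CPUs this particular process is allowed to run on;
# usually the same number, unless CPU affinity has been restricted:
print("Usable CPUs:", len(os.sched_getaffinity(0)))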

How operating systems deal with limited cores

No matter how many CPU cores you have, on general purpose computers you are almost always running more processes than cores. For example, the fairly quiescent Linux machine I’m using for this article has 260 different process IDs if I run ps xa, from ssh to the screensaver to bash sessions.

260 is a lot bigger than 4, so how can all those processes proceed? They can’t all run at the same time: only 4 can be using CPU cores at any given moment. To make sure every process can make progress, the operating system schedules them on and off the cores as needed.
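
As a quick cross-check, here’s a Linux-only sketch that compares the number of running processes to the number of CPUs, by counting the numeric entries in /proc:

import os
from pathlib import Path

# Every running process shows up as a numbered directory under /proc:
num_processes = sum(1 for p in Path("/proc").iterdir() if p.name.isdigit())
print(f"{num_processes} processes sharing {os.cpu_count()} CPUs")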

To give a simple example, if you have a single core and two processes P0 and P1 that are running full speed doing some computation, the operating system will run:

Core 0: P0, P1, P0, P1, P0, P1

If you have two cores and four processes, it might run:

Core 0: P0, P2, P1, P3, P1
Core 1: P1, P3, P2, P1, P0

Of course, if a process is waiting for something (data to be read from RAM or disk, the network, a lock…) it doesn’t need to use a core at all; that gives an opportunity to other processes to be scheduled. In practice most of the processes on a typical computer just sit there waiting; the ssh process, for example, only does something when a message is sent to it, and I don’t type very fast by computer CPU standards.
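
You can actually watch the scheduler doing this. On Linux, /proc/<pid>/status reports how many times a process has been switched off a core; here’s a small sketch that burns some CPU and then prints those counters:

def ctxt_switches():
    # /proc/self/status includes voluntary_ctxt_switches (the process
    # gave up the core, e.g. to wait for I/O) and
    # nonvoluntary_ctxt_switches (the scheduler preempted it so
    # something else could run):
    with open("/proc/self/status") as f:
        return [line.strip() for line in f if "ctxt_switches" in line]

# Burn some CPU so the scheduler has a reason to preempt us:
total = 0
for i in range(100_000_000):
    total += i

print("\n".join(ctxt_switches()))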

Note: A Linux process can have multiple operating system threads, each of which—from the Linux kernel’s perspective—is really not that different from a process, and is scheduled similarly. I’m trying to avoid using the word “thread” too much, however, since we’ll be encountering it with a very different meaning later in the article.

Sharing CPUs slows down individual processes

Given a number of processes larger than the number of cores, if all those processes want to run computations on the CPU then each process is going to take longer to run. Let’s see some examples.

If I run a little benchmark on one thread, I get the following performance (the --privileged flag is necessary to get full performance):

$ docker run --privileged -v $PWD:/home python:3.10-slim python /home/pystone.py 
Pystone(1.1) time for 50000 passes = 0.18486
This machine benchmarks at 270476 pystones/second

I can also run two benchmarks in parallel on different CPU cores, using the following script that utilizes docker run’s --cpuset-cpus flag:

from subprocess import check_call
from threading import Thread
from sys import argv

def run_on_cpu(cpu):
    cpu = int(cpu)
    # --cpuset-cpus pins the container to a single virtual core:
    check_call(
        f"docker run --privileged -v /home/itamarst:/home --cpuset-cpus={cpu} python:3.10-slim python /home/pystone.py 100_000".split()
    )

# Run two copies of the benchmark in parallel, on the two CPUs given
# as command-line arguments:
Thread(target=lambda: run_on_cpu(argv[1])).start()
run_on_cpu(argv[2])
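
Docker isn’t essential here, by the way. If you just want to pin a process to a particular virtual core on Linux, the taskset utility (or os.sched_setaffinity() from Python) does the same job as --cpuset-cpus; here’s a hypothetical variant of the script above that runs directly on the host, assuming pystone.py is in the current directory:

from subprocess import check_call
from sys import argv, executable
from threading import Thread

def run_on_cpu(cpu):
    # taskset -c pins the command to the given virtual core:
    check_call(["taskset", "-c", str(int(cpu)), executable, "pystone.py", "100_000"])

Thread(target=lambda: run_on_cpu(argv[1])).start()
run_on_cpu(argv[2])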

If I run two benchmarks in parallel with the processor cores identified by Linux as 1 and 2, both run at full speed:

$ python3 parallel.py 1 2
Pystone(1.1) time for 100000 passes = 0.377883
This machine benchmarks at 264632 pystones/second
Pystone(1.1) time for 100000 passes = 0.370839
This machine benchmarks at 269659 pystones/second

If I run them both on the same core, the speed gets cut in half. The Linux scheduler keeps swapping them back and forth in an attempt to be fair, so each only gets approximately half as much time running on the CPU:

$ python3 parallel.py 2 2
Pystone(1.1) time for 100000 passes = 0.75122
This machine benchmarks at 133117 pystones/second
Pystone(1.1) time for 100000 passes = 0.720972
This machine benchmarks at 138702 pystones/second

Virtual cores (“hyperthreads”) give you a bit more parallelism

So far we’ve been talking about cores, and we saw on the Intel site that my CPU has 4 cores. Let’s see how many cores Linux thinks my computer has:

$ cat /proc/cpuinfo | grep processor
processor       : 0
processor       : 1
processor       : 2
processor       : 3
processor       : 4
processor       : 5
processor       : 6
processor       : 7

That’s twice as many as the number of cores. What’s going on?

This CPU has 4 cores, but it has 8 “threads”, “hyperthreads”, or “virtual cores”. This is a hardware feature, not to be confused with operating system threads, so I’m going to stick to “virtual core” for the rest of the article to prevent ambiguity.

Essentially, each CPU core has internal execution resources that often go partially unused, for example while waiting for data from memory. So the hardware designers split each physical core into two virtual cores: they look like regular cores to the operating system, but they are really sharing the same hardware. Under some circumstances, this will enable extracting more performance out of the CPU.

Note: On PCs you can often disable virtual cores by disabling “hyperthreading” in the BIOS.

The /proc/cpuinfo information tells us which physical core underlies each of the “processor” numbers Linux identifies (i.e. virtual cores):

$ cat /proc/cpuinfo | grep 'processor\|core id'
processor       : 0
core id         : 0
processor       : 1
core id         : 1
processor       : 2
core id         : 2
processor       : 3
core id         : 3
processor       : 4
core id         : 0
processor       : 5
core id         : 1
processor       : 6
core id         : 2
processor       : 7
core id         : 3

So “processor” 0 and 4 have the same core ID, as do 1/5, 2/6, and 3/7.
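
We can build that mapping programmatically, too. Here’s a minimal sketch that parses /proc/cpuinfo, assuming a single-socket machine like mine (multi-socket systems would also need the “physical id” field):

from collections import defaultdict

# Map each physical core id to the Linux "processor" numbers (i.e.
# virtual cores) that run on it:
cores = defaultdict(list)
processor = None
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("processor"):
            processor = int(line.split(":")[1])
        elif line.startswith("core id"):
            cores[int(line.split(":")[1])].append(processor)

for core_id, processors in sorted(cores.items()):
    print(f"physical core {core_id}: virtual cores {processors}")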

Let’s try running our benchmark in parallel on the same physical core, split across two virtual cores:

$ python3 parallel.py 3 7
Pystone(1.1) time for 100000 passes = 0.704348
This machine benchmarks at 141975 pystones/second
Pystone(1.1) time for 100000 passes = 0.680433
This machine benchmarks at 146965 pystones/second

Previously, we were running on two separate physical cores, so we got full performance from each parallel benchmark. Here, we are on two virtual cores sharing the same physical core, and so performance takes a hit.

Does that mean hyperthreading is useless? Not necessarily: whether you get a performance boost depends on the workload you’re running. But virtual cores are very much not full cores in terms of ability to add parallelism.

A summary: the impact of process location on performance

To simplify heavily, here’s a summary of the performance impacts of which core a pair of processes runs on. 100% means the performance you’d get from a single core running a single process at maximum speed; higher numbers are better.

Configuration                                   Process 1 performance   Process 2 performance   Total
Different physical cores                        100%                    100%                    200%
Same physical core                              50%                     50%                     100%
Different virtual cores on same physical core   50%-60%                 50%-60%                 100%-120%

The 50%-60% is a bit hand-wavy: the benefit from virtual cores varies by workload and CPU model, but at least in some situations you will get a performance boost.

Sidenote: beyond cores, there are other shared CPU resources

Note that cores are just one of the limited, shared resources being divided between different processes. There are also, for example, memory caches. Some of those caches are per-core, but the largest L3 cache is typically shared across multiple cores.

So if the L3 cache is your performance bottleneck, processes can slow each other down even when the CPU cores are not a bottleneck.
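
On Linux you can see these caches, and which virtual cores share each one, via sysfs; a quick sketch (the exact paths can vary by kernel and hardware):

from pathlib import Path

# Each cache visible to cpu0 is an index* directory in sysfs:
for index in sorted(Path("/sys/devices/system/cpu/cpu0/cache").glob("index*")):
    level = (index / "level").read_text().strip()
    ctype = (index / "type").read_text().strip()
    size = (index / "size").read_text().strip()
    shared = (index / "shared_cpu_list").read_text().strip()
    print(f"L{level} {ctype}: {size}, shared with virtual cores {shared}")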

Cloud computing, virtualization, and noisy neighbors

So far we’ve been talking about physical computers. When you’re running in the cloud, however, you’re usually running on a virtual machine. The virtual machine pretends (up to some level of realism) to be a fully fledged computer, but it’s actually only getting a subset of the physical computer’s resources.

Typically cloud vendors describe their VMs in terms of how many “vCPUs” you get. What’s a vCPU?

Typically, this is “thread” as in “virtual core”: a vCPU in the cloud is a virtual core, not a full physical core. Remember how two processes running on two virtual cores mapped to the same physical core both experienced a significant slowdown?

This means:

  • If you’re running on a single vCPU, a process on some other virtual machine from some other cloud customer might be causing your process to slow down, because it’s sharing the same physical core.
  • If you’re running on a pair of vCPUs, it’s unclear to me whether they will share the same physical core or not. Either way, there are either guaranteed slowdowns (from your own processes) or possible slowdowns (from other VMs) if you’re trying to max out both vCPUs. One way to inspect the topology your VM reports is sketched below.
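
Here’s a minimal sketch of that inspection, reading the sibling topology Linux exposes in sysfs. Keep in mind that inside a VM this is only whatever topology the hypervisor chooses to expose, so treat it as a hint rather than ground truth:

from pathlib import Path

# For each CPU, thread_siblings_list says which virtual cores share
# its physical core:
cpus = Path("/sys/devices/system/cpu").glob("cpu[0-9]*")
for cpu in sorted(cpus, key=lambda p: int(p.name[3:])):
    siblings = (cpu / "topology" / "thread_siblings_list").read_text().strip()
    print(f"{cpu.name}: shares a physical core with CPUs {siblings}")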

Depending on the instance type, cloud vendors will also impose various other limits on CPU availability; for example, some instance types only allow low sustained CPU usage, but permit short-term bursts of higher usage. And on some instance types, a vCPU actually does map to a full physical core; see the instance-specific documentation for your cloud vendor.

There are also other shared resources like L3 memory caches that might mean other virtual machines’ processes can slow you down. Hardware vendors have various ways for cloud and hosting vendors to mitigate this, like Intel’s Resource Director Technology.

This general problem is known as the “noisy neighbor” problem: other virtual machines running on the same physical machine can slow down your processes. The name is a bit of a misnomer, since in the real world you can actually hear your neighbors and know that’s the problem. In the cloud computing context, other virtual machines are designed to be as opaque as possible to your virtual machine, not least for security reasons.

Note: Whether or not any particular tool or technique will speed things up depends on where the bottlenecks are in your software.

Implications for batch processing

If you’re running long-running CPU-bound batch processes, you need to understand the hardware and environment you’re running on and utilize them appropriately.

On physical machines: A single process per core is a reasonable starting point, but you need to figure out whether utilizing virtual cores actually helps or not. It may be that you’re better off with just one process per physical core; measure both options to compare.
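
For example, here’s a sketch of sizing a multiprocessing pool by physical cores rather than by os.cpu_count(), which counts virtual cores; the /proc/cpuinfo parsing is Linux-only, and you should measure whether this beats one worker per virtual core for your workload:

import os
from multiprocessing import Pool

def physical_core_count():
    # Count unique (physical id, core id) pairs in /proc/cpuinfo,
    # falling back to os.cpu_count() if that fails (e.g. non-Linux):
    try:
        cores = set()
        physical_id = None
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("physical id"):
                    physical_id = line.split(":")[1].strip()
                elif line.startswith("core id"):
                    cores.add((physical_id, line.split(":")[1].strip()))
        return len(cores) or os.cpu_count()
    except OSError:
        return os.cpu_count()

if __name__ == "__main__":
    with Pool(processes=physical_core_count()) as pool:
        print(pool.map(abs, range(-4, 4)))  # stand-in for real work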

In the cloud: Keep in mind that individual vCPUs aren’t physical CPU cores; they’re virtual cores, with the corresponding caveats. In addition, different instance types may have very different long-term performance characteristics. Many cloud instances are optimized for web applications, where short-term bursts of CPU usage may suffice; your needs may be different, and may require different instance types and numbers of vCPUs.