500× faster: Four different ways to speed up your code

If your Python code is slow and needs to be fast, there are many different approaches you can take, from parallelism to writing a compiled extension. But if you just stick to one approach, it’s easy to miss potential speedups, and end up with code that is much slower than it could be.

To make sure you’re not forgetting potential sources of speed, it’s useful to think in terms of practices. Each practice:

  • Speeds up your code in its own unique way.
  • Involves distinct skills and knowledge.
  • Can be applied on its own.
  • Can also be applied together with other practices for even more speed.

To make this more concrete, in this article I’ll work through an example where I will apply multiple practices. Specifically I’ll be demonstrating the practices of:

  1. Efficiency: Getting rid of wasteful or repetitive calculations.
  2. Compilation: Using a compiled language, and potentially working around the compiler’s limitations.
  3. Parallelism: Using multiple CPU cores.
  4. Process: Using development processes that result in faster code.

We’ll see that:

  • Applying just the Practice of Efficiency to this problem gave me a 2.5× speed-up.
  • Applying just the Practice of Compilation gave me a 13× speed-up.
  • When I applied both, the result was even faster.
  • Following up with the Practice of Parallelism gave even more of a speed-up, for a final speed-up of 500×.

Our example: Counting the frequency of letters

We have a book written in English, Jane Austen’s Northanger Abbey:

with open("northanger_abbey.txt") as f:
    TEXT = f.read()

Our goal is to see the relative frequency of letters in the book. Are vowels more frequent than consonants? Which vowel is the most frequent?

Here is a simple first-pass implementation:

# Counter is a dict subclass that returns a default count of 0 for
# missing entries:
from collections import Counter

def frequency_1(text):
    counts = Counter()
    for character in text:
        if character.isalpha():
            counts[character.lower()] += 1
    return counts

Here’s the result of running it:

sorted(
    (count, letter) for (letter, count)
    in frequency_1(TEXT).items()
)
[(1, 'à'),
 (2, 'é'),
 (3, 'ê'),
 (111, 'z'),
 (419, 'q'),
 (471, 'j'),
 (561, 'x'),
 (2016, 'k'),
 (3530, 'v'),
 (5297, 'b'),
 (5404, 'p'),
 (6606, 'g'),
 (7639, 'w'),
 (7746, 'f'),
 (7806, 'y'),
 (8106, 'c'),
 (8628, 'm'),
 (9690, 'u'),
 (13431, 'l'),
 (14164, 'd'),
 (20675, 's'),
 (21107, 'r'),
 (21474, 'h'),
 (22862, 'i'),
 (24670, 'n'),
 (26385, 'a'),
 (26412, 'o'),
 (30003, 't'),
 (44251, 'e')]

Unsurprisingly, the most common letter is "e".
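
And since relative frequency is what we actually care about, we can already answer the vowels-versus-consonants question from these counts. For example (a quick sketch that only counts the unaccented letters):

counts = frequency_1(TEXT)
vowel_total = sum(counts[letter] for letter in "aeiou")
consonant_total = sum(counts[letter] for letter in "bcdfghjklmnpqrstvwxyz")
# Fraction of all counted letters that are vowels:
print(vowel_total / (vowel_total + consonant_total))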

How can we make this function faster?

Measurement and testing as part of The Practice of Process

Software development relies on artifacts: source code, libraries, interpreters, compilers. But it also relies on processes (the activities you do), and speeding up your code is no different. In this article we’ll see two processes you should always apply when speeding up your software:

  1. Measuring the speed of your code, both through benchmarking and through profiling (a minimal benchmarking sketch follows this list).
  2. Testing your optimized code to make sure it has the same behavior as the original code.
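
The benchmark numbers in this article come from calling the functions repeatedly and averaging the runtime. As a minimal sketch of the idea (not necessarily the exact harness behind the tables below), the standard library’s timeit module is enough:

from timeit import repeat

# Time frequency_1() on the book: average of five single-call runs.
timings = repeat(lambda: frequency_1(TEXT), number=1, repeat=5)
average = sum(timings) / len(timings)
print(f"frequency_1: {average * 1_000_000:,.1f} microseconds")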

We can start by profiling the function, using the line_profiler tool, to see which lines take the most time:

Line #      Hits   % Time  Line Contents
========================================
     5                     def frequency_1(text):
     6         1      0.0      counts = Counter()
     7    433070     27.1      for character in text:
     8    433069     24.3          if character.isalpha():
     9    339470     48.6              counts[character.lower()] += 1
    10         1      0.0      return counts
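
If you want to reproduce this kind of line-by-line profile yourself, one option (assuming you are in Jupyter or IPython and have line_profiler installed) is its magic commands:

%load_ext line_profiler
%lprun -f frequency_1 frequency_1(TEXT)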

Do less work, or, The Practice of Efficiency

The Practice of Efficiency involves figuring out ways to do less work for the same results. It operates at a relatively high level of abstraction; we don’t think about how the CPU works, for example. These are therefore optimizations that can work in most programming languages, because they are about changing the logic of your computation to be less wasteful.

Do less work in inner loops

If we look at our profiling result above, almost half of the function’s runtime is spent doing counts[character.lower()] += 1. And certainly it seems wasteful to call character.lower() on every single alphabetic character in the text. We’re turning "I" into an "i" over and over again, not to mention turning every "i" into an "i".

Instead of doing the lowercasing in the for loop, we can store counts for uppercase and lowercase letters separately, and then combine them at the end:

def frequency_2(text):
    split_counts = Counter()
    for character in text:
        if character.isalpha():
            split_counts[character] += 1

    counts = Counter()
    for character, num in split_counts.items():
        counts[character.lower()] += num
    return counts

# Make sure the new function gives the same result as the
# old one:
assert frequency_1(TEXT) == frequency_2(TEXT)

Notice the assert, part of the Practice of Process. A function that is faster but broken is not a useful function. While you can’t see it in the final result, these asserts helped me catch some bugs in my code while I was writing the article.
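
Comparing against the original function is the main check here, but it also costs almost nothing to test on a tiny input you can verify by hand, for example:

# 'A' and 'a' should be combined; spaces and punctuation ignored:
assert frequency_2("Aa bB!") == Counter({"a": 2, "b": 2})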

When benchmarked (part of the Practice of Process), this optimization does speed up our code:

Code                                 Elapsed microseconds
frequency_1(TEXT)                    49,727.9
frequency_2(TEXT)                    38,225.0

Optimize your code and outputs for your specific data

We’re going to continue applying the Practice of Efficiency by optimizing the code for our specific goals and data. Let’s profile the latest version of the code:

Line #      Hits   % Time  Line Contents
========================================
     1                     def frequency_2(text):
     2         1      0.0      split_counts = Counter()
     3    433070     30.1      for character in text:
     4    433069     27.5          if character.isalpha():
     5    339470     42.4              split_counts[character] += 1
     6                     
     7         1      0.0      counts = Counter()
     8        53      0.0      for character, num in split_counts.items():
     9        52      0.0          counts[character.lower()] += num
    10         1      0.0      return counts

We can see that split_counts[character] += 1 is still taking a lot of our run time. How can we speed it up? By replacing the Counter (which is essentially a dict) with a list. Indexing into a list is much faster than indexing into a dict, as the quick check after this list suggests:

  • Storing an entry in a list just involves indexing into an array.
  • Storing an entry in a dict involves hashing an object and possibly some equality comparisons, then indexing into an internal array.
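
As a minimal sketch, we can time the two kinds of increment directly with the standard library’s timeit module:

from timeit import timeit

list_counts = [0] * 128
dict_counts = {chr(i): 0 for i in range(128)}

# The plain list increment should come out well ahead of the dict one:
print(timeit("list_counts[101] += 1", globals=globals()))
print(timeit("dict_counts['e'] += 1", globals=globals()))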

Unlike a dict, whose keys can be strings, a list’s indexes must be integers, so we need to convert characters to numbers. Characters have a numeric value we can look up with the built-in ord() function:

ord('a'), ord('z'), ord('A'), ord('Z')
(97, 122, 65, 90)

And we can turn a numeric value back into a character with chr():

chr(97), chr(122)
('a', 'z')

So we can do something like my_list[ord(character)] += 1 and it should work… but only if we know how many entries the list has in advance. If we’re accepting arbitrary alphabetic characters, the list might need to be very large:

ideograph = '𫞉'
ord(ideograph), ideograph.isalpha()
(178057, True)

Let’s consider our goals:

  1. We’re dealing with an English text; that was part of our problem definition.
  2. If we look at the output, it does contain a sprinkling of extra non-standard-English letters like 'à', but these are extremely rare. (Technically that probably should have been counted as an 'a', but I failed to implement the relevant logic…)
  3. The exact counts aren’t really the goal; we care about relative frequency.

So I’m going to decide that our output only needs to include A through Z, and no other characters need to be counted. Plus, I’m not going to count characters with accents at all. For English, this should not meaningfully alter the relative frequency of letters.

Now we have a simpler problem: instead of “count all alphabetic characters”, it’s “count all characters from A to Z”. Notably, that is a small, known, fixed set of numbers. And that means we can use a list instead of a dict!

Here’s our optimized implementation:

def frequency_3(text):
    # Make a 128-long list of zeros; recall that ord('z')
    # was 122 so this should suffice:
    split_counts = [0] * 128
    for character in text:
        index = ord(character)
        if index < 128:
            split_counts[index] += 1

    counts = {}
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        counts[letter] = (
            split_counts[ord(letter)] +
            split_counts[ord(letter.upper())]
        )
    return counts

The output won’t be exactly the same, since now we’re only outputting A to Z, so we need a slightly different check of correctness:

def assert_matches(counts1, counts2):
    """Match A to Z counts."""
    for character in 'abcdefghijklmnopqrstuvwxyz':
        assert counts1[character] == counts2[character]

assert_matches(
    frequency_1(TEXT),
    frequency_3(TEXT)
)

Our new implementation is even faster:

Code                                 Elapsed microseconds
frequency_2(TEXT)                    37,728.7
frequency_3(TEXT)                    19,292.4

The Practice of Compilation: Switching to a faster language

Next we’re going to switch to a compiled language, Rust.

Now, I could have just ported frequency_1() to Rust and stopped there. And it’s true that a compiler can apply some optimizations that in Python you’d have to do manually.

But much of the time it’s up to you to apply the Practice of Efficiency, whatever the language. That’s why Efficiency and Compilation are distinct practices: because they are distinct sources of performance. And so the optimizations we did in frequency_2() and frequency_3() also make Rust code faster.

To demonstrate this, I ported all three of the Python functions above to Rust. Here are the first two versions, reimplemented in Rust:

#[pyfunction]
fn frequency_1_rust(
    text: &str,
) -> PyResult<HashMap<char, u32>> {
    let mut counts = HashMap::new();
    for character in text.chars() {
        if character.is_alphabetic() {
            *counts
                .entry(
                    character
                        .to_lowercase()
                        .next()
                        .unwrap_or(character),
                )
                .or_default() += 1;
        }
    }
    Ok(counts)
}

#[pyfunction]
fn frequency_2_rust(
    text: &str,
) -> PyResult<HashMap<char, u32>> {
    let mut split_counts: HashMap<char, u32> =
        HashMap::new();
    for character in text.chars() {
        if character.is_alphabetic() {
            *split_counts.entry(character).or_default() +=
                1;
        }
    }

    let mut counts = HashMap::new();
    for (character, num) in split_counts.drain() {
        *counts
            .entry(
                character
                    .to_lowercase()
                    .next()
                    .unwrap_or(character),
            )
            .or_default() += num;
    }
    Ok(counts)
}


And here’s what the third version looks like in Rust:

fn ascii_arr_to_letter_map(
    split_counts: [u32; 128],
) -> HashMap<char, u32> {
    let mut counts: HashMap<char, u32> = HashMap::new();
    for index in ('a' as usize)..=('z' as usize) {
        let character =
            char::from_u32(index as u32).unwrap();
        let upper_index =
            character.to_ascii_uppercase() as usize;
        counts.insert(
            character,
            split_counts[index] + split_counts[upper_index],
        );
    }
    counts
}

#[pyfunction]
fn frequency_3_rust(text: &str) -> HashMap<char, u32> {
    let mut split_counts = [0u32; 128];
    for character in text.chars() {
        let character = character as usize;
        if character < 128 {
            split_counts[character] += 1;
        }
    }

    ascii_arr_to_letter_map(split_counts)
}

All three versions still behave the same:

assert_matches(frequency_1(TEXT), frequency_1_rust(TEXT))
assert_matches(frequency_1(TEXT), frequency_2_rust(TEXT))
assert_matches(frequency_1(TEXT), frequency_3_rust(TEXT))

Benchmarking all 6 versions demonstrates how the performance benefits from the Practice of Efficiency and the Practice of Compilation are distinct and complementary. The efficiency optimizations that sped up Python code also sped up the Rust code; frequency_3() is faster than frequency_1(), and frequency_3_rust() is faster than frequency_1_rust().

Code                                 Elapsed microseconds
frequency_1(TEXT)                    50,147.8
frequency_2(TEXT)                    37,640.9
frequency_3(TEXT)                    19,393.2
frequency_1_rust(TEXT)               3,744.3
frequency_2_rust(TEXT)               3,529.2
frequency_3_rust(TEXT)               205.1

In short, Efficiency and Compilation are distinct sources of speed.

Taking advantage of multiple cores: The Practice of Parallelism

So far we’ve been running on a single CPU core, but most computers have multiple cores. Taking advantage of parallelism is a different source of speed, so it also counts as a separate practice.
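
To see the shape of a parallel solution, here is a sketch in Python (an illustration only, not code from the benchmarks): split the text into chunks, count each chunk independently, then combine the per-chunk counts. Because of the GIL it uses processes rather than threads, and for a job this small the cost of shipping chunks to worker processes will usually eat the gains, so don’t expect a speedup from it:

from concurrent.futures import ProcessPoolExecutor

def count_chunk(chunk):
    # Same counting loop as frequency_3(), applied to one chunk:
    counts = [0] * 128
    for character in chunk:
        index = ord(character)
        if index < 128:
            counts[index] += 1
    return counts

def frequency_parallel_python(text, chunk_size=100_000):
    # Run this as a script (not in a notebook) so worker processes
    # can import count_chunk.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    totals = [0] * 128
    with ProcessPoolExecutor() as pool:
        for chunk_counts in pool.map(count_chunk, chunks):
            totals = [a + b for a, b in zip(totals, chunk_counts)]
    return {
        letter: totals[ord(letter)] + totals[ord(letter.upper())]
        for letter in "abcdefghijklmnopqrstuvwxyz"
    }

The Rust version below uses the same split-and-combine structure, but with a thread pool and no per-chunk copying, so that overhead goes away.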

Here’s what a parallel version looks like in Rust, utilizing the Rayon library:

fn sum(mut a: [u32; 128], b: [u32; 128]) -> [u32; 128] {
    for i in 0..128 {
        a[i] += b[i];
    }
    a
}

#[pyfunction]
fn frequency_parallel_rust(
    py: Python<'_>,
    text: &str,
) -> HashMap<char, u32> {
    use rayon::prelude::*;

    // Make sure to release the Global Interpreter Lock:
    let split_counts = py.allow_threads(|| {
        // To squeeze more performance out of Rayon, use a
        // trick: the characters we care about will always
        // be represented by a single byte, unambiguously.
        // So looking at bytes is fine, and that allows us
        // to force Rayon to use chunks.
        text.as_bytes()
            // Iterate over chunks in parallel:
            .par_chunks(8192)
            .fold_with(
                [0u32; 128],
                |mut split_counts, characters| {
                    for character in characters {
                        if *character < 128 {
                            split_counts
                                [*character as usize] += 1;
                        };
                    }
                    split_counts
                },
            )
            // Combine all the chunks' results:
            .reduce(|| [0u32; 128], sum)
    });
    ascii_arr_to_letter_map(split_counts)
}

This version still has the same results:

assert_matches(frequency_1(TEXT), frequency_parallel_rust(TEXT))

And here is the resulting speedup:

Code                                 Elapsed microseconds
frequency_3_rust(TEXT)               234.5
frequency_parallel_rust(TEXT)        105.3

Process revisited: Are we measuring the right thing?

Our final function is 500× faster… or is it?

We are measuring the performance of these functions by calling them repeatedly and averaging the runtime. And I happen to have some pre-existing knowledge:

  • Rust represents strings as UTF-8, whereas Python represents strings in its own internal format that is not UTF-8.
  • Python therefore needs to convert strings to UTF-8 when calling Rust.
  • When Python converts a string to UTF-8, if a specific API is used then it caches that converted string.

That means there’s a pretty good chance we’re not measuring the cost of the UTF-8 conversion, since we’re benchmarking against the same TEXT string repeatedly, and it has a cached UTF-8 representation after the first call. In the real world, we wouldn’t have a cached version.

We can check this by timing a single call with a new string; I’m going to use the non-parallel version since it has a more consistent speed:

from time import time

def timeit(f, *args):
    start = time()
    f(*args)
    print("Elapsed:", int((time() - start) * 1_000_000), "µs")

print("Original text")
timeit(frequency_3_rust, TEXT)
timeit(frequency_3_rust, TEXT)
print()

for i in range(3):
    # New string:
    s = TEXT + str(i)
    print("New text", i + 1)
    timeit(frequency_3_rust, s)
    timeit(frequency_3_rust, s)
    print()
Original text
Elapsed: 212 µs
Elapsed: 206 µs

New text 1
Elapsed: 769 µs
Elapsed: 207 µs

New text 2
Elapsed: 599 µs
Elapsed: 202 µs

New text 3
Elapsed: 625 µs
Elapsed: 200 µs

For new strings, the first run is around 400µs slower than the second one, which is likely the cost of converting to UTF-8.
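
As a side check on the caching claim, and assuming the conversion really is being cached on the string object (which is what the timings suggest), the string’s reported size should grow after the first call. A hypothetical way to see that:

import sys

fresh_text = TEXT + "?"           # a new string, so no cached UTF-8 copy yet
before = sys.getsizeof(fresh_text)
frequency_3_rust(fresh_text)      # first call triggers the UTF-8 conversion
after = sys.getsizeof(fresh_text)
print(after - before)             # roughly the size of the cached UTF-8 copy, if any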

Of course, the book I am loading is already in UTF-8, so instead of loading it into Python (which involves converting it into a Python string) and then passing it to Rust (which involves converting back into UTF-8), we could just change our API so it passes UTF-8-encoded bytes to the Rust code, and then we won’t have that overhead.

I implemented a new function, frequency_3_rust_bytes(), that is designed to accept UTF-8-encoded bytes (source code omitted; it’s almost exactly the same as frequency_3_rust() except for the inputs). Then I measured the time of a first and a second call on a new bytestring:

with open("northanger_abbey.txt", "rb") as f:
    TEXT_BYTES = f.read()

assert_matches(
    frequency_1(TEXT),
    frequency_3_rust_bytes(TEXT_BYTES)
)

print("New text no longer has the ~400µs conversion overhead:")
new_text = TEXT_BYTES + b"!"
timeit(frequency_3_rust_bytes, new_text)
timeit(frequency_3_rust_bytes, new_text)
New text no longer has the ~400µs conversion overhead:
Elapsed: 186 µs
Elapsed: 182 µs

And if we measure sustained average time we can see it more or less matches the previous version:

Code                                 Elapsed microseconds
frequency_3_rust(TEXT)               227.2
frequency_3_rust_bytes(TEXT_BYTES)   183.8

So it looks like we can indeed bypass the UTF-8 cost by passing in bytes. We would probably want to implement a frequency_parallel_rust_bytes() function as well, so as to get the benefit of parallelism without the overhead of the UTF-8 conversion.

Zooming out to the big picture, this is an example of one of the many reasons process is so important: you need to make sure you’re measuring what you think you’re measuring.

Practices of performance: separately and together

Throughout this article I’ve used the Practice of Process: testing that new versions were correct, profiling, and measuring the code as I went. Benchmarking caught some dead-end optimizations that I didn’t bother showing.

The Practice of Efficiency helped get rid of wasteful work, the Practice of Compilation made the code faster, and Parallelism allowed using multiple CPU cores. Each practice is a distinct and multiplicative source of speed.

In short: if your code needs to be fast, don’t just stick to one practice. You’ll get much faster code if you apply more than one.

To help you learn these practices and write faster computational Python code, I’m writing a book: Practices of Performance. It covers all the practices mentioned in this article, plus the Practice of Mechanical Sympathy: understanding how CPUs and other relevant computer hardware works. You can learn more about the book here.