Farewell to fsync(): 10× faster database tests with Docker

You’re writing an application that talks to a database—PostgreSQL, MySQL, MongoDB—and you need to write some tests. You want three things out of your tests:

  1. Realism: the closer your test setup is to production, the more likely you are to catch bugs with tests.
  2. Speed: Slow tests are expensive tests.
  3. Ease-of-use: You want running and writing tests to be easy, otherwise you’re less likely to do it.

The problem is that interacting with a database is slow, so you want to avoid writing tests that talk to the database. But if you’re not talking to a real database, how do you ensure realism?

Luckily, with modern tools like Docker, you can actually have all three at the same time: realism, ease-of-use, and speed! But before we get to our proposed solution, let’s see an example of slowness, and then consider some other alternative solutions.

An example: slow database INSERTs

Consider the following program, which simulates a test suite that uses a database by doing 10,000 database inserts:

import psycopg2

def go():
    conn = psycopg2.connect(
        "user=postgres host=localhost port=5432")
    conn.autocommit = True
    cursor = conn.cursor()
    cursor.execute("CREATE DATABASE test;")
    conn = psycopg2.connect(
        "user=postgres host=localhost port=5432 dbname=test"
    )
    conn.autocommit = True
    cursor = conn.cursor()
    cursor.execute(
        "CREATE TABLE test (id serial PRIMARY KEY, " +
        "num integer, data varchar);")
    for i in range(10000):
        cursor.execute(
            "INSERT INTO test (data) VALUES ('hello!');")

go()

When I run it, it takes about 25 seconds:

$ time python dbtests.py
real    0m26.655s
user    0m0.479s
sys     0m0.384s

Some less-than-ideal solutions

What are some of the ways people deal with this problem?

Test doubles: You can write a fake or mock to replace the code that interacts with the database. Unfortunately, even when you use verified fakes you still suffer from a lack of a realism. Your database can enforce many constraints—uniqueness, relationships, and so on—that are time-consuming and therefore expensive to implement in a fake.

SQLite: If you use a RDBMS like PostgreSQL in production, it can be replaced by SQLite running in memory. This is faster, but now you’re using a different database, which has different features and different bugs.

Rollback before commit: Another solution is to talk to the real database, but ROLLBACK your transaction before it’s committed. This allows you to skip the most expensive step of your database interaction. Unfortunately this isn’t possible in all applications, and can limit the kinds of tests you can do.

These last two solutions present a starting point for a better solution, by pointing us at the slowest part of a database transaction: writing to disk.

Writing a file is fast, writing to disk is slow

Writing to an actual hard drive is quite slow compared to writing to memory. So when a program writes a file, by default it doesn’t get written to disk: instead, it gets stored in memory, and at some point the operating system will write that memory to disk.

But by that point the process has moved on to doing something else. Most of the time, then, writing to a file appears to be quite fast from the process’ point of view.

Unlike many programs, databases want data written to disk immediately: when you commit a transaction, they want to be sure that data will be there. And so they force the data to disk immediately, e.g. by calling the POSIX fsync() API. As a result, database commits are quite slow because they have to wait for the disk writes.

When you’re running tests, however, you don’t care about safely storing data! If your tests crash and you lose all the data, who cares?

If you could just tell your database that you don’t care about safety, everything would run much much much faster. In fact, it would run almost 20× faster:

$ time python dbtests.py
real    0m1.319s
user    0m0.117s
sys     0m0.103s

We went from 25 seconds to 1.5 seconds! Let’s see how you can do that in your own test setup.

Docker as a controlled database environment

One of the criteria for writing tests we mentioned at the start was ease-of-use, and Docker makes it super-easy to run a database. Here’s all you have to do to get PostgreSQL running:

$ docker run --rm -p 5432:5432 postgres:11.1

That will download an image and start up PostgreSQL v11.1, available on port 5432, running in its own isolated filesystem. You can startup 2, or 3, or 17 servers just as easily.

And even better, you have the ability control the runtime environment for the process.

First, you can substitute a configuration file for the default one, and in that configuration file you can set the appropriate option to disable forcing writes to disk. The documentation for the Docker PostgreSQL image explains how to do this for that database.

Second, for the PostgreSQL image you can pass configuration options on the command line. This is all it takes to get the performance improvement you want:

$ docker run --rm -p 5432:5432 postgres:11.1 -c fsync=off

Third, you can mount a custom directory at any point within the isolated process filesystem. You can mount a local directory from your main filesystem, or more interestingly for our purposes, when running Docker on Linux you can mount a RAM-based filesystem, so all writes go to memory:

$ docker run --rm \
  --mount type=tmpfs,destination=/var/lib/postgresql/data \
  -p 5432:5432 postgres:11.1

This latter option is the most generic, in that it’s the same for all databases; you just need to know which directory to override. But it does require you being able to fit all data in RAM.

The ideal choice

So here is the ideal testing choice for most situations:

  1. Writing tests that just use the database normally—as it would be used in production—gives you realism.
  2. Using Docker gives you ease-of-use: it’s very easy to start up a new, isolated database instance.
  3. Disabling fsync() gives you speed. Using database-specific options, or just Docker’s ability to override a directory with a RAM disk, you can get order of magnitude improvements in performance without changing the way you interact with the database. (Pretty much all databases have a way to disable fsync() and friends.)

No need to fear the database: with just a little setup, you can write tests that are fast, realistic, and easy to run.




You might also enjoy:

» When your CI is taking forever on AWS EC2, it might be EBS
» Fast tests for slow services: why you should use verified fakes