When your CI is taking forever on AWS, it might be EBS

You’re running your test suite or your Docker image packaging on an EC2 server. And it’s slow.

  • docker pull takes 15 minutes just to extract the images it downloaded in 1 minute.
  • apt or dnf installs take another 15 minutes.
  • pip install or conda install take even more time.

It’s a fast machine with plenty of memory; in fact, you may be using a custom machine precisely because Travis CI or Circle CI builders are underpowered. And yet it’s still taking forever.

Why?

Quite possibly the problem is with your EBS disk’s IOPS.

IOPS as a bottleneck

Your EC2 virtual machine has a virtual hard drive, typically an AWS EBS volume. These drives have a limited number of I/O operations per second (“IOPS”): only so many reads and writes are allowed each second.

For the default general purpose gp2 disk type there are two limits:

  • The standard IOPS, 3 IOPS per GiB of storage, with a minimum of 100 regardless of volume size. If you have a 100GiB EBS volume it will do 300 IOPS; a 500GiB volume will do 1500 IOPS.
  • The burst IOPS of 3000.

The way the burst IOPS works is that you get a bucket of 5.4 million I/O credits, which gets drained at up to 3000 per second. Once the credits are used up you’re back to the standard IOPS, and over time the bucket refills. (You can get the full details in the AWS documentation.)
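To get a feel for the numbers, here’s a rough back-of-the-envelope sketch in Python. The function names are mine, and it assumes (per the AWS docs) that the bucket keeps refilling at the standard rate even while you’re bursting:

```python
# Back-of-the-envelope math for gp2 bursting. The limits (3 IOPS/GiB with a
# 100 IOPS floor, 3000 IOPS burst, 5.4 million credits) come from the AWS
# docs; the function names are just for this example.

def gp2_baseline_iops(size_gib):
    """Standard (baseline) IOPS for a gp2 volume: 3 per GiB, minimum 100."""
    return max(100, 3 * size_gib)

def burst_minutes(size_gib, burst_iops=3000, credits=5_400_000):
    """Roughly how long a full credit bucket lasts at sustained max burst."""
    # The bucket refills at the standard rate even while you burst, so the
    # net drain is (burst - standard) credits per second.
    drain_per_second = burst_iops - gp2_baseline_iops(size_gib)
    if drain_per_second <= 0:
        return float("inf")  # standard rate is already 3000+; never drains
    return credits / drain_per_second / 60

for size_gib in (100, 500):
    print(f"{size_gib} GiB: standard {gp2_baseline_iops(size_gib)} IOPS, "
          f"~{burst_minutes(size_gib):.0f} minutes of 3000 IOPS burst")
```

So a 100GiB volume gets a bit over half an hour of full-speed burst before the bucket runs dry.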

For application servers, this works great: you’re not doing a lot of I/O once your application has started running. For CI workloads—tests and packaging—limited IOPS can be a performance problem.

When you download a Docker image, operating system package, or Python package, you are doing lots and lots of disk I/O: the packages get written to disk, they get re-read, they get unpacked, and lots of small files get written out. It all adds up.

A few concurrent CI runs might use up all of your burst credits, and if you have a 100GiB hard drive, you suddenly drop from 3000 IOPS to 300 IOPS. Now installing packages can take as much as 10× as long, because every read and write to disk takes so much longer.

Diagnosis

In general, this problem is much more likely if you have a small EBS volume, since it will have fewer IOPS. And you can get a hint that it’s happening if package installs are particularly slow.

But you can also explicitly check.

In the AWS console for EC2, go to Volumes, and look at the Monitoring tab for your particular volume. One of the graphs will be “Burst Balance”. If the balance has flatlined at 0%, that means you’ve got no credits left for now, and you’re running at the standard IOPS.
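If you’d rather check from a script than click around the console, the same number is exposed as the BurstBalance metric in CloudWatch, under the AWS/EBS namespace. Here’s a minimal sketch using boto3, with a made-up volume ID:

```python
from datetime import datetime, timedelta, timezone

import boto3  # assumes AWS credentials are configured in the environment

VOLUME_ID = "vol-0123456789abcdef0"  # hypothetical volume ID; use your own

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

# BurstBalance is reported as the percentage of the credit bucket remaining.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/EBS",
    MetricName="BurstBalance",
    Dimensions=[{"Name": "VolumeId", "Value": VOLUME_ID}],
    StartTime=now - timedelta(hours=3),
    EndTime=now,
    Period=300,  # 5-minute buckets
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f'{point["Average"]:.0f}%')
```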

Solving the problem

Given an existing EBS volume, the easiest way to solve the problem is to increase the size of your gp2 volume (there’s a code sketch after the list below). For example, 500GiB will give you 1500 IOPS, a much more reasonable floor than the 300 IOPS you’d get for 100GiB. Other solutions:

  • You can switch to an io1 EBS volume, which has configurable provisioned IOPS, but it’s probably not worth the cost just for running tests.
  • You can switch to using local instance storage: some EC2 instance types, e.g. c5d or m5d, come with dedicated NVMe SSD storage that has vastly more IOPS.
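Coming back to the resize option: you can do it from the console (select the volume and choose “Modify Volume”) or programmatically. Here’s a rough boto3 sketch, again with a placeholder volume ID. Keep in mind that EBS volumes can only be grown, not shrunk, and that after growing one you still need to extend the partition and filesystem inside the instance (e.g. with growpart and resize2fs on Linux):

```python
import boto3  # assumes AWS credentials are configured in the environment

VOLUME_ID = "vol-0123456789abcdef0"  # hypothetical volume ID; use your own

ec2 = boto3.client("ec2")

# Grow the volume to 500GiB, which raises the gp2 standard rate to 1500 IOPS.
ec2.modify_volume(VolumeId=VOLUME_ID, Size=500)

# The resize happens in the background; check on its progress.
mods = ec2.describe_volumes_modifications(VolumeIds=[VOLUME_ID])
print(mods["VolumesModifications"][0]["ModificationState"])
```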

If your test suite is unreasonably slow on EC2, do check for this problem. I’ve seen fixing it take a Docker build from over an hour (not sure how much over, because at that point it timed out, unfinished) down to just 15 minutes, and diagnosing and fixing the problem takes only 5 minutes of your time.