Speed up pip downloads in Docker with BuildKit’s new caching

Docker uses layer caching to speed up builds, but layer caching isn’t always enough. When you’re rapidly developing your Python application and therefore frequently changing the list of dependencies, you’re going to end up downloading the same packages.

Over and over and over again.

This is no fun when you depend on small packages. It’s extra no fun when you’re downloading machine learning libraries that take hundreds of megabytes.

With the release of a stable Docker BuildKit, Docker now supports a new caching mechanism that can cache these downloads.

The problem: when caching doesn’t help

Let’s say you have some code with a requirements.txt listing dependencies:

flask

And a Dockerfile that uses it to install dependencies:

FROM python:3.9-slim-buster
COPY requirements.txt .
RUN pip install -r requirements.txt
# ... etc. ...

The first time we run this, Docker will of course have to run pip install from scratch, and pip will download Flask and its dependencies.

$ docker build -t example --progress=plain .
Sending build context to Docker daemon  3.584kB
Step 1/3 : FROM python:3.9-slim-buster
 ---> b55839ea7a0e
Step 2/3 : COPY requirements.txt .
 ---> 59cc359cfb53
Step 3/3 : RUN pip install -r requirements.txt
 ---> Running in b0d01b9495b6
Collecting flask
  Downloading Flask-1.1.2-py2.py3-none-any.whl (94 kB)
Collecting click>=5.1
  Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)
Collecting itsdangerous>=0.24
  Downloading itsdangerous-1.1.0-py2.py3-none-any.whl (16 kB)
...
Successfully tagged example:latest

The second time we run this, Docker’s layer caching kicks in: neither requirements.txt nor the Dockerfile has changed, so there’s no need to rerun pip install.

$ docker build -t example --progress=plain .
Sending build context to Docker daemon  3.584kB
Step 1/3 : FROM python:3.9-slim-buster
 ---> b55839ea7a0e
Step 2/3 : COPY requirements.txt .
 ---> Using cache
 ---> 59cc359cfb53
Step 3/3 : RUN pip install -r requirements.txt
 ---> Using cache
 ---> 974a97388f3f
Successfully built 974a97388f3f
Successfully tagged example:latest

Now, let’s modify requirements.txt, adding another dependency:

flask
matplotlib

Now when we rebuild, pip install runs again… and it downloads Flask and all its dependencies all over again!

$ docker build -t example --progress=plain .
Sending build context to Docker daemon  3.584kB
Step 1/3 : FROM python:3.9-slim-buster
 ---> b55839ea7a0e
Step 2/3 : COPY requirements.txt .
 ---> 503a903dd4c9
Step 3/3 : RUN pip install -r requirements.txt
 ---> Running in 9d4aea743390
Collecting flask
  Downloading Flask-1.1.2-py2.py3-none-any.whl (94 kB)
Collecting click>=5.1
  Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)
Collecting itsdangerous>=0.24
  Downloading itsdangerous-1.1.0-py2.py3-none-any.whl (16 kB)
Collecting Jinja2>=2.10.1
  Downloading Jinja2-2.11.2-py2.py3-none-any.whl (125 kB)
Collecting MarkupSafe>=0.23
  Downloading MarkupSafe-1.1.1.tar.gz (19 kB)
Collecting Werkzeug>=0.15
  Downloading Werkzeug-1.0.1-py2.py3-none-any.whl (298 kB)
Collecting matplotlib
  Downloading matplotlib-3.3.4-cp39-cp39-manylinux1_x86_64.whl (11.5 MB)
...

If you’re changing requirements.txt, you’re going to waste a lot of time waiting for the same packages to download over and over again.

The solution: BuildKit’s new caching

When you’re running pip install (or Pipenv or Poetry) normally on your computer, it caches downloads in your home directory, so that later installs don’t require redownloading the same package. That doesn’t work in Docker builds because each build is its own self-contained little filesystem, starting at best from a previously cached layer.

And since the unit of caching is the RUN command, either you have all the packages downloaded, or none.

To solve this category of problem, BuildKit adds a new kind of caching: you can cache a directory across builds. You should assume the cached directory may be deleted at any time; in that sense it is quite similar to the directory caching provided by online CI systems.

To use it, we just need to add an extra option to the RUN command. I’m going to be caching the /root/.cache directory, since that is also where Pipenv and Poetry will store their files; pip uses ~/.cache/pip by default.

# syntax = docker/dockerfile:1.2
FROM python:3.9-slim-buster
COPY requirements.txt .
RUN --mount=type=cache,target=/root/.cache \
    pip install -r requirements.txt
# ... etc. ...
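
If you only use pip, a narrower mount should work too. The following is a minimal sketch rather than the exact setup used above: it writes a scratch copy of the example project into a hypothetical ./buildkit-pip-demo directory, with the cache mount pointed only at /root/.cache/pip. It assumes the build runs as root, as it does by default in the python:3.9-slim-buster image, so that ~/.cache/pip resolves to /root/.cache/pip. Note that pip's --no-cache-dir option, which is common in Dockerfiles, would bypass the cache entirely, so it is left out here.

```shell
# Hypothetical scratch directory for the demo project.
demo_dir="./buildkit-pip-demo"
mkdir -p "$demo_dir"

# Same single dependency as the first example above.
printf 'flask\n' > "$demo_dir/requirements.txt"

# Quoted 'EOF' keeps the backslash line continuation intact in the output.
cat > "$demo_dir/Dockerfile" <<'EOF'
# syntax = docker/dockerfile:1.2
FROM python:3.9-slim-buster
COPY requirements.txt .
# Mount only pip's cache directory; do not pass --no-cache-dir to pip.
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt
EOF

echo "Wrote $demo_dir/Dockerfile"
```

Building from that directory with BuildKit enabled should then behave like the Dockerfile above, except that only pip's files are kept between builds.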

Note: Outside the very specific topic under discussion, the Dockerfiles in this article are not examples of best practices, since the added complexity would obscure the main point of the article.

I’m also going to set the DOCKER_BUILDKIT environment variable to ensure BuildKit is used.

$ export DOCKER_BUILDKIT=1
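
If you would rather not change your whole shell session, the variable can also be set for a single command. A small sketch, where build_with_buildkit is a hypothetical helper name rather than anything Docker provides:

```shell
# Hypothetical helper: run `docker build` with BuildKit enabled for this
# one invocation, passing any extra arguments straight through.
build_with_buildkit() {
    DOCKER_BUILDKIT=1 docker build --progress=plain "$@"
}
```

For example, `build_with_buildkit -t example .` runs the same build as in this article without exporting anything.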

I tweak requirements.txt to force a build:

matplotlib
flask

Now when I build the Docker image, the first time it will have to download everything from scratch:

$ docker build -t example --progress=plain .
...
#9 [stage-0 3/3] RUN --mount=type=cache,target=/root/.cache pip install -r requirements.txt
#9 sha256:cff2b41a0170dccc42eda05f6a5495d3b00436849f28c262356eaae4a29a804f
#9 2.816 Collecting flask
#9 2.896   Downloading Flask-1.1.2-py2.py3-none-any.whl (94 kB)
#9 3.202 Collecting click>=5.1
#9 3.215   Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)
#9 3.226 Collecting itsdangerous>=0.24
#9 3.240   Downloading itsdangerous-1.1.0-py2.py3-none-any.whl (16 kB)
#9 3.246 Collecting Jinja2>=2.10.1
#9 3.269   Downloading Jinja2-2.11.2-py2.py3-none-any.whl (125 kB)
#9 3.405 Collecting MarkupSafe>=0.23
#9 3.423   Downloading MarkupSafe-1.1.1.tar.gz (19 kB)
...

Notice the output format is different; that’s because we’re using BuildKit.

Now, I edit requirements.txt again:

matplotlib
flask
django

With normal Docker caching I would expect Flask and matplotlib to be downloaded again. But this time:

$ docker build -t example --progress=plain .
...
#9 [stage-0 3/3] RUN --mount=type=cache,target=/root/.cache pip install -r requirements.txt
#9 sha256:42cbbeab455a114b05cbbcaa09a38fa56bc4ce6da820e3957055b164d24ef36f
#9 2.421 Collecting django
#9 2.504   Downloading Django-3.1.5-py3-none-any.whl (7.8 MB)
#9 3.579 Collecting asgiref<4,>=3.2.10
#9 3.597   Downloading asgiref-3.3.1-py3-none-any.whl (19 kB)
#9 3.603 Collecting sqlparse>=0.2.2
#9 3.618   Downloading sqlparse-0.4.1-py3-none-any.whl (42 kB)
#9 3.624 Collecting flask
#9 3.626   Using cached Flask-1.1.2-py2.py3-none-any.whl (94 kB)
#9 3.910 Collecting click>=5.1
#9 3.912   Using cached click-7.1.2-py2.py3-none-any.whl (82 kB)
...

Notice all the “Using cached <package>” lines: we didn’t have to download Flask, because it was found in the local cache!

You can learn more about this and other BuildKit features in the docker/dockerfile docs.

Some limitations to BuildKit caching

The cached files are stored inside Docker. As such, if you are doing your builds in some sort of cloud CI service that starts with a new environment every time, the cache won’t survive.

You might be able to convince your CI system to cache /var/lib/docker/buildkit/cache.db (e.g. on GitHub Actions using the cache action). I haven’t tried this, so I’m not sure if it will work, but if it does you’ll also save downloads across builds.

But you can at the very minimum use this technique to speed up builds during development, or on CI servers with a persistent filesystem.
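
Since the cache lives inside Docker, Docker's own commands can inspect and clear it, which is handy when you want to test a build from a cold cache. A minimal sketch, where reset_build_cache is a hypothetical helper name; note that it removes all unused build cache, which should include the cached pip downloads but is not limited to them:

```shell
# Hypothetical helper: show Docker's disk usage, clear the build cache,
# then show usage again so the difference is visible.
reset_build_cache() {
    docker system df                     # output includes a "Build Cache" row
    docker builder prune --all --force   # --all: all unused cache; --force: no prompt
    docker system df
}
```

After running it, the next build should download every package again, just like the first build with the cache mount.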

