Speed up pip downloads in Docker with BuildKit’s new caching
Docker uses layer caching to speed up builds, but layer caching isn’t always enough. When you’re rapidly developing your Python application and therefore frequently changing the list of dependencies, you’re going to end up downloading the same packages.
Over and over and over again.
This is no fun when you depend on small packages. It’s extra no fun when you’re downloading machine learning libraries that take hundreds of megabytes.
With the release of a stable Docker BuildKit, Docker now supports a new caching mechanism that can cache these downloads.
The problem: when caching doesn’t help
Let’s say you have some code with a requirements.txt listing dependencies:
flask
And a Dockerfile that uses it to install dependencies:
FROM python:3.9-slim-buster
COPY requirements.txt .
RUN pip install -r requirements.txt
# ... etc. ...
The first time we run this, Docker will of course have to run pip install from scratch, and pip will download Flask and its dependencies.
$ docker build -t example --progress=plain .
Sending build context to Docker daemon 3.584kB
Step 1/3 : FROM python:3.9-slim-buster
---> b55839ea7a0e
Step 2/3 : COPY requirements.txt .
---> 59cc359cfb53
Step 3/3 : RUN pip install -r requirements.txt
---> Running in b0d01b9495b6
Collecting flask
Downloading Flask-1.1.2-py2.py3-none-any.whl (94 kB)
Collecting click>=5.1
Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)
Collecting itsdangerous>=0.24
Downloading itsdangerous-1.1.0-py2.py3-none-any.whl (16 kB)
...
Successfully tagged example:latest
The second time we run this, Docker’s layer caching kicks in: requirements.txt hasn’t changed, and neither has the Dockerfile, so there’s no need to rerun pip install.
$ docker build -t example --progress=plain .
Sending build context to Docker daemon 3.584kB
Step 1/3 : FROM python:3.9-slim-buster
---> b55839ea7a0e
Step 2/3 : COPY requirements.txt .
---> Using cache
---> 59cc359cfb53
Step 3/3 : RUN pip install -r requirements.txt
---> Using cache
---> 974a97388f3f
Successfully built 974a97388f3f
Successfully tagged example:latest
Now, let’s modify requirements.txt, adding another dependency:
flask
matplotlib
Now when we rebuild, pip install runs again… and it downloads Flask and all its dependencies all over again!
$ docker build -t example --progress=plain .
Sending build context to Docker daemon 3.584kB
Step 1/3 : FROM python:3.9-slim-buster
---> b55839ea7a0e
Step 2/3 : COPY requirements.txt .
---> 503a903dd4c9
Step 3/3 : RUN pip install -r requirements.txt
---> Running in 9d4aea743390
Collecting flask
Downloading Flask-1.1.2-py2.py3-none-any.whl (94 kB)
Collecting click>=5.1
Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)
Collecting itsdangerous>=0.24
Downloading itsdangerous-1.1.0-py2.py3-none-any.whl (16 kB)
Collecting Jinja2>=2.10.1
Downloading Jinja2-2.11.2-py2.py3-none-any.whl (125 kB)
Collecting MarkupSafe>=0.23
Downloading MarkupSafe-1.1.1.tar.gz (19 kB)
Collecting Werkzeug>=0.15
Downloading Werkzeug-1.0.1-py2.py3-none-any.whl (298 kB)
Collecting matplotlib
Downloading matplotlib-3.3.4-cp39-cp39-manylinux1_x86_64.whl (11.5 MB)
...
If you’re changing requirements.txt frequently, you’re going to waste a lot of time waiting for the same packages to download over and over again.
The solution: BuildKit’s new caching
When you’re running pip install (or Pipenv or Poetry) normally on your computer, it caches downloads in your home directory, so that later installs don’t require redownloading the same package.
That doesn’t work in Docker builds because each build is its own self-contained little filesystem, starting at best from a previously cached layer.
And since the unit of caching is the RUN command, either you have all the packages downloaded, or none.
To solve this category of problem, BuildKit adds a new kind of caching: you can cache a directory across builds. You should assume the cached directory can be deleted at any time, and in that sense it is quite similar to the directory caching provided by hosted CI systems.
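To make the difference between the two kinds of caching concrete, here is a toy model in Python. It is only a sketch, not how Docker is implemented: the `layer_key` and `build` functions are hypothetical, the layer cache is modeled as a set of hashes, and the persistent directory is modeled as a set of package names.

```python
import hashlib


def layer_key(instruction: str, inputs: str) -> str:
    """Toy stand-in for a layer cache key: hash of the instruction and its inputs."""
    return hashlib.sha256(f"{instruction}\n{inputs}".encode()).hexdigest()


def build(requirements, layer_cache, download_cache=None):
    """Simulate one `RUN pip install` step; return the packages downloaded."""
    key = layer_key("RUN pip install -r requirements.txt", "\n".join(requirements))
    if key in layer_cache:
        return []  # layer cache hit: the step is skipped entirely
    # Layer cache miss: the step reruns in a fresh filesystem, so without a
    # persistent directory every package has to be fetched again.
    cache = download_cache if download_cache is not None else set()
    downloaded = [pkg for pkg in requirements if pkg not in cache]
    cache.update(downloaded)
    layer_cache.add(key)
    return downloaded


# Layer caching only: any change to requirements.txt redownloads everything.
layers = set()
first = build(["flask"], layers)
unchanged = build(["flask"], layers)
changed = build(["flask", "matplotlib"], layers)
print(first, unchanged, changed)

# With a directory that persists across builds, only new packages are fetched.
layers, downloads = set(), set()
build(["flask"], layers, downloads)
with_mount = build(["flask", "matplotlib"], layers, downloads)
print(with_mount)
```

In the first scenario the changed build fetches both packages; in the second, only matplotlib is fetched, because the download cache outlives the invalidated layer.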
To use it, we just need to add an extra option to the RUN command.
I’m going to be caching the /root/.cache directory, since that is also where Pipenv and Poetry will store their files; pip uses ~/.cache/pip by default, which is /root/.cache/pip when the build runs as root.
# syntax = docker/dockerfile:1.5
FROM python:3.9-slim-buster
COPY requirements.txt .
RUN --mount=type=cache,target=/root/.cache \
pip install -r requirements.txt
# ... etc. ...
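If the image sets a different user, or overrides the cache location through environment variables, the mount target has to follow. As a rough guide, the sketch below approximates how pip picks its cache directory on Linux: the `PIP_CACHE_DIR` variable wins, then `XDG_CACHE_HOME`, then `~/.cache`. The `pip_cache_dir` helper is hypothetical and written for illustration only; it is not part of pip.

```python
from pathlib import PurePosixPath


def pip_cache_dir(env: dict, home: str) -> str:
    """Hypothetical helper: approximate where pip caches downloads on Linux."""
    if env.get("PIP_CACHE_DIR"):
        return env["PIP_CACHE_DIR"]  # explicit override wins
    # Otherwise pip follows the XDG convention, falling back to ~/.cache.
    base = env.get("XDG_CACHE_HOME") or str(PurePosixPath(home) / ".cache")
    return str(PurePosixPath(base) / "pip")


# As root with no overrides, the result falls under the mounted /root/.cache.
root_default = pip_cache_dir({}, "/root")
print(root_default)
```

Inside a real image, running `pip cache dir` reports the directory pip actually uses, which is the safer check.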
Note: Outside any specific best practice being demonstrated, the Dockerfiles in this article are not examples of best practices, since the added complexity would obscure the main point of the article.
I’m also going to set the DOCKER_BUILDKIT environment variable to ensure BuildKit is used (this is unnecessary starting with Docker Engine 23.0 on Linux, and on sufficiently recent Windows/macOS Docker Desktop installs).
$ export DOCKER_BUILDKIT=1
I tweak requirements.txt to force a build:
matplotlib
flask
Now when I build the Docker image, the first time it will have to download everything from scratch:
$ docker build -t example --progress=plain .
...
#9 [stage-0 3/3] RUN --mount=type=cache,target=/root/.cache pip install -r requirements.txt
#9 sha256:cff2b41a0170dccc42eda05f6a5495d3b00436849f28c262356eaae4a29a804f
#9 2.816 Collecting flask
#9 2.896 Downloading Flask-1.1.2-py2.py3-none-any.whl (94 kB)
#9 3.202 Collecting click>=5.1
#9 3.215 Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)
#9 3.226 Collecting itsdangerous>=0.24
#9 3.240 Downloading itsdangerous-1.1.0-py2.py3-none-any.whl (16 kB)
#9 3.246 Collecting Jinja2>=2.10.1
#9 3.269 Downloading Jinja2-2.11.2-py2.py3-none-any.whl (125 kB)
#9 3.405 Collecting MarkupSafe>=0.23
#9 3.423 Downloading MarkupSafe-1.1.1.tar.gz (19 kB)
...
Notice the output format is different; that’s because we’re using BuildKit.
Now, I edit requirements.txt again:
matplotlib
flask
django
With normal Docker caching I would expect Flask and matplotlib to be downloaded again. But this time:
$ docker build -t example --progress=plain .
...
#9 [stage-0 3/3] RUN --mount=type=cache,target=/root/.cache pip install -r requirements.txt
#9 sha256:42cbbeab455a114b05cbbcaa09a38fa56bc4ce6da820e3957055b164d24ef36f
#9 2.421 Collecting django
#9 2.504 Downloading Django-3.1.5-py3-none-any.whl (7.8 MB)
#9 3.579 Collecting asgiref<4,>=3.2.10
#9 3.597 Downloading asgiref-3.3.1-py3-none-any.whl (19 kB)
#9 3.603 Collecting sqlparse>=0.2.2
#9 3.618 Downloading sqlparse-0.4.1-py3-none-any.whl (42 kB)
#9 3.624 Collecting flask
#9 3.626 Using cached Flask-1.1.2-py2.py3-none-any.whl (94 kB)
#9 3.910 Collecting click>=5.1
#9 3.912 Using cached click-7.1.2-py2.py3-none-any.whl (82 kB)
...
Notice all the “Using cached <package>” lines: we didn’t have to download Flask, because it was found in the local cache!
You can learn more about this and other BuildKit features in the docker/dockerfile docs.
Some limitations to BuildKit caching
The cached files are stored inside Docker. As such, if you are doing your builds in some sort of cloud CI service that starts with a new environment every time, the cache won’t survive.
You might be able to convince your CI system to cache /var/lib/docker/buildkit/cache.db (e.g. on GitHub Actions using the cache action).
I haven’t tried this, so I’m not sure if it will work, but if it does you’ll also save downloads across builds.
But you can at the very minimum use this technique to speed up builds during development, or on CI servers with a persistent filesystem.