Building on solid ground: ensuring reproducible Docker builds for Python
Sometime last month you built a Docker image for your Python application. Today you start with the same revision, fix a minor bug, and build a new image from scratch.
And suddenly you’ve got a mess on your hands.
If your build is not reproducible, you might end up installing different versions of your Python dependencies, system packages, and perhaps even a different version of the operating system. The resulting image might have new bugs, behave in unexpected ways, or even fail to work completely due to incompatible changes.
If your build isn’t reproducible, a minor bug fix can spiral out of control into a series of unwanted and unnecessary major version upgrades.
But if your build is reproducible, with the same inputs resulting in the same output, your new image will be mostly the same as your old image: the only difference will be the bug fix.
Yes, you’ll need to update dependencies over time, if only for security fixes. But you’ll be able to get controlled change, instead of sudden unexpected avalanches of changes.
There are multiple layers of reproducibility, from operating system to Python dependencies, so let’s see how to deal with each.
The base image
Typically you’ll build your image off some operating system base image.
My default suggestion for a base image is the official Python images, which are based on Debian GNU/Linux. But any operating system with long-term support, guaranteeing stability of libraries and binaries, will make a good choice—see here for an extended discussion.
There are different variants of the official Python base image, and some will lead to non-reproducible builds. Here are the ones you should avoid:
- python:3 will install the latest version of Python 3 on the latest version of Debian. If a new version of Debian is released, suddenly your base operating system has changed. If a new version of Python is released, suddenly your image is using Python 3.9 instead of Python 3.8.
- python:3.8 is a little better, but you will still get a completely different release of Debian when its next stable version is released.
To keep the operating system stable, you'll want to specify a tag variant that includes the base OS:
- python:3.8-slim-buster is the latest sub-release of Python 3.8, installed on top of Debian 10 ("Buster"); both slim and regular variants are available. At the time of writing this is 3.8.2; later it will be 3.8.3, and so on.
- python:3.8.2-slim-buster is a specific sub-release, Python 3.8.2.
Both images might still change over time, however: they will be updated with newer releases of pip, for example, and rebuilt with newer system packages.
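As a concrete sketch, pinning the tag in your Dockerfile looks like this (the sub-release shown is just the one current at the time of writing):

```dockerfile
# Pin Python to an exact sub-release and Debian to a specific stable release,
# so a rebuild months from now starts from the same OS and interpreter.
FROM python:3.8.2-slim-buster
```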
If you want complete reproducibility, the exact same base image every time, you can do one of the following:
- Specify the image by hash: e.g. python@sha256:89d719142de465e7c80195dff820a0bbbbba49b148fbd97abf4b58889372b5e3 is a specific image, unchanging even if the tags get pointed at new images. Unfortunately, there's no guarantee that Docker Hub will continue to store older releases.
- Copy the image from Docker Hub to your own registry, with stable tags you never overwrite, and then use that as a base image.
- Use the Bitnami Python images, which promise permanent tags that always point at the same image.
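For instance, pinning by hash in a Dockerfile might look like this, using the digest mentioned above:

```dockerfile
# A digest identifies one exact image; unlike a tag, it can never be
# re-pointed at a different image.
FROM python@sha256:89d719142de465e7c80195dff820a0bbbbba49b148fbd97abf4b58889372b5e3
```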
Sometimes the system packages installed in the base image aren't sufficient. In that case you will need to install additional packages, for example by using apt-get on Debian/Ubuntu or dnf on CentOS.
The question at this point is how reproducible you want your build to be.
One of the benefits of using a base operating system like Debian stable, Ubuntu LTS, or CentOS is compatibility over time. As long as you stick to a major release, the maintainers will try to release critical bug fixes and security updates to libraries without making incompatible changes.
This is the theory.
In practice, that might not be good enough for you. If you really want to ensure specific package versions get installed, instead of doing:
RUN apt-get install -y nginx
You can install a specific release:
RUN apt-get install -y nginx=1.14.2-2+deb10u1
dnf supports a similar syntax.
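Inside a Dockerfile, a sketch of a pinned install might look like this (the nginx version is the one from the example above, valid for Debian 10; other releases will have different version strings):

```dockerfile
# apt-get update must run in the same RUN step as the install; otherwise a
# cached, stale package index might no longer offer the pinned version.
RUN apt-get update && \
    apt-get install -y nginx=1.14.2-2+deb10u1 && \
    rm -rf /var/lib/apt/lists/*
```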
Pinning your Python dependencies
The final step to reproducible builds is making sure you get reproducible Python dependencies installed.
If you run:
RUN pip install flask
you will get one version today, and potentially a very different version in six months.
You can pin to a specific version:
RUN pip install flask==1.1
But Flask depends on many other libraries, and those dependencies might change out from under you.
So when you install Python dependencies, you want to install specific versions of all the transitive dependencies; that is, the dependencies, the dependencies' dependencies, and so on, all pinned to particular versions.
Some tools that make this easy are pipenv, poetry, and pip-tools. Of the three, pip-tools is the simplest: it will take a requirements.in file that looks like this:
flask
And output a transitively pinned requirements.txt:
click==7.0           # via flask
flask==1.1.1
itsdangerous==1.1.0  # via flask
jinja2==2.10.1       # via flask
markupsafe==1.1.1    # via jinja2
werkzeug==0.16.0     # via flask
Make sure to keep requirements.in checked in to source control; you'll need it when it comes time to update your dependencies.
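In your Dockerfile, you then install from the pinned requirements.txt rather than from requirements.in; a minimal sketch:

```dockerfile
# Copy only the pinned requirements first, so this layer stays cached
# until the dependencies actually change.
COPY requirements.txt .
RUN pip install -r requirements.txt
```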
If you use Conda, you can similarly pin all dependencies by using the output of conda env export as your environment.yml (with some caveats I will eventually get around to writing about; at a minimum you might need to fix the sort order of the channels).
A good starting point
Reproducibility can take work, so if you’re in a hurry here’s a minimal version that should work for many use cases:
- Use python:3.8-slim-buster as your base image.
- Use pip-tools to generate a transitively pinned requirements.txt.
- Update your dependencies at appropriate times.
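Putting those pieces together, a minimal Dockerfile along these lines might look like the following sketch (app.py is just a placeholder for your own application's entry point):

```dockerfile
# Pinned tag: stable Python 3.8 sub-releases on Debian 10 "Buster".
FROM python:3.8-slim-buster

# Install transitively pinned dependencies generated by pip-tools.
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy in the application itself; app.py is a placeholder name.
COPY app.py .
CMD ["python", "app.py"]
```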
With just half an hour’s work you’ll have a far more reliable and reproducible image. Go do it now, before your next bugfix breaks your application by unnecessarily upgrading all its dependencies.
Learn how to build fast, production-ready Docker images—read the rest of the Docker packaging guide for Python.
Join my live, online class to learn more Docker packaging best practices for Python in production
You’re about to ship your Python application into production using Docker: your images are going to be critical infrastructure.
On June 11th and 12th, learn how to create production-ready packaging for Python applications by joining a live online class.