Poetry vs. Docker caching: Fight!

Docker packaging is an exercise in shoving square pegs into round holes, over and over and over again.

Consider the Poetry packaging tool for Python. One of Poetry’s features can make Docker rebuilds slower, by breaking Docker’s caching.

And it’s not a bad feature, there’s nothing really wrong with it, it just—doesn’t fit.

Let’s see what the problem is, go over some workarounds—which have their own problems, obviously—and then briefly consider why everything about Docker packaging is always slightly broken.

Recap: faster rebuilds by installing dependencies separately

As a reminder:

  1. When you rebuild a Docker image it can use caching to speed up the rebuild process. The caching will be invalidated if you COPY in a changed file, and every step after that COPY has to be re-run as well.
  2. When installing your dependencies and code, you’ll therefore want to copy in the dependencies file first, and separately. This lets dependency installation be sped up by caching even if your code changes.

For example, we copy requirements.txt in first, and install dependencies using it, then COPY in the rest of the code:

FROM python:3.8-slim-buster
COPY requirements.txt /tmp/
RUN pip install -r /tmp/requirements.txt
COPY . /tmp/myapp
RUN pip install /tmp/myapp

Note: Outside any specific best practice being demonstrated, the Dockerfiles in this article are not examples of best practices, since the added complexity would obscure the main point of the article.

Poetry time

Let’s see how we do this two-step install with Poetry.

Poetry has two relevant files.

  1. pyproject.toml, the standard Python config file, holds Poetry-specific configuration, including your high-level dependencies.
  2. poetry.lock contains pinned versions of all transitive dependencies.

We’ll have to copy them both in:

FROM python:3.8-slim-buster

WORKDIR /app

# Install poetry:
RUN pip install poetry

# Copy in the config files:
COPY pyproject.toml poetry.lock ./
# Install only dependencies. (--no-dev is deprecated in Poetry 1.2 and later
# in favor of --without dev.)
RUN poetry install --no-root --no-dev

# Copy in everything else and install:
COPY . .
RUN poetry install --no-dev

So far, so good: unless our dependencies change, thereby changing pyproject.toml and poetry.lock, Docker image rebuilds will be able to use cached layers because the two copied files won’t have changed.

But there’s a problem.

pyproject.toml: more than just dependencies

As mentioned above, pyproject.toml is where you list dependencies when you’re using Poetry. Let’s take a look at an example:

[tool.poetry]
name = "myexample"
version = "0.1.0"
description = ""
authors = ["Itamar Turner-Trauring"]

[tool.poetry.dependencies]
python = "^3.6"
Flask = "^1.1.2"

# ...

Do you spot the problem?

  1. There’s a version field for your application.
  2. Every time you update that version field, your pyproject.toml changes.
  3. This invalidates the Docker cache when you rebuild your image.
  4. As a result, your Docker build has to install all your dependencies, slowing things down.

Now, quite possibly you only update that field infrequently, and you can live with occasional slow rebuilds. But if you’re doing some sort of continuous deployment process where you’re continuously updating the version field, your Docker builds are going to be slow.

Some workarounds

First, as mentioned above, you can choose not to care.

Second, instead of installing dependencies with Poetry, you can install them with pip. Specifically, you can use poetry export to create a standalone requirements.txt, and then just copy the requirements.txt in instead of pyproject.toml and poetry.lock.

The downside is that you need Poetry installed both in and outside the Docker image in your CI build, and installing with pip isn’t quite the same as a normal Poetry install.
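
As a rough sketch of this second workaround, the Dockerfile might look something like the following. It assumes you run poetry export on the host or in CI before the build (in newer Poetry versions the export command is provided by the poetry-plugin-export plugin), and that your pyproject.toml declares a [build-system] table so pip can build the project. The /app directory and the --no-deps flag are my choices for illustration, not requirements of the approach.

```dockerfile
# Before building, on the host or in CI (assumes Poetry is available there):
#   poetry export -f requirements.txt --output requirements.txt

FROM python:3.8-slim-buster

WORKDIR /app

# Copy in only the exported, fully pinned requirements file. Bumping the
# version field in pyproject.toml does not change this file, so the layer
# below stays cached until the dependencies themselves change.
COPY requirements.txt ./
RUN pip install -r requirements.txt

# Copy in everything else and install the application itself. --no-deps
# skips dependency resolution, since the pinned versions are already installed.
COPY . .
RUN pip install --no-deps .
```

Because requirements.txt is generated from the locked dependencies rather than from the version field, a version bump alone only invalidates the final COPY and install steps.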

Third, you can use poetry-dynamic-versioning, a plug-in for Poetry that uses Git tags instead of pyproject.toml to set your application’s version. That way you won’t have to edit pyproject.toml to update the version.

This seems appealing until you realize you now need to copy .git into your Docker build, which has its own downsides, like larger images unless you’re using multi-stage builds.

A different plugin, poetry-version-plugin, also supports reading the version from a Python file, which sidesteps this problem, although it’s marked as experimental.

Fourth, this is conceivably something Poetry could fix. The problem is that pyproject.toml serves multiple purposes: the version, dependencies, and more. A full install needs all of that, but for the purpose of installing just the dependencies you probably only need poetry.lock, so Poetry could support installing from that file alone.

I considered filing an issue, but there are already hundreds of issues in the tracker and I felt a little bad.

Why is everything broken?

A consistent theme with Docker packaging is that nothing works quite right. Docker packaging interacts badly with everything from Unix signals—a 50-year-old technology!—to quite recent projects like Poetry.

So why is that? Partially, it’s because these technologies have their own issues. For example, the interaction of Unix signals, shells, and terminals is extremely complex to the point where I immediately forget how it works every time I attempt to (re)learn it.

But the problem with Poetry is arguably down to the way Docker’s build works: Dockerfiles are essentially glorified shell scripts, and the build system’s semantic units are files and complete command runs. There is no way in a normal Docker build to access the actually relevant semantic information: in a better build system, you’d only reinstall the dependencies that changed, instead of reinstalling all dependencies any time the list changed.

Hopefully a better build system will someday replace the Docker default. Until then, it’s square pegs into round holes.