Shrinking your Python application’s Docker image: an overview
You’ve finished building the initial Docker image for your Python application, and you push it to the registry, which takes a while, because your image is 2GB. Your image is clearly too large, so your next step is to try to make your Docker image smaller.
In this article you’ll find an overview of the many techniques you can use to shrink your image, organized approximately by the logical order of the packaging process. The focus is on Python, though many of these techniques are more generic. Techniques are broken down by category, each with suggested follow-up articles covering the details:
- Base image.
- Docker layers and their impact on image size.
- System packages (apt/dnf) and Python packages.
- Avoid copying unnecessary files.
- Additional tools, tips, and techniques.
Before you begin: should image size be your top priority?
You only have limited time to work on any given task, so it’s important to prioritize and work on the most important tasks first. And when it comes to Dockerizing your application, image size probably isn’t the most important thing to work on.
If you haven’t thought about security, debuggability, or reproducibility for your images, it’s best to focus on those first, and put off optimizations like image size until later. For an overview of my recommended process for Dockerizing your application, consider reading my free Introduction to Dockerizing for Production mini-ebook.
Base image
The starting point for your image is typically a base image of some sort. Your options include:
- Alpine-based images, which are quite small; a fine choice for Go, but probably a bad idea for Python.
- The slim Debian-based official Python images, or perhaps the latest Ubuntu LTS; see this overview on choosing a base image for Python.
- Google’s so-called “distroless” images, which are indeed quite tiny. However, Python 3 support is “experimental” and it’s not even clear what version you’re getting.
Note that base image choice also needs to be traded off against other criteria: access to system packages, Python performance, build time (particularly relevant to Alpine), and compatibility (again, relevant to Alpine).
A different approach is to choose whichever image is most convenient, and then use the docker-slim tool to remove all files your application doesn’t touch.
The tool works via runtime instrumentation, so you have to ensure that any files your application might open are actually opened while docker-slim is observing it.
Docker layers and their impact on image size
Docker’s image format is composed of layers, much like Git commits.
You can see each layer’s size using the `docker history` command.
And as with Git commits, once you’ve added some files to a layer, they are always there.
For example, let’s say you have to download a 100MB file to build your image.
The following Dockerfile will still have that 100MB file lying around, because each `RUN` command adds a layer:

```dockerfile
RUN wget https://example.com/largefile.tar.gz
RUN tar xvfz largefile.tar.gz
RUN largefile/install.sh
# BAD, this will not shrink your image:
RUN rm -rf largefile.tar.gz largefile/
```
Instead, you can combine these commands into a single layer, and then the temporary files won’t end up in the image:
```dockerfile
# GOOD, temporary files deleted before RUN ends:
RUN wget https://example.com/largefile.tar.gz && \
    tar xvfz largefile.tar.gz && \
    largefile/install.sh && \
    rm -rf largefile.tar.gz largefile/
```
Note: Outside the very specific topic under discussion, the Dockerfiles in this article are not examples of best practices, since the added complexity would obscure the main point of the article.
To ensure you’re following all the best practices you need for secure, correct, fast Dockerfiles, check out the Python on Docker Production Handbook.
When combining doesn’t work
Combining layers doesn’t work across `COPY` and `RUN` commands, since each command gets its own layer:

```dockerfile
COPY installer.sh .
# BAD, installer.sh is still in the previous layer:
RUN ./installer.sh && rm -f installer.sh
```
So what can you do?
One approach is the docker-squash tool, which lets you combine multiple layers into one.
A more standard approach, using built-in Docker functionality, is to use multi-stage builds. In a multi-stage build you have one image where you build everything, and then another image which just has the final artifacts you need to run your code.
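As a minimal sketch of the multi-stage pattern described above (the image tags, the virtualenv location, and the application paths are all illustrative assumptions, not a prescribed layout):

```dockerfile
# Stage 1: build everything in a throwaway image.
FROM python:3.12-slim AS build
RUN python -m venv /venv
COPY requirements.txt .
# Dependencies are installed into the virtualenv; any build
# tools and caches stay behind in this stage.
RUN /venv/bin/pip install --no-cache-dir -r requirements.txt

# Stage 2: runtime image gets only the final artifacts.
FROM python:3.12-slim
COPY --from=build /venv /venv
COPY myapp/ /app/myapp/
WORKDIR /app
CMD ["/venv/bin/python", "-m", "myapp"]
```

Only the layers of the final stage end up in the image you push, so anything created and deleted in the build stage costs you nothing.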
System packages and Python packages
When it comes to installing both system packages and Python packages, we’d like to:
- Not store index files (“this Debian repository has these packages available”).
- Not store the downloaded packages once they’re installed.
- Avoid installing unnecessary files, like documentation.
See my article on installing system packages for details on doing this on RPM- and Debian-based systems.
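As a sketch of those three goals on a Debian-based image (the `libpq5` package is just a placeholder for whatever your application needs):

```dockerfile
FROM python:3.12-slim
# One RUN so the index files never persist in a layer:
# --no-install-recommends skips optional extras, and
# removing /var/lib/apt/lists/ deletes the package indexes.
RUN apt-get update && \
    apt-get install -y --no-install-recommends libpq5 && \
    rm -rf /var/lib/apt/lists/*
```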
For Python packages:
- You can pass the `--no-cache-dir` option to `pip install` to avoid keeping copies of downloaded files.
- Other packaging tools for Python should have similar options.
- You can alternatively use BuildKit caching with pip or other tools, which also helps speed up your builds during development.
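A minimal sketch of the BuildKit caching approach with pip (assuming a modern Docker with BuildKit enabled; the requirements file name is illustrative):

```dockerfile
# syntax=docker/dockerfile:1
FROM python:3.12-slim
COPY requirements.txt .
# The cache mount keeps pip's download cache on the build host,
# shared across builds, without storing it in any image layer.
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt
```

Unlike `--no-cache-dir`, this keeps the cache around between builds, so rebuilding after a dependency change doesn’t re-download everything.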
Avoid copying in unnecessary files
When you `COPY` files into your image, they will make your image bigger.
If you need those files, that’s fine.
If you don’t, it’s a waste of space.
- You can use the `.dockerignore` file to list files you don’t want copied in.
- You can explicitly list which files to `COPY` in, which is also useful to avoid leaking secrets. For example, `COPY mycode/ setup.py /app` instead of `COPY . /app`.
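As a sketch, a `.dockerignore` for a Python project might look like this (the entries are illustrative; which ones you need depends on your repository):

```
# Version control and caches:
.git
__pycache__/
*.pyc
.venv/
# Files that must never end up in the image:
secrets.env
```

Each line uses the same glob syntax as Dockerfile patterns; anything matched is invisible to `COPY . /app`.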
Additional tools, tips, and techniques
- The `dive` tool is a much more sophisticated way than `docker history` to go through layers and see which files are responsible for image size.
- You can build your own custom distroless images using a technique involving RedHat `dnf` features and multi-stage builds.
- For Conda users, see this article with a bunch of useful tips, or use an alternative technique based on conda-pack.
A final reminder
As mentioned above, image size is probably the last thing you should work on; only start working on it once you’ve ensured your image is ready for production usage in other ways, starting with security. But when the time comes, you can make significant improvements by using the techniques above.
Learn how to build fast, production-ready Docker images: read the rest of the Docker packaging guide for Python.