Shrinking your Python application’s Docker image: an overview
You’ve finished building the initial Docker image for your Python application, and you push it to the registry, which takes a while, because your image is 2GB. Your image is clearly too large, so your next step is to try to make your Docker image smaller.
In this article you’ll find an overview of the many techniques you can use to shrink your image, organized approximately by the logical order of the packaging process. The focus is on Python, though many of these techniques are more generic. Techniques are broken down by category, each with suggested follow-up articles covering the details:
- Base image.
- Docker layers and their impact on image size.
- System packages (apt/dnf) and Python packages.
- Avoid copying unnecessary files.
- Additional tools, tips, and techniques.
Before you begin: should image size be your top priority?
You only have limited time to work on any given task, so it’s important to prioritize and work on the most important tasks first. And when it comes to Dockerizing your application, image size probably isn’t the most important thing to work on.
If you haven’t thought about security, debuggability, or reproducibility for your images, it’s best to focus on those first, and put off optimizations like image size until later. For an overview of my recommended process for Dockerizing your application, consider reading my free Introduction to Dockerizing for Production mini-ebook.
Base image
The starting point for your image is typically a base image of some sort. Your options include:
- Alpine-based images, which are quite small; a fine choice for Go, but probably a bad idea for Python.
- The slim Debian-based official Python images, or perhaps the latest Ubuntu LTS; see this overview on choosing a base image for Python.
- Google’s so-called “distroless” images, which are indeed quite tiny. However, Python 3 support is “experimental” and it’s not even clear what version you’re getting.
Note that base image choice also needs to be traded off against other criteria: access to system packages, Python performance, build time (particularly relevant to Alpine), and compatibility (again, relevant to Alpine).
A different approach is to choose whichever image is most convenient, and then use the docker-slim tool to remove all files your application doesn’t touch.
The tool works via runtime instrumentation, so you have to ensure that any files your application might open are actually opened while docker-slim is observing it.
Docker layers and their impact on image size
Docker’s image format is composed of layers, much like Git commits.
You can see each layer’s size using the `docker history` command.
And as with Git commits, once you’ve added some files to a layer, they are always there.
For example, let’s say you have to download a 100MB file to build your image.
The following Dockerfile will still have that 100MB file lying around, because each `RUN` command adds a layer:

```dockerfile
RUN wget https://example.com/largefile.tar.gz
RUN tar xvfz largefile.tar.gz
RUN largefile/install.sh
# BAD, this will not shrink your image:
RUN rm -rf largefile.tar.gz largefile/
```
Instead, you can combine these commands into a single layer, and then the temporary files won’t end up in the image:
```dockerfile
# GOOD, temporary files deleted before RUN ends:
RUN wget https://example.com/largefile.tar.gz && \
    tar xvfz largefile.tar.gz && \
    largefile/install.sh && \
    rm -rf largefile.tar.gz largefile/
```
Note: Outside the very specific topic under discussion, the Dockerfiles in this article are not examples of best practices, since the added complexity would obscure the main point of the article.
To ensure you’re following all the best practices you need for secure, correct, fast Dockerfiles, check out the Python on Docker Production Handbook.
When combining doesn’t work
Combining layers doesn’t work across `COPY` and `RUN` commands, since each command gets its own layer:

```dockerfile
COPY installer.sh .
# BAD, installer.sh is still in the previous layer:
RUN ./installer.sh && rm -f installer.sh
```
So what can you do?
One approach is the docker-squash tool, which lets you combine multiple layers into one.
A more standard approach, using built-in Docker functionality, is to use multi-stage builds. In a multi-stage build you have one image where you build everything, and then another image which just has the final artifacts you need to run your code.
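As a minimal sketch of the multi-stage pattern described above (the image tags, the virtualenv location, and the application paths are all illustrative assumptions, not a prescribed layout):

```dockerfile
# Stage 1: build everything in a throwaway image.
FROM python:3.12-slim AS build
RUN python -m venv /venv
COPY requirements.txt .
# Dependencies are installed into the virtualenv; any build
# tools and caches stay behind in this stage.
RUN /venv/bin/pip install --no-cache-dir -r requirements.txt

# Stage 2: runtime image gets only the final artifacts.
FROM python:3.12-slim
COPY --from=build /venv /venv
COPY myapp/ /app/myapp/
WORKDIR /app
CMD ["/venv/bin/python", "-m", "myapp"]
```

Only the layers of the final stage end up in the image you push, so anything created and deleted in the build stage costs you nothing.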
System packages and Python packages
When it comes to installing both system packages and Python packages, we’d like to:
- Not store index files (“this Debian repository has these packages available”).
- Not store the downloaded packages once they’re installed.
- Avoid installing unnecessary files, like documentation.
See my article on installing system packages for details on doing this on RPM- and Debian-based systems.
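As a sketch of those three goals on a Debian-based image (the `libpq5` package is just a placeholder for whatever your application needs):

```dockerfile
FROM python:3.12-slim
# One RUN so the index files never persist in a layer:
# --no-install-recommends skips optional extras, and
# removing /var/lib/apt/lists/ deletes the package indexes.
RUN apt-get update && \
    apt-get install -y --no-install-recommends libpq5 && \
    rm -rf /var/lib/apt/lists/*
```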
For Python packages:
- You can pass the `--no-cache-dir` option to `pip install` to avoid keeping copies of downloaded files.
- Other packaging tools for Python should have similar options.
- You can alternatively use BuildKit caching with pip or other tools, which also helps speed up your builds during development.
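A minimal sketch of the BuildKit caching approach with pip (assuming a modern Docker with BuildKit enabled; the requirements file name is illustrative):

```dockerfile
# syntax=docker/dockerfile:1
FROM python:3.12-slim
COPY requirements.txt .
# The cache mount keeps pip's download cache on the build host,
# shared across builds, without storing it in any image layer.
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt
```

Unlike `--no-cache-dir`, this keeps the cache around between builds, so rebuilding after a dependency change doesn’t re-download everything.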
Avoid copying in unnecessary files
When you `COPY` files into your image, they will make your image bigger.
If you need those files, that’s fine.
If you don’t, it’s a waste of space.
- You can use the `.dockerignore` file to list files you don’t want copied in.
- You can explicitly list which files to `COPY` in, which is also useful to avoid leaking secrets. For example, `COPY mycode/ setup.py /app` instead of `COPY . /app`.
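As a sketch, a `.dockerignore` for a Python project might look like this (the entries are illustrative; which ones you need depends on your repository):

```
# Version control and caches:
.git
__pycache__/
*.pyc
.venv/
# Files that must never end up in the image:
secrets.env
```

Each line uses the same glob syntax as Dockerfile patterns; anything matched is invisible to `COPY . /app`.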
Additional tools, tips, and techniques
- The `dive` tool is a much more sophisticated way than `docker history` to go through layers and see which files are responsible for image size.
- You can build your own custom distroless images using a technique involving RedHat `dnf` features and multi-stage builds.
- For Conda users, see this article with a bunch of useful tips, or use an alternative technique based on conda-pack.
A final reminder
As mentioned above, image size is probably the last thing you should work on; only start working on it once you’ve ensured your image is ready for production usage in other ways, starting with security. But when the time comes, you can make significant improvements by using the techniques above.
Learn how to build fast, production-ready Docker images: read the rest of the Docker packaging guide for Python.