Docker vs. Singularity for data processing: UIDs and filesystem access
When you’re processing data, reading in files and writing out the result, containers are a great way to ensure reproducible runs. You package up all the binaries and libraries necessary to process your data, and each run uses the same files.
But while Docker is the most well-known container system, it’s not necessarily the easiest to use for data processing. Filesystem access, including ensuring correct UIDs, can be annoying. The problem is that Docker was not designed for this use case.
Docker is not the only way to create and run containers, though. In this article I’ll compare it to Singularity, a container runtime that was explicitly designed for data processing.
A batch script to process data
As a starting point, let’s say we have a simple script that reads in a file, processes it somehow, and writes it out:
    $ ls
    input.txt  script.py
    $ python script.py input.txt output.txt
    Wrote all data to output.txt
    $ ls -l output.txt
    -rw-r--r-- 1 itamarst itamarst 24 Mar 11 11:41 output.txt
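The article doesn't show the contents of script.py, so here is a minimal sketch of what such a script might look like. The function name process and the upper-casing step are assumptions made purely for illustration; only the command-line shape (input path, then output path) and the final message come from the session above.

```python
import sys


def process(input_path, output_path):
    """Read input_path, transform the text, and write it to output_path."""
    with open(input_path) as reader:
        data = reader.read()
    # Placeholder transformation (an assumption): upper-case the text.
    result = data.upper()
    with open(output_path, "w") as writer:
        writer.write(result)
    print(f"Wrote all data to {output_path}")


# Act as a command-line tool only when both paths are given:
#   python script.py input.txt output.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    process(sys.argv[1], sys.argv[2])
```

Note that both paths are relative, so the script depends on the current working directory and on who is running it. That dependence is exactly what matters once the script moves into a container.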
Next, we’ll package it as a Docker image:
Note: Outside the topic under discussion, the Dockerfiles in this article are not examples of best practices, since the added complexity would obscure the main point of the article.
    FROM python:3.8-slim-buster
    COPY script.py /
    RUN chmod 666 /script.py
    ENTRYPOINT ["python", "/script.py"]
Which we can build:
$ docker build -t itamarst/dataprocessor .
Docker problem #1: filesystem isolation
Here’s where the problems start. Docker was initially designed for ephemeral servers that could be scaled up and down on different machines. So by default Docker tries to isolate the running container as much as possible.
First, containers have an isolated filesystem. That means the script won’t have any access to the host filesystem by default:
    $ rm output.txt
    $ docker run itamarst/dataprocessor input.txt output.txt
    Traceback (most recent call last):
      File "/script.py", line 3, in <module>
        with open(sys.argv[1]) as reader:
    FileNotFoundError: [Errno 2] No such file or directory: 'input.txt'
So if we want to be able to run the Docker image the same way we ran the script, we need to mount the local directory as a volume, and change the working directory to be that mounted volume:
    $ docker run -v $PWD:/data -w /data itamarst/dataprocessor \
        input.txt output.txt
    Wrote all data to output.txt
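To make the mapping concrete, here is a small sketch of how a host path translates to a path inside the container when a directory is mounted at /data. The helper name to_container_path and the /home/user/project directory are hypothetical; only the /data mount point comes from the command above.

```python
from pathlib import PurePosixPath


def to_container_path(host_path, host_dir, mount_point="/data"):
    """Translate a host path under host_dir to its path inside the container.

    Raises ValueError if host_path is not inside the mounted directory,
    since such a file is not visible to the container at all.
    """
    relative = PurePosixPath(host_path).relative_to(PurePosixPath(host_dir))
    return str(PurePosixPath(mount_point) / relative)


# With -v /home/user/project:/data, this host file ...
print(to_container_path("/home/user/project/input.txt", "/home/user/project"))
# ... is visible inside the container as /data/input.txt
```

Setting the working directory to the mount point with -w is what lets the relative names input.txt and output.txt keep working unchanged.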
Docker problem #2: User IDs
The other thing that Docker isolates by default is user IDs. The container runs with a different user ID than the process that launches it, in this case as root.
And that means the output file is owned by root:
    $ ls -l output.txt
    -rw-r--r-- 1 root root 24 Mar 11 12:35 output.txt
This is not what we want.
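If you want to catch this in a pipeline rather than by eyeballing ls output, a check along these lines might help. It's a sketch using only the standard library on a Unix-like system; the function names owner_name and owned_by_current_user are my own.

```python
import os
import pwd


def owner_name(path):
    """Return the name of the user owning path, or the numeric UID as a string."""
    uid = os.stat(path).st_uid
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # The UID has no entry in the user database.
        return str(uid)


def owned_by_current_user(path):
    """Return True if path belongs to the user running this Python process."""
    return os.stat(path).st_uid == os.getuid()
```

For the root-owned output.txt shown above, owned_by_current_user("output.txt") would return False for a regular user.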
What we can do is run the container as a user ID and group ID that match the current user:
    $ rm output.txt
    rm: remove write-protected regular file 'output.txt'? y
    $ docker run -u "$(id -u):$(id -g)" \
        -v $PWD:/data -w /data itamarst/dataprocessor \
        input.txt output.txt
    Wrote all data to output.txt
    $ ls -l output.txt
    -rw-r--r-- 1 itamarst itamarst 24 Mar 11 12:38 output.txt
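If you launch containers from Python rather than from a shell, the same flags can be assembled programmatically. This is a sketch, not a definitive wrapper: the function names docker_run_command and run_in_docker are hypothetical, while the -u, -v, and -w flags mirror the command above.

```python
import os
import subprocess


def docker_run_command(image, args, host_dir=None):
    """Build a `docker run` argument list that mounts host_dir at /data
    and runs the container with the caller's user ID and group ID."""
    host_dir = os.path.abspath(host_dir or os.getcwd())
    return [
        "docker", "run",
        "-u", f"{os.getuid()}:{os.getgid()}",  # same as "$(id -u):$(id -g)"
        "-v", f"{host_dir}:/data",             # mount the host directory
        "-w", "/data",                         # and start inside it
        image,
        *args,
    ]


def run_in_docker(image, args, host_dir=None):
    """Run the container; raises CalledProcessError on a non-zero exit."""
    subprocess.run(docker_run_command(image, args, host_dir), check=True)
```

Passing the command as a list avoids shell quoting issues, for example when the current directory contains spaces.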
As you can see, these isolation problems are solvable, but the fix is annoying and somewhat fragile too. What if the process wanted to read some config file from your home directory?
Another approach: Singularity
Singularity is a container runtime, like Docker, but it starts from a very different place. It favors integration rather than isolation, while still preserving security restrictions on the container, and providing reproducible images.
Singularity has its own image format, but it can also load images from Docker registries:
    $ singularity pull docker://itamarst/dataprocessor
    ...
    INFO: Creating SIF file...
    INFO: Build complete: dataprocessor_latest.sif
    $ ls -l dataprocessor_latest.sif
    -rwxr-xr-x 1 itamarst itamarst 60305408 Mar 11 11:41 dataprocessor_latest.sif
Here we see the first difference between Docker and Singularity: Docker images are stored in the local image cache, and you’re expected to interact with them using the docker image command, e.g. docker image ls.
In contrast, Singularity images are just normal files on your filesystem. Now that we have a SIF file, we can run it:
    $ rm output.txt
    $ ./dataprocessor_latest.sif input.txt output.txt
    Wrote all data to output.txt
    $ ls -l output.txt
    -rw-r--r-- 1 itamarst itamarst 24 Mar 11 12:44 output.txt
Notice that both problems we had with Docker are automatically solved by Singularity:
- The container had access to the host filesystem automatically (your home directory, the current working directory, and /tmp are mounted by default).
- The container runs as the current user automatically.
Same results, less work.
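Because a SIF image is an ordinary executable file, you can find and manage images with regular filesystem tools instead of a dedicated image command. As a rough sketch (the helper name find_sif_images is hypothetical), listing the runnable images in a directory might look like this:

```python
import os
from pathlib import Path


def find_sif_images(directory="."):
    """Return the .sif files in directory that are marked executable, sorted by name."""
    return sorted(
        path
        for path in Path(directory).glob("*.sif")
        if path.is_file() and os.access(path, os.X_OK)
    )
```

The same property means images can be copied, backed up, or shared like any other data file.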
Docker has a bigger ecosystem than Singularity: it has Mac and Windows integration, lots and lots of tools support it, and it’s even been re-implemented from scratch (Podman, from Red Hat). Singularity is less popular and, at the moment, has beta Mac support but no Windows support.
If batch data processing is your thing, Singularity might prove a better tool than Docker.