Docker vs. Singularity for data processing: UIDs and filesystem access
When you’re processing data, reading in files and writing out the result, containers are a great way to ensure reproducible runs. You package up all the binaries and libraries necessary to process your data, and each run uses the same files.
But while Docker is the most well-known container system, it’s not necessarily the easiest to use for data processing. Filesystem access, including ensuring correct UIDs, can be annoying. The problem is that Docker was not designed for this use case.
Docker is not the only way to create and run containers, though. In this article I’ll compare it to Singularity, a container runtime that was explicitly designed for data processing.
A batch script to process data
As a starting point, let’s say we have a simple script that reads in a file, processes it somehow, and writes it out:
```
$ ls
input.txt  script.py
$ python script.py input.txt output.txt
Wrote all data to output.txt
$ ls -l output.txt
-rw-r--r-- 1 itamarst itamarst 24 Mar 11 11:41 output.txt
```
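The article never shows the contents of script.py; here's a minimal, hypothetical sketch of what it might look like. The real script could do any processing; this one just copies the input through unchanged.

```python
# Hypothetical sketch of script.py (not shown in the article).
# Real processing could be anything; here the input is copied unchanged.
import sys

def main(input_path, output_path):
    with open(input_path) as reader:
        data = reader.read()
    with open(output_path, "w") as writer:
        writer.write(data)
    print("Wrote all data to", output_path)

# script.py would end with:
#     main(sys.argv[1], sys.argv[2])
```

It would be invoked exactly as above: `python script.py input.txt output.txt`.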
Next, we’ll package it as a Docker image:
```
FROM python:3.8-slim-buster
COPY script.py /
RUN chmod 666 /script.py
ENTRYPOINT ["python", "/script.py"]
```
Note: Beyond the specific point each one demonstrates, the Dockerfiles in this article don't follow every packaging best practice; the added complexity would obscure the main point of the article.
Which we can build:
```
$ docker build -t itamarst/dataprocessor .
```
Docker problem #1: filesystem isolation
Here’s where the problems start. Docker was initially designed for ephemeral servers that could be scaled up and down across different machines. So by default, Docker tries to isolate the running container as much as possible.
First, containers have an isolated filesystem. That means the script won’t have any access to the host filesystem by default:
```
$ rm output.txt
$ docker run itamarst/dataprocessor input.txt output.txt
Traceback (most recent call last):
  File "/script.py", line 3, in <module>
    with open(sys.argv[1]) as reader:
FileNotFoundError: [Errno 2] No such file or directory: 'input.txt'
```
So if we want to be able to run the Docker image the same way we ran the script, we need to mount the local directory as a volume, and change the working directory to be that mounted volume:
```
$ docker run -v $PWD:/data -w /data itamarst/dataprocessor \
      input.txt output.txt
Wrote all data to output.txt
```
Docker problem #2: User IDs
The other thing Docker isolates by default is user IDs: the container runs with a different user ID than the process that launched it, in this case root.
On macOS and Windows this usually isn’t a problem, but on Linux the output file ends up owned by root:
```
$ ls -l output.txt
-rw-r--r-- 1 root root 24 Mar 11 12:35 output.txt
```
This is not what we want.
What we can do is run the container with a user ID and group ID that match the current user:
```
$ rm output.txt
rm: remove write-protected regular file 'output.txt'? y
$ docker run -u "$(id -u):$(id -g)" \
      -v $PWD:/data -w /data itamarst/dataprocessor \
      input.txt output.txt
Wrote all data to output.txt
$ ls -l output.txt
-rw-r--r-- 1 itamarst itamarst 24 Mar 11 12:38 output.txt
```
As you can see, these isolation problems are solvable, but the incantation is annoying to type and somewhat fragile. What if the process also needed to read a config file from your home directory?
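One way to tame the incantation is to wrap it in a small script. Here's a sketch (not from the article) that builds the `docker run` argument list in Python, assuming the `docker` CLI is installed and using the image name from above:

```python
import os
import subprocess

def docker_run_args(image, args, uid, gid, host_dir):
    """Build the docker run argv with the volume and UID flags filled in."""
    return [
        "docker", "run", "--rm",
        # Match the invoking user so output files aren't owned by root:
        "-u", f"{uid}:{gid}",
        # Mount host_dir into the container, and make it the working directory:
        "-v", f"{host_dir}:/data",
        "-w", "/data",
        image, *args,
    ]

def run_dataprocessor(*args):
    argv = docker_run_args(
        "itamarst/dataprocessor", args,
        os.getuid(), os.getgid(), os.getcwd(),
    )
    subprocess.run(argv, check=True)
```

With this, `run_dataprocessor("input.txt", "output.txt")` reproduces the command line above, but callers no longer have to remember the flags.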
Another approach: Singularity
Singularity is a container runtime, like Docker, but it starts from a very different place. It favors integration rather than isolation, while still preserving security restrictions on the container, and providing reproducible images.
Singularity has its own image format, but it can also load images from Docker registries:
```
$ singularity pull docker://itamarst/dataprocessor
...
INFO:    Creating SIF file...
INFO:    Build complete: dataprocessor_latest.sif
$ ls -l dataprocessor_latest.sif
-rwxr-xr-x 1 itamarst itamarst 60305408 Mar 11 11:41 dataprocessor_latest.sif
```
Here we see the first difference between Docker and Singularity: Docker images are stored off in a local image cache, and you’re expected to interact with them using the `docker image` command, e.g. `docker image ls`.
In contrast, Singularity images are just normal files on your filesystem. Now that we have a SIF file, we can run it:
```
$ rm output.txt
$ ./dataprocessor_latest.sif input.txt output.txt
Wrote all data to output.txt
$ ls -l output.txt
-rw-r--r-- 1 itamarst itamarst 24 Mar 11 12:44 output.txt
```
Notice that both problems we had with Docker are automatically solved by Singularity:
- The container has access to the host filesystem automatically (`$HOME`, `/tmp`, and the current working directory are mounted by default).
- The container runs as the current user automatically.
Same results, less work.
Docker has a bigger ecosystem than Singularity: it has Mac and Windows integration, lots and lots of tools support it, and it’s even been re-implemented from scratch (Podman, from Red Hat). Singularity is a less popular tool; at the moment, for example, its Mac support is in beta and it has no Windows support at all.
If batch data processing is your thing, Singularity might prove a better tool than Docker.