Docker vs. Singularity for data processing: UIDs and filesystem access
When you’re processing data, reading in files and writing out the result, containers are a great way to ensure reproducible runs. You package up all the binaries and libraries necessary to process your data, and each run uses the same files.
But while Docker is the most well-known container system, it’s not necessarily the easiest to use for data processing. Filesystem access, including ensuring correct UIDs, can be annoying. The problem is that Docker was not designed for this use case.
Docker is not the only way to create and run containers, though. In this article I’ll compare it to Singularity, a container runtime that was explicitly designed for data processing.
A batch script to process data
As a starting point, let’s say we have a simple script that reads in a file, processes it somehow, and writes it out:
$ ls
input.txt  script.py
$ python script.py input.txt output.txt
Wrote all data to output.txt
$ ls -l output.txt
-rw-r--r-- 1 itamarst itamarst 24 Mar 11 11:41 output.txt
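The article never shows the contents of script.py, so here is a minimal sketch of what such a script might look like. The `process` function and its upper-casing transformation are hypothetical stand-ins for whatever real processing you need; only the command-line shape (input path, then output path) comes from the session above.

```python
import sys


def process(input_path, output_path):
    """Read the input file, transform each line, and write the result.

    The upper-casing step is a placeholder (an assumption) for real processing.
    """
    with open(input_path) as reader, open(output_path, "w") as writer:
        for line in reader:
            writer.write(line.upper())


# Only run when invoked as: python script.py input.txt output.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    process(sys.argv[1], sys.argv[2])
    print(f"Wrote all data to {sys.argv[2]}")
```

Because both paths are relative, the script depends on its working directory containing the input file, which is exactly what becomes a problem once it runs inside a container.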
Next, we’ll package it as a Docker image:
Note: Outside the very specific topic under discussion, the Dockerfiles in this article are not examples of best practices, since the added complexity would obscure the main point of the article.
FROM python:3.8-slim-buster
COPY script.py /
RUN chmod 666 /script.py
ENTRYPOINT ["python", "/script.py"]
Which we can build:
$ docker build -t itamarst/dataprocessor .
Docker problem #1: filesystem isolation
Here’s where the problems start. Docker was initially designed for ephemeral servers that could be scaled up and down on different machines. So by default Docker tries to isolate the running container as much as possible.
First, containers have an isolated filesystem. That means the script won’t have any access to the host filesystem by default:
$ rm output.txt
$ docker run itamarst/dataprocessor input.txt output.txt
Traceback (most recent call last):
  File "/script.py", line 3, in <module>
    with open(sys.argv[1]) as reader:
FileNotFoundError: [Errno 2] No such file or directory: 'input.txt'
So if we want to be able to run the Docker image the same way we ran the script, we need to mount the local directory as a volume, and change the working directory to be that mounted volume:
$ docker run -v $PWD:/data -w /data itamarst/dataprocessor \
    input.txt output.txt
Wrote all data to output.txt
Docker problem #2: User IDs
The other thing that Docker isolates by default is user IDs. The container runs with a different user ID than the process that launches it, in this case as root.
And that means the output file is owned by root:
$ ls -l output.txt
-rw-r--r-- 1 root root 24 Mar 11 12:35 output.txt
This is not what we want.
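If you want a pipeline to catch this situation rather than discover it later, the ownership check is easy to express in Python. This is a small sketch rather than part of the original workflow; `owned_by_current_user` is a hypothetical helper name, and it relies on the POSIX-only `os.getuid()`.

```python
import os


def owned_by_current_user(path):
    """Return True if the file's owner UID matches the UID of this process."""
    return os.stat(path).st_uid == os.getuid()
```

Run as a normal user against the root-owned output.txt above, `owned_by_current_user("output.txt")` would return False.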
What we can do is run the container as a user id and group id that match the current user:
$ rm output.txt
rm: remove write-protected regular file 'output.txt'? y
$ docker run -u "$(id -u):$(id -g)" \
    -v $PWD:/data -w /data itamarst/dataprocessor \
    input.txt output.txt
Wrote all data to output.txt
$ ls -l output.txt
-rw-r--r-- 1 itamarst itamarst 24 Mar 11 12:38 output.txt
As you can see, these isolation problems are solvable, but the workaround is annoying and somewhat fragile too. What if the process wanted to read some config file from your home directory?
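One way to make the long command line less error-prone is to wrap it in a small helper. The sketch below assembles the same `-u`, `-v`, and `-w` flags used above from the current process's UID, GID, and working directory. The names `docker_run_command` and `run_in_docker` are hypothetical, and the `/data` mount point simply mirrors the earlier example.

```python
import os
import subprocess


def docker_run_command(image, args, workdir=None):
    """Build a `docker run` argv that mounts workdir at /data and runs as the current user."""
    workdir = workdir or os.getcwd()
    return [
        "docker", "run",
        "-u", f"{os.getuid()}:{os.getgid()}",  # match the host user and group
        "-v", f"{workdir}:/data",              # expose the host directory
        "-w", "/data",                         # so relative paths resolve inside it
        image, *args,
    ]


def run_in_docker(image, args):
    """Run the container and raise CalledProcessError if it exits non-zero."""
    subprocess.run(docker_run_command(image, args), check=True)
```

For example, `run_in_docker("itamarst/dataprocessor", ["input.txt", "output.txt"])` would issue the same command as the session above. It does nothing for the home-directory case, which would need yet another mount.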
Another approach: Singularity
Singularity is a container runtime, like Docker, but it starts from a very different place. It favors integration rather than isolation, while still preserving security restrictions on the container, and providing reproducible images.
Singularity has its own image format, but it can also load images from Docker registries:
$ singularity pull docker://itamarst/dataprocessor
...
INFO: Creating SIF file...
INFO: Build complete: dataprocessor_latest.sif
$ ls -l dataprocessor_latest.sif
-rwxr-xr-x 1 itamarst itamarst 60305408 Mar 11 11:41 dataprocessor_latest.sif
Here we see the first difference between Docker and Singularity: Docker images are stored off in the local image cache, and you’re expected to interact with them using the docker image command, e.g. docker image ls.
In contrast, Singularity images are just normal files on your filesystem. Now that we have a SIF file, we can run it:
$ rm output.txt
$ ./dataprocessor_latest.sif input.txt output.txt
Wrote all data to output.txt
$ ls -l output.txt
-rw-r--r-- 1 itamarst itamarst 24 Mar 11 12:44 output.txt
Notice that both problems we had with Docker are automatically solved by Singularity:
- The container had access to the host filesystem automatically (by default, Singularity mounts host directories such as $HOME, the current working directory, and /tmp).
- The container runs as the current user automatically.
Same results, less work.
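For a scripted batch run, the equivalent Python wrapper needs no user or mount flags at all. The sketch below assumes the `singularity run <image> [args...]` form of the CLI, which should behave like executing the SIF file directly. The helper names and the `.out` output suffix are hypothetical.

```python
import subprocess


def singularity_command(sif_path, args):
    """Build the argv for running a SIF image with the given arguments."""
    return ["singularity", "run", sif_path, *args]


def process_all(sif_path, input_paths):
    """Process each input file, writing results next to it with a .out suffix."""
    for input_path in input_paths:
        output_path = input_path + ".out"  # naming scheme is an assumption
        subprocess.run(
            singularity_command(sif_path, [input_path, output_path]),
            check=True,  # stop the batch if any run fails
        )
```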
Docker has a bigger ecosystem than Singularity: it has Mac and Windows integration, lots and lots of tools support it, and it’s even been re-implemented from scratch (Podman from Red Hat). Singularity is a less popular tool: at the moment, for example, it has beta Mac support but no Windows support.
If batch data processing is your thing, Singularity might prove a better tool than Docker.