Docker vs. Singularity for data processing: UIDs and filesystem access
When you’re processing data, reading in files and writing out the result, containers are a great way to ensure reproducible runs. You package up all the binaries and libraries necessary to process your data, and each run uses the same files.
But while Docker is the most well-known container system, it’s not necessarily the easiest to use for data processing. Filesystem access, including ensuring correct UIDs, can be annoying. The problem is that Docker was not designed for this use case.
Docker is not the only way to create and run containers, though. In this article I’ll compare it to Singularity, a container runtime that was explicitly designed for data processing.
A batch script to process data
As a starting point, let’s say we have a simple script that reads in a file, processes it somehow, and writes it out:
```
$ ls
input.txt  script.py
$ python script.py input.txt output.txt
Wrote all data to output.txt
$ ls -l output.txt
-rw-r--r-- 1 itamarst itamarst 24 Mar 11 11:41 output.txt
```
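The article never shows the contents of script.py; here's a minimal, hypothetical sketch of what it might look like. The real script could do any processing; this one just copies the input through unchanged.

```python
# Hypothetical sketch of script.py (not shown in the article).
# Real processing could be anything; here the input is copied unchanged.
import sys

def main(input_path, output_path):
    with open(input_path) as reader:
        data = reader.read()
    with open(output_path, "w") as writer:
        writer.write(data)
    print("Wrote all data to", output_path)

# script.py would end with:
#     main(sys.argv[1], sys.argv[2])
```

It would be invoked exactly as above: `python script.py input.txt output.txt`.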
Next, we’ll package it as a Docker image:
```
FROM python:3.8-slim-buster
COPY script.py /
RUN chmod 666 /script.py
ENTRYPOINT ["python", "/script.py"]
```
Note: Beyond the specific point each one demonstrates, the Dockerfiles in this article don't follow every packaging best practice; the added complexity would obscure the main point of the article.
Which we can build:
```
$ docker build -t itamarst/dataprocessor .
```
Docker problem #1: filesystem isolation
Here’s where the problems start. Docker was initially designed for ephemeral servers that could be scaled up and down across different machines. So by default, Docker tries to isolate the running container as much as possible.
First, containers have an isolated filesystem. That means the script won’t have any access to the host filesystem by default:
```
$ rm output.txt
$ docker run itamarst/dataprocessor input.txt output.txt
Traceback (most recent call last):
  File "/script.py", line 3, in <module>
    with open(sys.argv[1]) as reader:
FileNotFoundError: [Errno 2] No such file or directory: 'input.txt'
```
So if we want to be able to run the Docker image the same way we ran the script, we need to mount the local directory as a volume, and change the working directory to be that mounted volume:
```
$ docker run -v $PWD:/data -w /data itamarst/dataprocessor \
      input.txt output.txt
Wrote all data to output.txt
```
Docker problem #2: User IDs
The other thing Docker isolates by default is user IDs: the container runs with a different user ID than the process that launched it, in this case root.
On macOS and Windows this usually isn’t a problem, but on Linux the output file ends up owned by root:
```
$ ls -l output.txt
-rw-r--r-- 1 root root 24 Mar 11 12:35 output.txt
```
This is not what we want.
What we can do is run the container with a user ID and group ID that match the current user:
```
$ rm output.txt
rm: remove write-protected regular file 'output.txt'? y
$ docker run -u "$(id -u):$(id -g)" \
      -v $PWD:/data -w /data itamarst/dataprocessor \
      input.txt output.txt
Wrote all data to output.txt
$ ls -l output.txt
-rw-r--r-- 1 itamarst itamarst 24 Mar 11 12:38 output.txt
```
As you can see, these isolation problems are solvable, but the incantation is annoying to type and somewhat fragile. What if the process also needed to read a config file from your home directory?
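One way to tame the incantation is to wrap it in a small script. Here's a sketch (not from the article) that builds the `docker run` argument list in Python, assuming the `docker` CLI is installed and using the image name from above:

```python
import os
import subprocess

def docker_run_args(image, args, uid, gid, host_dir):
    """Build the docker run argv with the volume and UID flags filled in."""
    return [
        "docker", "run", "--rm",
        # Match the invoking user so output files aren't owned by root:
        "-u", f"{uid}:{gid}",
        # Mount host_dir into the container, and make it the working directory:
        "-v", f"{host_dir}:/data",
        "-w", "/data",
        image, *args,
    ]

def run_dataprocessor(*args):
    argv = docker_run_args(
        "itamarst/dataprocessor", args,
        os.getuid(), os.getgid(), os.getcwd(),
    )
    subprocess.run(argv, check=True)
```

With this, `run_dataprocessor("input.txt", "output.txt")` reproduces the command line above, but callers no longer have to remember the flags.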
Another approach: Singularity
Singularity is a container runtime, like Docker, but it starts from a very different place. It favors integration rather than isolation, while still preserving security restrictions on the container, and providing reproducible images.
Singularity has its own image format, but it can also load images from Docker registries:
```
$ singularity pull docker://itamarst/dataprocessor
...
INFO:    Creating SIF file...
INFO:    Build complete: dataprocessor_latest.sif
$ ls -l dataprocessor_latest.sif
-rwxr-xr-x 1 itamarst itamarst 60305408 Mar 11 11:41 dataprocessor_latest.sif
```
Here we see the first difference between Docker and Singularity: Docker images are stored off in a local image cache, and you’re expected to interact with them using the `docker image` command, e.g. `docker image ls`.
In contrast, Singularity images are just normal files on your filesystem. Now that we have a SIF file, we can run it:
```
$ rm output.txt
$ ./dataprocessor_latest.sif input.txt output.txt
Wrote all data to output.txt
$ ls -l output.txt
-rw-r--r-- 1 itamarst itamarst 24 Mar 11 12:44 output.txt
```
Notice that both problems we had with Docker are automatically solved by Singularity:
- The container has access to the host filesystem automatically (`$HOME`, `/tmp`, and the current working directory are mounted by default).
- The container runs as the current user automatically.
Same results, less work.
Docker has a bigger ecosystem than Singularity: it has Mac and Windows integration, lots and lots of tools support it, and it’s even been re-implemented from scratch (Podman, from Red Hat). Singularity is a less popular tool; at the moment, for example, its Mac support is in beta and it has no Windows support at all.
If batch data processing is your thing, Singularity might prove a better tool than Docker.