Fewer capabilities, more security: preventing Docker escalation attacks

One important part of running your container in production is locking it down, to reduce the chances of an attacker using it as a starting point to exploit your whole system. Containers are inherently less isolated than virtual machines, and so more effort is needed to secure them.

Doing this is actually pretty straightforward:

  1. Don’t run your container as root.
  2. Run your container with fewer capabilities.

Let’s see why and how.

Note: Outside any specific best practice being demonstrated, the Dockerfiles in this article are not examples of best practices, since the added complexity would obscure the main point of the article.

Don’t run as root

There are two reasons to avoid running as root, both security related.

First, it means your running process will have fewer privileges, so if your process is somehow remotely compromised, the attacker will have a harder time escaping the container.

For example, a CVE in February 2019 that allowed escalation to root on the host could explicitly be prevented by using “a low privileged user inside the container”.

Second, and this is a more subtle point, running as a non-root user means you won’t try to take actions that require extra permissions. And that means you can run your container with fewer “capabilities”, making it even more secure.

Let’s see what this means.

What exactly is a capability?

“Capabilities” in this context are a technical term: Linux capabilities give processes the ability to do some of the many privileged operations only root can do by default. For example, CAP_CHOWN allows a process to “make arbitrary changes to file UIDs and GIDs”.

By default Docker grants a whole bunch of capabilities to a container, but not all of them; running as root in a container isn’t quite as powerful as normal root.

I’ve created a little container that runs the getpcaps program that prints out a process’ capabilities:

FROM ubuntu:24.04
RUN apt-get update && apt-get install -y libcap2-bin iputils-ping
CMD ["/sbin/getpcaps", "1"]

And as you can see, a container run as root has many capabilities:

$ docker image build -t getpcaps .
$ docker container run --rm getpcaps
1: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap=ep

Why you also need to drop capabilities

Now, you would think that running as a non-root user would lose these capabilities, and that is the case… but there are caveats.

For example, ping is typically either setuid (the process becomes root when run) or, on more secure systems, is set to give its process the CAP_NET_RAW capability. In either case it means the subprocess has more capabilities than the parent process:

$ docker run --rm --user 1000 -it getpcaps /bin/bash
ubuntu@b9f18f57bbaa:/$ ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.041 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.021 ms
...

Now, imagine that ping had a bug that let the parent process take it over and inject arbitrary code. At that point the attacker would have regained capabilities the parent process had lost: all of them if ping is setuid, or CAP_NET_RAW if ping uses file capabilities.

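Why does this work? Because the so-called bounding set of capabilities is still populated. The bounding set limits which capabilities a process can gain when it executes a program, so setuid-root executables, and executables with file capabilities, can still get extra capabilities. We can see a process’ capabilities in /proc/<pid>/status (/proc/self/status for the current process), and decode them with the capsh tool: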

$ docker run --rm --user 1000 -it getpcaps /bin/bash
ubuntu@ca7f423638e7:/$ cat /proc/self/status | grep Cap
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
ubuntu@ca7f423638e7:/$ capsh --decode=00000000a80425fb
0x00000000a80425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
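
The bitmask that capsh decodes is simply one bit per capability, numbered in the order the kernel defines them. As a rough illustration of what the decoding involves, here is a minimal Python sketch; decode_capabilities and CAPABILITY_NAMES are hypothetical names made up for this example, and the table only covers the capabilities I know of at the time of writing.

```python
# Capability names indexed by bit number, following the numbering in the
# kernel's <linux/capability.h> (bit 0 is CAP_CHOWN).
CAPABILITY_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid",
    "cap_setpcap", "cap_linux_immutable", "cap_net_bind_service",
    "cap_net_broadcast", "cap_net_admin", "cap_net_raw", "cap_ipc_lock",
    "cap_ipc_owner", "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot",
    "cap_sys_ptrace", "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot",
    "cap_sys_nice", "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config",
    "cap_mknod", "cap_lease", "cap_audit_write", "cap_audit_control",
    "cap_setfcap", "cap_mac_override", "cap_mac_admin", "cap_syslog",
    "cap_wake_alarm", "cap_block_suspend", "cap_audit_read", "cap_perfmon",
    "cap_bpf", "cap_checkpoint_restore",
]


def decode_capabilities(hex_mask):
    """Return the names of the capabilities set in a hex bitmask.

    hex_mask is a string such as the CapBnd value in /proc/<pid>/status.
    """
    mask = int(hex_mask, 16)
    names = []
    bit = 0
    while mask >> bit:  # stop once no higher bits are set
        if (mask >> bit) & 1:
            if bit < len(CAPABILITY_NAMES):
                names.append(CAPABILITY_NAMES[bit])
            else:
                names.append(str(bit))  # bit not in this sketch's table
        bit += 1
    return names


# The CapBnd value from the session above.
print(",".join(decode_capabilities("00000000a80425fb")))
```

For the CapBnd value shown above, this prints the same fourteen names that capsh reported.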

Dropping capabilities

So in addition to not running as root, you also want to explicitly drop capabilities completely, so they can’t be “inherited” by launching more capable executables. In some environments you won’t have this level of control, but if you’re using Docker directly, or using Kubernetes, you can explicitly add or drop capabilities.

For example, for many programs we can drop all capabilities:

$ docker run --rm --user 1000 -it --cap-drop ALL getpcaps /bin/bash
ubuntu@2e016758ec20:/$ ping 127.0.0.1
bash: /usr/bin/ping: Operation not permitted

Now the setuid ping binary is insufficient to get those extra capabilities. Some ping implementations will actually function even without capabilities, but that’s not really relevant to our concerns; the key point is that dropping capabilities reduces security risks.
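
If you launch containers from a script rather than by hand, the “drop everything, then add back only what you need” pattern can be captured in a small helper. This is a minimal Python sketch, not a recommended tool: docker_run_command and its parameters are hypothetical names invented for this example, while --rm, --user, --cap-drop and --cap-add are the Docker CLI flags used in this article or their counterparts.

```python
import shlex


def docker_run_command(image, user, add_caps=(), extra_args=()):
    """Build a `docker run` argument list that drops all capabilities.

    Capabilities listed in add_caps are added back one by one, so the
    default is a container with no capabilities at all.
    """
    argv = ["docker", "run", "--rm", "--user", str(user), "--cap-drop", "ALL"]
    for cap in add_caps:
        argv += ["--cap-add", cap]
    argv.append(image)
    argv += list(extra_args)  # command to run inside the container, if any
    return argv


# Print the command line instead of running it, so this works without Docker.
print(shlex.join(docker_run_command("getpcaps", 1000, extra_args=["/bin/bash"])))
```

If a program really did need one capability, passing something like add_caps=["NET_BIND_SERVICE"] would add back just that one while everything else stays dropped. The resulting list can be passed to subprocess.run().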

The correct way to run as a non-root user

One way you can run your container as a non-root user is to use su or some variant to change users. The problem with that is that you start out running as root, and then execute an operation (changing user IDs) that requires the CAP_SETUID capability. So you can’t drop all capabilities if you do that.

What you want is your container running as a non-root user from the start. You can do that in your runtime configuration, but then you have to remember to do that.

So an even better solution is adding a new user when building the image, and using the Dockerfile USER command to change the user you run as. You won’t be able to bind to ports <1024, but that’s a good thing: that’s another capability (CAP_NET_BIND_SERVICE) you don’t need. And since your container is pretty much always behind a proxy of some sort, that’s fine; the external proxy can listen on port 443 for you.

Here’s a simple Dockerfile demonstrating how this works:

FROM ubuntu:24.04
RUN useradd --create-home appuser
WORKDIR /home/appuser
USER appuser
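
Since the capability settings live in runtime configuration that is easy to forget, one option is for the application to check its own privileges at startup. The following is a minimal Python sketch under that assumption; parse_capability_sets, lockdown_problems and current_process_problems are hypothetical names for this example, and the only inputs are the effective user ID and the Cap* lines of /proc/self/status shown earlier.

```python
import os


def parse_capability_sets(status_text):
    """Extract the Cap* lines of /proc/<pid>/status as {name: int}."""
    sets = {}
    for line in status_text.splitlines():
        if line.startswith("Cap"):
            name, _, value = line.partition(":")
            sets[name] = int(value.strip(), 16)
    return sets


def lockdown_problems(euid, status_text):
    """Return a list of reasons the process is not fully locked down.

    An empty list means: not root, and every capability set is empty,
    including the bounding set that limits what child processes can gain.
    """
    problems = []
    if euid == 0:
        problems.append("running as root")
    caps = parse_capability_sets(status_text)
    for name in ("CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb"):
        if caps.get(name, 0) != 0:
            problems.append(f"{name} is not empty")
    return problems


def current_process_problems():
    """Run the check against the current process (Linux only)."""
    with open("/proc/self/status") as f:
        return lockdown_problems(os.geteuid(), f.read())
```

An application could call current_process_problems() early in startup and log a warning, or refuse to start, if the list is not empty. Whether failing hard is appropriate depends on your deployment, so treat this as a starting point rather than a rule.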

Takeaways

To have a more secure container:

  1. Run as a non-root user, using the Dockerfile’s USER command.
  2. Drop as many Linux capabilities as you can (ideally all of them) when you run your container.