Fewer capabilities, more security: minimizing privilege escalation in Docker

One important part of running your container in production is locking it down, to reduce the chances of an attacker using it as a starting point to exploit your whole system. Containers are inherently less isolated than virtual machines, and so more effort is needed to secure them.

Doing this is actually pretty straightforward:

  1. Don’t run your container as root.
  2. Run your container with fewer capabilities.

Let’s see why and how.

Note: Outside the specific topic under discussion, the Dockerfiles in this article are not examples of best practices, since the added complexity would obscure the main point of the article.

Don’t run as root

There are two reasons to avoid running as root, both security related.

First, it means your running process will have fewer privileges, so if your process is somehow remotely compromised, the attacker will have a harder time escaping the container.

For example, a CVE in February 2019 that allowed escalation to root on the host was explicitly preventable by “a low privileged user inside the container”.

Second, and this is a more subtle point, running as a non-root user means you won’t try to take actions that require extra permissions. And that means you can run your container with fewer “capabilities”, making it even more secure.

Let’s see what this means.

What exactly is a capability?

“Capabilities” in this context are a technical term: Linux capabilities give processes the ability to do some of the many privileged operations only root can do by default. For example, CAP_CHOWN allows a process to “make arbitrary changes to file UIDs and GIDs”.

By default Docker grants a whole bunch of capabilities to a container, but not all of them; running as root in a container isn’t quite as powerful as normal root.

I’ve created a little container that runs the getpcaps program, which prints out a process’s capabilities:

FROM ubuntu:18.04
RUN apt-get update && apt-get install -y libcap2-bin inetutils-ping
CMD ["/sbin/getpcaps", "1"]

Once the image is built and tagged as getpcaps (for example with docker build -t getpcaps .), you can see that a container run as root has many capabilities:

$ docker run --rm getpcaps
Capabilities for '1': = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap+eip
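The kernel also records each capability set as a hexadecimal bitmask, in the CapInh, CapPrm, CapEff and CapBnd lines of /proc/PID/status. Here’s a minimal sketch of decoding such a mask back into names using plain shell. The decode_caps helper is a hypothetical name of my own, not part of libcap, and the name table assumes the bit numbering from the kernel’s linux/capability.h header up to CAP_CHECKPOINT_RESTORE; any higher bits are ignored.

```shell
# Capability names in bit order (bit 0 first), per linux/capability.h.
CAP_NAMES="chown dac_override dac_read_search fowner fsetid kill setgid setuid
setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw
ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct
sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease
audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm
block_suspend audit_read perfmon bpf checkpoint_restore"

# decode_caps HEXMASK: print the capabilities set in HEXMASK, comma-separated.
# (Hypothetical helper; HEXMASK is given without a leading 0x.)
decode_caps() {
    mask=$((0x$1))
    bit=0
    out=""
    for name in $CAP_NAMES; do
        # Test one bit of the mask at a time.
        if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            out="${out:+$out,}cap_$name"
        fi
        bit=$((bit + 1))
    done
    printf '%s\n' "$out"
}

# Example: decode the effective set of the current shell (Linux only).
if [ -r /proc/self/status ]; then
    decode_caps "$(awk '/^CapEff:/ { print $2 }' /proc/self/status)"
fi
```

For instance, decode_caps 00000000a80425fb prints the same fourteen capabilities as the getpcaps output above.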

Why you also need to drop capabilities

Now, you would think that running as a non-root user would lose these capabilities, and that is the case… but there are caveats:

$ docker run --rm --user 1000 getpcaps
Capabilities for '1': = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap+i

Notice that the non-root user (with uid 1000) has the same list of capabilities, but with “+i” (inheritable) at the end instead of “+eip” (effective, permitted, inheritable). That means the process doesn’t have access to these capabilities by default, but a child process can get them back by running an executable that can escalate permissions, e.g. via setuid.
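You can also check this from inside a container by reading the kernel’s own record of the sets. This is a minimal sketch, assuming a Linux /proc filesystem. Both cap_sets and has_effective_caps are hypothetical helper names, and the exact masks you see will depend on your Docker version and the flags you pass.

```shell
# cap_sets STATUS_FILE: print the inheritable, permitted, effective and
# bounding capability masks recorded in a /proc/PID/status file.
cap_sets() {
    awk '/^Cap(Inh|Prm|Eff|Bnd):/ { print $1, $2 }' "$1"
}

# has_effective_caps STATUS_FILE: succeed only if at least one effective
# capability is set, i.e. the process can use a privilege right now.
has_effective_caps() {
    eff=$(awk '/^CapEff:/ { print $2 }' "$1")
    [ -n "$eff" ] && [ "$((0x$eff))" -ne 0 ]
}
```

Running cap_sets /proc/1/status in the container prints one line per set. A CapEff mask of all zeros means the process currently holds no effective capabilities, whatever the other sets contain.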

For example, ping is typically either setuid (the process becomes root when run) or, on more secure systems, is set to give its process the CAP_NET_RAW capability. In either case it means the subprocess has more capabilities than the parent process:

$ docker run --rm --user 1000 -it getpcaps /bin/bash
I have no name!@9ce0e2e21c21:/$ ping 192.168.7.1
PING 192.168.7.1 (192.168.7.1): 56 data bytes
64 bytes from 192.168.7.1: icmp_seq=0 ttl=63 time=0.675 ms
64 bytes from 192.168.7.1: icmp_seq=1 ttl=63 time=25.509 ms

Now, imagine that ping had a bug that let the parent process take it over and inject arbitrary code. At that point the user would have regained all those lost capabilities.
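To see which of the two mechanisms a given executable relies on, you can check for the setuid bit and for file capabilities. A minimal sketch: escalation_type is a hypothetical helper name, and the file-capabilities check assumes getcap (from the libcap2-bin package installed in the image above) is on the PATH. Without it, only setuid is detected.

```shell
# escalation_type PATH: report how executing PATH could raise privileges.
escalation_type() {
    if [ -u "$1" ]; then
        # setuid bit set: the program runs as the file's owner (root for ping).
        echo "setuid"
    elif command -v getcap >/dev/null 2>&1 &&
         [ -n "$(getcap "$1" 2>/dev/null)" ]; then
        # getcap printed something, so capabilities are attached to the file.
        echo "file-capabilities"
    else
        echo "none"
    fi
}

# Example: escalation_type "$(command -v ping)"
```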

Dropping capabilities

So in addition to not running as root, you also want to explicitly drop capabilities completely, so they can’t be “inherited” by launching more capable executables. In some environments you won’t have this level of control, but if you’re using Docker directly, or using Kubernetes, you can explicitly add or drop capabilities.
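In Kubernetes, the equivalent settings live in the container’s securityContext. Here’s a minimal sketch of a Pod manifest written out from the shell. The Pod name, image name and file name are placeholders, and the user ID 1000 simply mirrors the earlier examples.

```shell
# Write a Pod manifest that runs as a non-root user with no capabilities.
cat > pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: example-app            # placeholder name
spec:
  containers:
    - name: app
      image: example/app:1.0   # placeholder image
      securityContext:
        runAsUser: 1000        # start as a non-root user
        capabilities:
          drop: ["ALL"]        # same effect as docker run --cap-drop ALL
EOF

# Then, with access to a cluster: kubectl apply -f pod.yaml
```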

For example, for many programs we can drop all capabilities:

$ docker run --rm --user 1000 -it --cap-drop ALL getpcaps /bin/bash
I have no name!@aacfefe4cc3a:/$ ping 192.168.7.1
ping: Lacking privilege for raw socket.

Now the setuid ping binary is insufficient to get those extra capabilities.
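If a program genuinely needs one or two capabilities, you can still drop everything and add back only those with Docker’s --cap-add flag. A small sketch: cap_flags is a hypothetical helper that assembles the flags. It is not something Docker provides.

```shell
# cap_flags [CAP...]: print docker run flags that drop every capability
# and add back only the ones named on the command line.
cap_flags() {
    flags="--cap-drop ALL"
    for cap in "$@"; do
        flags="$flags --cap-add $cap"
    done
    printf '%s\n' "$flags"
}

# Example (needs Docker and the getpcaps image from earlier):
#   docker run --rm $(cap_flags NET_RAW) getpcaps
```

Starting from an empty set and adding capabilities back keeps the list of granted privileges explicit, so anything you forget fails closed.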

The correct way to run as a non-root user

One way you can run your container as a non-root user is to use su or some variant to change users. The problem with that is that you start out running as root, and then execute an operation (changing user IDs) that requires the CAP_SETUID capability. So you can’t drop all capabilities if you do that.

What you want is your container running as a non-root user from the start. You can do that in your runtime configuration, but then you have to remember to do that.

So an even better solution is adding a new user when building the image, and using the Dockerfile USER command to change the user you run as. You won’t be able to bind to ports <1024, but that’s a good thing—that’s another capability (CAP_NET_BIND_SERVICE) you don’t need. And since your container is pretty much always behind a proxy of some sort, that’s fine, the external proxy can listen on port 443 for you.

Here’s a simple Dockerfile demonstrating how this works:

FROM ubuntu:18.04
RUN useradd --create-home appuser
WORKDIR /home/appuser
USER appuser
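Since it’s easy to forget this step in a new Dockerfile, it can be worth checking for it in your build pipeline. A minimal sketch, assuming a single-stage Dockerfile: final_user is a hypothetical helper that reports the last USER instruction. It reports root when there is none, which is only right if the base image doesn’t set a user itself.

```shell
# final_user DOCKERFILE: print the argument of the last USER instruction,
# or "root" if the file contains none.
final_user() {
    awk 'toupper($1) == "USER" { user = $2 }
         END { print (user == "" ? "root" : user) }' "$1"
}

# Example: fail a build script when the image would run as root.
#   [ "$(final_user Dockerfile)" != "root" ] || exit 1
```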

Takeaways

To have a more secure container:

  1. Run as a non-root user, using the Dockerfile’s USER command.
  2. Drop as many Linux capabilities as you can (ideally all of them) when you run your container.

To learn more about the subject, check out this site.

