Less capabilities, more security: preventing Docker escalation attacks
One important part of running your container in production is locking it down, to reduce the chances of an attacker using it as a starting point to exploit your whole system. Containers are inherently less isolated than virtual machines, and so more effort is needed to secure them.
Doing this is actually pretty straightforward:
- Don’t run your container as root.
- Run your container with fewer capabilities.
Let’s see why and how.
Note: Outside any specific best practice being demonstrated, the Dockerfiles in this article are not examples of best practices, since the added complexity would obscure the main point of the article.
Don’t run as root
There are two reasons to avoid running as root, both security related.
First, it means your running process will have fewer privileges, so if your process is somehow remotely compromised, the attacker will have a harder time escaping the container.
For example, a CVE in February 2019 that allowed escalation to root on the host was explicitly preventable by “a low privileged user inside the container”.
Second, and this is a more subtle point, running as a non-root user means you won’t try to take actions that require extra permissions.
And that means you can run your container with fewer “capabilities”, making it even more secure.
Let’s see what this means.
What exactly is a capability?
“Capabilities” in this context is a technical term: Linux capabilities give processes the ability to do some of the many privileged operations only root can do by default.
For example, CAP_CHOWN allows a process to “make arbitrary changes to file UIDs and GIDs”.
By default Docker grants a whole bunch of capabilities to a container, but not all of them; running as root in a container isn’t quite as powerful as normal root.
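To make the idea of a capability set concrete, here is a minimal Python sketch that decodes a capability bitmask, such as the CapEff value Linux exposes in /proc/<pid>/status, into capability names. It is an illustration, not part of the container setup used in this article: the helper name decode_caps is made up, the name table follows the bit numbering in linux/capability.h and only covers the first 32 bits, and the sample mask 0xa80425fb is assumed to be what a default root container receives.

```python
# Capability names indexed by bit number, following <linux/capability.h>.
# Only bits 0-31 are listed; higher bits are reported by number.
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid",
    "cap_setpcap", "cap_linux_immutable", "cap_net_bind_service",
    "cap_net_broadcast", "cap_net_admin", "cap_net_raw", "cap_ipc_lock",
    "cap_ipc_owner", "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot",
    "cap_sys_ptrace", "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot",
    "cap_sys_nice", "cap_sys_resource", "cap_sys_time",
    "cap_sys_tty_config", "cap_mknod", "cap_lease", "cap_audit_write",
    "cap_audit_control", "cap_setfcap",
]


def decode_caps(mask):
    """Return the names of the capabilities whose bits are set in mask."""
    names = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            if bit < len(CAP_NAMES):
                names.append(CAP_NAMES[bit])
            else:
                names.append("cap_%d" % bit)  # bit outside the table above
        bit += 1
    return names


if __name__ == "__main__":
    # Assumed sample value: the mask a default root container is given.
    print(",".join(decode_caps(0xA80425FB)))
```

Each capability is a single bit, so dropping one simply clears that bit in the process’ sets.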
I’ve created a little container that runs the getpcaps program, which prints out a process’ capabilities:
FROM ubuntu:18.04
RUN apt-get update && apt-get install -y libcap2-bin inetutils-ping
CMD ["/sbin/getpcaps", "1"]
And as you can see, a container run as root has many capabilities:
$ docker run --rm getpcaps
Capabilities for '1': = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap+eip
Why you also need to drop capabilities
Now, you would think that running as a non-root user would lose these capabilities, and that is the case… but there are caveats:
$ docker run --rm --user 1000 getpcaps
Capabilities for '1': = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap+i
Notice that the non-root user (with uid 1000) has the same list of capabilities, but with “+i” (inheritable) at the end instead of “+eip” (effective, permitted, inheritable).
That means the process doesn’t have access to these capabilities by default, but a child process can get them back by running an executable that can escalate permissions, e.g. via setuid.
For example, ping is typically either setuid (the process becomes root when run) or, on more secure systems, is set to give its process the CAP_NET_RAW capability. In either case the subprocess has more capabilities than the parent process:
$ docker run --rm --user 1000 -it getpcaps /bin/bash
I have no name!@9ce0e2e21c21:/$ ping 192.168.7.1
PING 192.168.7.1 (192.168.7.1): 56 data bytes
64 bytes from 192.168.7.1: icmp_seq=0 ttl=63 time=0.675 ms
64 bytes from 192.168.7.1: icmp_seq=1 ttl=63 time=25.509 ms
Now, imagine that ping had a bug that let the parent process take it over and inject arbitrary code. At that point the user would have regained all those lost capabilities.
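The difference between “+i” and “+eip” can also be inspected from inside a process, since Linux lists each capability set as a hex mask in /proc/<pid>/status. The following Python sketch parses those lines and reports which capabilities are inheritable but not currently effective. The function names are hypothetical, and SAMPLE_STATUS is a hand-written example constructed to resemble the “+i” case above, not output captured from a real container.

```python
# Hand-written sample resembling the "+i" case: capabilities are present in
# the inheritable set but absent from the permitted and effective sets.
# This is an assumption for illustration, not captured output.
SAMPLE_STATUS = """\
Name:\tgetpcaps
CapInh:\t00000000a80425fb
CapPrm:\t0000000000000000
CapEff:\t0000000000000000
"""


def parse_cap_sets(status_text):
    """Map each Cap* field in /proc/<pid>/status text to its integer mask."""
    sets = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key.startswith("Cap"):
            sets[key] = int(value.strip(), 16)
    return sets


def dormant_caps(sets):
    """Return the mask of bits that are inheritable but not effective."""
    return sets.get("CapInh", 0) & ~sets.get("CapEff", 0)


if __name__ == "__main__":
    sets = parse_cap_sets(SAMPLE_STATUS)
    print("inheritable but not effective: %#x" % dormant_caps(sets))
```

On a Linux system, passing the contents of /proc/self/status instead of the sample would report the sets of the current process.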
Dropping capabilities
So in addition to not running as root, you also want to explicitly drop capabilities completely, so they can’t be “inherited” by launching more capable executables.
In some environments you won’t have this level of control, but if you’re using Docker directly, or using Kubernetes, you can explicitly add or drop capabilities.
For example, for many programs we can drop all capabilities:
$ docker run --rm --user 1000 -it --cap-drop ALL getpcaps /bin/bash
I have no name!@aacfefe4cc3a:/$ ping 192.168.7.1
ping: Lacking privilege for raw socket.
Now the setuid ping binary is insufficient to get those extra capabilities.
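If you launch containers from a script rather than by hand, one way to avoid forgetting these flags is to build the command line in one place. The sketch below is a hypothetical helper (docker_run_args is not a Docker API) that assembles a docker run invocation using the same --user and --cap-drop flags shown above, plus --cap-add for the cases where a program really does need one capability back. The defaults of uid 1000 and dropping ALL are assumptions chosen to match this article.

```python
import shlex


def docker_run_args(image, command=(), user="1000",
                    cap_drop=("ALL",), cap_add=()):
    """Build an argv list for `docker run` with a non-root user and
    capabilities dropped by default. Hypothetical helper for illustration."""
    args = ["docker", "run", "--rm", "--user", str(user)]
    for cap in cap_drop:
        args += ["--cap-drop", cap]
    for cap in cap_add:
        args += ["--cap-add", cap]
    args.append(image)
    args.extend(command)
    return args


if __name__ == "__main__":
    # Print the command instead of running it, so no Docker daemon is needed.
    print(shlex.join(docker_run_args("getpcaps", ["/bin/bash"])))
```

The resulting list can be passed to subprocess.run as is; an interactive shell like the one above would also need -it added.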
The correct way to run as a non-root user
One way you can run your container as a non-root user is to use su or some variant to change users.
The problem is that you start out running as root, and then execute an operation (changing user IDs) that requires the CAP_SETUID capability.
So you can’t drop all capabilities if you do that.
What you want is your container running as a non-root user from the start. You can do that in your runtime configuration, but then you have to remember to do that.
So an even better solution is adding a new user when building the image, and using the Dockerfile USER instruction to change the user you run as.
You won’t be able to bind to ports <1024, but that’s a good thing: that’s another capability (CAP_NET_BIND_SERVICE) you don’t need.
And since your container is pretty much always behind a proxy of some sort, that’s fine: the external proxy can listen on port 443 for you.
Here’s a simple Dockerfile demonstrating how this works:
FROM ubuntu:18.04
RUN useradd --create-home appuser
WORKDIR /home/appuser
USER appuser
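As an extra safety net, assuming the application in the image is written in Python, a startup check can refuse to run if the container was started as root anyway, for example because the user was overridden at runtime. This is a sketch of one possible guard; the function names are hypothetical and it is not something the Dockerfile above requires.

```python
import os
import sys


def running_as_root(euid=None):
    """Return True if the given (or current) effective user ID is root's."""
    if euid is None:
        euid = os.geteuid()  # available on Unix-like systems only
    return euid == 0


def refuse_root():
    """Exit with an error message if the process is running as root."""
    if running_as_root():
        sys.exit("Refusing to run as root; start the container as a non-root user.")


if __name__ == "__main__":
    refuse_root()
    print("Running as uid %d" % os.geteuid())
```

Calling refuse_root() at the top of the program’s entry point turns a forgotten USER setting into an immediate, visible failure.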
Takeaways
To have a more secure container:
- Run as a non-root user, using the Dockerfile’s USER command.
- Drop as many Linux capabilities as you can (ideally all of them) when you run your container.