Description
I've been flailing away at the idea of running a pool of rootless containers as children of a docker container. My intent is to have the docker container run a web server that spins up a pool of rootless child containers to which requests can be proxied. These children are meant to be isolated from each other and from the host system, to contain the side effects of running untrusted code.
I need to pass additional file descriptors to these children, which precludes running them as siblings via the host docker daemon. So here I am, and I hope I'm not overstepping my bounds by asking for guidance via an issue.
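For illustration, here is a minimal Python sketch of the kind of descriptor hand-off meant above: a parent keeps an extra file descriptor open in a child it spawns. The child is a hypothetical Python one-liner standing in for a container runtime, and the helper name `run_child_with_fd` is made up for this example. runc has a `--preserve-fds` option aimed at this use case, which is not exercised here.

```python
import os
import subprocess
import sys


def run_child_with_fd(message: bytes) -> bytes:
    """Hand a pipe's read end to a child process and return what the child read."""
    read_fd, write_fd = os.pipe()
    # Stand-in child: reads from the inherited descriptor number given in argv[1]
    # and echoes the bytes to its stdout.
    child_code = (
        "import os, sys; "
        "sys.stdout.buffer.write(os.read(int(sys.argv[1]), 1024))"
    )
    proc = subprocess.Popen(
        [sys.executable, "-c", child_code, str(read_fd)],
        pass_fds=(read_fd,),  # keep this descriptor open, same number, in the child
        stdout=subprocess.PIPE,
    )
    os.close(read_fd)  # the parent no longer needs its copy of the read end
    os.write(write_fd, message)
    os.close(write_fd)  # signals EOF to the child
    out, _ = proc.communicate()
    return out


if __name__ == "__main__":
    print(run_child_with_fd(b"hello from the parent"))
```

A sibling container started through the host docker daemon is not a child of the requesting process, so this kind of inheritance is not available there.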
Set up
Create a root filesystem tarball (docker export writes an uncompressed tar, despite the .tgz name used here):
$ docker export $(docker create alpine) > rootfs.tgz
Dockerfile with runc, libseccomp2 and the rootfs:
FROM buildpack-deps
RUN apt-get update && apt-get install -y --no-install-recommends \
libseccomp2 \
&& rm -rf /var/lib/apt/lists/*
ADD rootfs.tgz /child/rootfs
ADD runc /usr/local/sbin/runc
WORKDIR /child/rootfs
RUN runc spec --rootless
CMD ["runc", "run", "child"]
False starts:
Build and run the container, adding CAP_SYS_ADMIN:
$ docker run --rm -it --cap-add SYS_ADMIN $(docker build -q .)
container_linux.go:265: starting container process caused "process_linux.go:261: applying cgroup configuration for process caused \"mkdir /sys/fs/cgroup/cpuset/child: read-only file system\""
Same, but mount /sys/fs/cgroup as rw:
$ docker run --rm -it --cap-add SYS_ADMIN -v /sys/fs/cgroup:/sys/fs/cgroup:rw $(docker build -q .)
container_linux.go:265: starting container process caused "process_linux.go:339: container init caused \"could not create session key: operation not permitted\""
Same, but invoke runc with --no-new-keyring:
$ docker run --rm -it --cap-add SYS_ADMIN -v /sys/fs/cgroup:/sys/fs/cgroup:rw $(docker build -q .) runc run --no-new-keyring child
container_linux.go:265: starting container process caused "process_linux.go:339: container init caused \"rootfs_linux.go:104: jailing process inside rootfs caused \\\"pivot_root operation not permitted\\\"\""
Finally 'working':
Same, but also add --no-pivot:
$ docker run --rm -it --cap-add SYS_ADMIN -v /sys/fs/cgroup:/sys/fs/cgroup:rw $(docker build -q .) runc run --no-new-keyring --no-pivot child
/ #
Disclaimer: I'm still wrapping my head around the complexity and nuances of all the technologies we call 'containers', so please correct me if I'm wrong.
Skipping pivot_root seems like a bad idea given my objectives, so I created a copy of the default seccomp profile and added the pivot_root syscall to the big list of SCMP_ACT_ALLOW calls. This let me drop --no-pivot.
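As a rough illustration of that profile edit, here is a minimal Python sketch. It assumes the profile layout where each rule lists syscalls under a `names` array; some older profiles use a single `name` key per rule, so check the copy you actually have. The function name `allow_syscall` and the tiny sample profile are hypothetical, not taken from my real setup.

```python
import copy
import json


def allow_syscall(profile: dict, syscall: str) -> dict:
    """Return a copy of a seccomp profile with `syscall` allowed unconditionally."""
    updated = copy.deepcopy(profile)  # leave the caller's profile untouched
    rules = updated.setdefault("syscalls", [])
    for rule in rules:
        # Accept both the `names` (list) and the older `name` (string) layouts.
        names = rule.get("names") or [rule.get("name")]
        unconditional = not (
            rule.get("args") or rule.get("includes") or rule.get("excludes")
        )
        if (
            syscall in names
            and rule.get("action") == "SCMP_ACT_ALLOW"
            and unconditional
        ):
            return updated  # already allowed, nothing to add
    rules.append({"names": [syscall], "action": "SCMP_ACT_ALLOW", "args": []})
    return updated


if __name__ == "__main__":
    # Tiny stand-in; a real run would json.load() the copied default profile.
    sample = {
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {"names": ["read", "write"], "action": "SCMP_ACT_ALLOW", "args": []}
        ],
    }
    print(json.dumps(allow_syscall(sample, "pivot_root"), indent=2))
```

The resulting JSON can be written to a file and handed to docker with `--security-opt seccomp=<file>`.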
What kind of exposure am I creating by whitelisting the pivot_root syscall?
Also, I'm past my abilities in trying to figure out how I might avoid --no-new-keyring. What kind of exposure am I creating by using the --no-new-keyring flag?