|
1 | 1 | # Sysbox User Guide: Functional Limitations
|
2 | 2 |
|
3 | 3 | This document describes functional restrictions and limitations of Sysbox and
|
4 |
| -system containers. |
| 4 | +the containers created by it. |
5 | 5 |
|
6 | 6 | ## Contents
|
7 | 7 |
|
| 8 | +- [Sysbox Container Limitations](#sysbox-container-limitations) |
8 | 9 | - [Docker Restrictions](#docker-restrictions)
|
9 | 10 | - [Kubernetes Restrictions](#kubernetes-restrictions)
|
10 |
| -- [System Container Limitations](#system-container-limitations) |
11 | 11 | - [Sysbox Functional Limitations](#sysbox-functional-limitations)
|
12 | 12 |
|
13 |
| -## Docker Restrictions |
14 |
| - |
15 |
| -This section describes restrictions when launching containers with Docker + |
16 |
| -Sysbox. |
| 13 | +## Sysbox Container Limitations |
17 | 14 |
|
18 |
| -### Support for Docker's `--privileged` Option |
| 15 | +Sysbox enables containers to run applications or system software such as |
| 16 | +systemd, Docker, Kubernetes, K3s, etc., seamlessly & securely (e.g., no |
| 17 | +privileged containers, no complex setups). |
19 | 18 |
|
20 |
| -Sysbox system containers are incompatible with the Docker `--privileged` flag. |
| 19 | +While our goal is for Sysbox containers to run **any software that runs on |
| 20 | +bare-metal or VMs**, this is still a work-in-progress. |
21 | 21 |
|
22 |
| -The raison d'être for Sysbox is to avoid the use of (very insecure) privileged |
23 |
| -containers yet enable users to run any type of software inside the container. |
| 22 | +Thus, there are some limitations at this time. The table below describes these. |
24 | 23 |
|
25 |
| -Using the Docker `--privileged` + Sysbox will fail: |
| 24 | +| Limitation | Description | Affected Software | Planned Fix | |
| 25 | +| -------- | --------------- | ----- | :--: | |
| 26 | +| mknod | Fails with "operation not permitted". | Software that creates devices such as /dev/tun, /dev/tap, /dev/fuse, etc. | WIP | |
| 27 | +| binfmt-misc | Fails with "permission denied". | Software that uses /proc/sys/fs/binfmt_misc inside the container (e.g., buildx+QEMU for multi-arch builds). | WIP | |
| 28 | +| Nested user-namespace | `unshare -U --mount-proc` fails with "invalid argument". | Software that uses the Linux user-namespace (e.g., Docker + userns-remap). Note that the Sysbox container is rootless already, so this implies nesting Linux user-namespaces. | Yes | |
| 29 | +| Host device access | Host devices exposed to the container (e.g., `docker run --device ...`) show up with "nobody:nogroup" ownership. Thus, access to them will fail with "permission denied" unless the device grants read/write permissions to "others". | Software that needs access to hardware accelerators. | Yes | |
| 30 | +| rpc-pipefs | Mounting rpc-pipefs fails with "permission denied". | Running an NFS server inside the Sysbox container. | Yes | |
| 31 | +| Host sockets | Host sockets exposed to the container (e.g., `docker run -v /host/socket:/mount/point ...`) don't always work because Sysbox mounts shiftfs on them. | Software that requires host socket access (e.g., VNC). | Yes | |
| 32 | +| insmod | Fails with "operation not permitted". | Can't load kernel modules from inside containers. | TBD | |
26 | 33 |
|
27 |
| -```console |
28 |
| -$ docker run --runtime=sysbox-runc --privileged -it alpine |
29 |
| -docker: Error response from daemon: OCI runtime create failed: container_linux.go:364: starting container process caused "process_linux.go:533: container init caused \"rootfs_linux.go:67: setting up ptmx caused \\\"remove dev/ptmx: device or resource busy\\\"\"": unknown. |
30 |
| -ERRO[0000] error waiting for container: context canceled |
31 |
| -``` |
32 | 34 |
|
33 |
| -### Support for Docker's `--userns=host` Option |
| 35 | +**NOTES:** |
34 | 36 |
|
35 |
| -When Docker is configured in userns-remap mode, Docker offers the ability |
36 |
| -to disable that mode on a per container basis via the `--userns=host` |
37 |
| -option in the `docker run` and `docker create` commands. |
| 37 | +* "WIP" means the fix is being worked on right now. "TBD" means a |
| 38 | + decision is yet to be made. |
38 | 39 |
|
39 |
| -This option **does not work** with Sysbox (i.e., don't use |
40 |
| -`docker run --runtime=sysbox-runc --userns=host ...`). |
| 40 | +* If you find other software that fails inside the Sysbox container, please open |
| 41 | + a GitHub issue so we can add it to the list and work on a fix. |
41 | 42 |
|
42 |
| -Note that usage of this option is rare as it can lead to the problems as |
43 |
| -described [in this Docker article](https://docs.docker.com/engine/security/userns-remap/#disable-namespace-remapping-for-a-container). |
| 43 | +## Docker Restrictions |
44 | 44 |
|
45 |
| -### Support for Docker's `--pid=host` and `--network=host` Options |
| 45 | +This section describes restrictions when using Docker + Sysbox. |
46 | 46 |
|
47 |
| -System containers do not support sharing the pid or network namespaces |
48 |
| -with the host (as this is not secure and it's incompatible with the |
49 |
| -system container's user namespace). |
| 47 | +These restrictions are in place because they reduce or break container-to-host |
| 48 | +isolation, which is one of the key features of Sysbox. |
50 | 49 |
|
51 |
| -For example, when using Docker to launch system containers, the |
52 |
| -`docker run --pid=host` and `docker run --network=host` options |
53 |
| -do not work with system containers. |
| 50 | +Note that some of these options (e.g., `--privileged`) are typically needed when |
| 51 | +running complex workloads in containers. With Sysbox, they are no longer needed. |
54 | 52 |
|
55 |
| -### Support for Exposing Host Devices inside System Containers |
| 53 | +| Limitation | Description | Comment | |
| 54 | +| ---------- | --------------- | ------- | |
| 55 | +| docker --privileged | Does not work with Sysbox | Breaks container-to-host isolation. | |
| 56 | +| docker --userns=host | Does not work with Sysbox | Breaks container-to-host isolation. | |
| 57 | +| docker --pid=host | Does not work with Sysbox | Breaks container-to-host isolation. | |
| 58 | +| docker --net=host | Does not work with Sysbox | Breaks container-to-host isolation. | |
56 | 59 |
|
57 |
| -Sysbox does not currently support exposing host devices inside system |
58 |
| -containers (e.g., via the `docker run --device` option). |
59 | 60 |
|
60 | 61 | ## Kubernetes Restrictions
|
61 | 62 |
|
62 |
| -This section describes restrictions when launching containers with Kubernetes + |
63 |
| -Sysbox. |
64 |
| - |
65 |
| -### Pods limited to 16 per-node on Sysbox-CE |
66 |
| - |
67 |
| -Pods launched with the Sysbox Community Edition are limited to **16 pods per worker node**. |
68 |
| - |
69 |
| -Once this limit is reached, new pods scheduled on the node will remain in the |
70 |
| -"ContainerCreating" state. Such pods need to be terminated and re-created once |
71 |
| -there is sufficient capacity on the node. |
72 |
| - |
73 |
| -#### ** --- Sysbox-EE Feature Highlight --- ** |
74 |
| - |
75 |
| -With Sysbox Enterprise (Sysbox-EE) this limitation is removed, as it's designed |
76 |
| -for greater scalability. Thus, you can launch as many pods as will fit on the |
77 |
| -Kubernetes node, allowing you to get the best utilization of the hardware. |
78 |
| - |
79 |
| -Note that the number of pods that can be deployed on a node depends on many |
80 |
| -factors such as the number of CPUs on the node, the memory size on the node, the |
81 |
| -amount of storage, the type of workloads running in the pods, resource |
82 |
| -limits on the pod, etc. |
83 |
| - |
84 |
| -### Privileged pods are not allowed |
85 |
| - |
86 |
| -The pod's security context must not have the `privileged: true` attribute. |
87 |
| - |
88 |
| -The raison d'être for Sysbox is to avoid the use of (very insecure) privileged |
89 |
| -containers yet enable users to run any type of software inside the container. |
90 |
| - |
91 |
| -### Sharing Linux Namespaces with the Host is not allowed |
| 63 | +This section describes restrictions when using Kubernetes + Sysbox. |
92 | 64 |
|
93 |
| -The pod's spec must not share Linux namespaces with the host, as this breaks |
94 |
| -container isolation. Thus avoid setting these in the pod's spec: |
| 65 | +Some of these restrictions are in place because they reduce or break |
| 66 | +container-to-host isolation, which is one of the key features of Sysbox. |
95 | 67 |
|
96 |
| -```yaml |
97 |
| -hostNetwork: true |
98 |
| -hostIPC: true |
99 |
| -hostPID: true |
100 |
| -``` |
| 68 | +Note that some of these options (e.g., `privileged: true`) are typically needed when |
| 69 | +running complex workloads in pods. With Sysbox, they are no longer needed. |
101 | 70 |
|
102 |
| -## System Container Limitations |
103 |
| -
|
104 |
| -This section describes limitations for software running inside a system |
105 |
| -container. |
106 |
| -
|
107 |
| -### Creating User Namespaces inside a System Container |
108 |
| -
|
109 |
| -System containers do not currently support creating a user-namespace |
110 |
| -inside the system container and mounting procfs in it. |
111 |
| -
|
112 |
| -That is, executing the following instruction inside a system container |
113 |
| -is not supported: |
114 |
| -
|
115 |
| - unshare -U -i -m -n -p -u -f --mount-proc -r bash |
116 |
| -
|
117 |
| -The reason this is not yet supported is that Sysbox is not currently |
118 |
| -capable of ensuring that the procfs mounted inside the unshared |
119 |
| -namespace is the proper one. We expect to fix this soon. |
| 71 | +| Limitation | Description | Comment | |
| 72 | +| ---------- | --------------- | ------- | |
| 73 | +| 16 pods /node | Sysbox-CE is limited to 16 pods per node | Sysbox Enterprise removes this limitation. | |
| 74 | +| privileged: true | Not supported in pod security context | Breaks container-to-host isolation. | |
| 75 | +| hostNetwork: true | Not supported in pod spec | Breaks container-to-host isolation. | |
| 76 | +| hostIPC: true | Not supported in pod spec | Breaks container-to-host isolation. | |
| 77 | +| hostPID: true | Not supported in pod spec | Breaks container-to-host isolation. | |
120 | 78 |
|
121 | 79 | ## Sysbox Functional Limitations
|
122 | 80 |
|
123 |
| -### Sysbox must run as root on the host |
124 |
| -
|
125 |
| -Sysbox must run with root privileges on the host system. It won't |
126 |
| -work if executed without root privileges. |
127 |
| -
|
128 |
| -Root privileges are necessary in order for Sysbox to interact with the Linux |
129 |
| -kernel in order to create the containers and perform many of the advanced |
130 |
| -functions it provides (e.g., procfs virtualization, sysfs virtualization, etc.) |
131 |
| -
|
132 |
| -### Checkpoint and Restore Support |
133 |
| -
|
134 |
| -Sysbox does not currently support checkpoint and restore of system containers. |
135 |
| -
|
136 |
| -### Sysbox Nesting |
137 |
| -
|
138 |
| -Sysbox must run at the host level (or within a privileged container if you must). |
| 81 | +| Limitation | Description | Planned Fix | |
| 82 | +| ---------- | --------------- | ------- | |
| 83 | +| Sysbox must run as root | Sysbox needs root privileges on the host to perform the advanced OS virtualization it provides (e.g., procfs/sysfs emulation, syscall trapping, etc.). | TBD | |
| 84 | +| Container Checkpoint/Restore | Not yet supported | Yes | |
| 85 | +| Sysbox Nesting | Running Sysbox inside a Sysbox container is not supported | TBD | |
139 | 86 |
|
140 |
| -Sysbox does not work when running inside a system container. This implies that |
141 |
| -we don't support running a system container inside a system container at this |
142 |
| -time. |