Signed-off-by: Jess Frazelle <acidburn@google.com>
Showing 1 changed file with 86 additions and 34 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,65 +1,117 @@ | ||
#Support "no new privileges" in Kubernetes | ||
# No New Privileges | ||
|
||
##Description | ||
- [Description](#description) | ||
* [Interactions with other Linux primitives](#interactions-with-other-linux-primitives) | ||
- [Current Implementations](#current-implementations) | ||
* [Support in Docker](#support-in-docker) | ||
* [Support in rkt](#support-in-rkt) | ||
* [Support in OCI runtimes](#support-in-oci-runtimes) | ||
- [Existing SecurityContext objects](#existing-securitycontext-objects) | ||
- [Changes of SecurityContext objects](#changes-of-securitycontext-objects) | ||
|
||
In Linux, the `execve` system call can grant more privileges to a newly-created process than its parent process. Considering security issues, since Linux kernel v3.5, there is a new flag named `no_new_privs` added to prevent those new privileges from being granted to the processes. | ||
## Description | ||
|
||
`no_new_privs` is inherited across `fork`, `clone` and `execve` and can not be unset. With `no_new_privs` set, `execve` promises not to grant the privilege to do anything that could not have been done without the `execve` call. | ||
In Linux, the `execve` system call can grant a newly created process more | ||
privileges than its parent process. To address the resulting security risk, | ||
Linux kernel v3.5 added a flag named `no_new_privs` that prevents those new | ||
privileges from being granted to the process. | ||
|
||
For more details about `no_new_privs`, please check the Linux kernel document [here](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt). | ||
[`no_new_privs`](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt) | ||
is inherited across `fork`, `clone` and `execve` and cannot be unset. With | ||
`no_new_privs` set, `execve` promises not to grant the privilege to do anything | ||
that could not have been done without the `execve` call. | ||
|
||
Docker started to support `no_new_privs` option since 1.11. Here is the [link](https://github.com/docker/docker/issues/20329) of the ticket in Docker community to support `no_new_privs` option. | ||
For more details about `no_new_privs`, please check the | ||
[Linux kernel documentation](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt). | ||
|
||
We want to support the creation of containers with `no_new_privs` enabled in Kubernetes, which will make Kubernetes clusters safer. Here is the [link](https://github.com/kubernetes/kubernetes/issues/38417) to the ticket in the Kubernetes community that tracks this proposal. | ||
This is different from `NOSUID` in that `no_new_privs` can give permission to | ||
the container process to further restrict child processes with seccomp. This | ||
permission goes only one way: the container process cannot grant more | ||
permissions, it can only restrict further. | ||
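
As a rough illustration of the flag described above, here is a minimal Go sketch that sets `no_new_privs` through the `prctl` system call and reads it back. It is Linux only. The helper name `setNoNewPrivs` is hypothetical, and the numeric constants are the `PR_SET_NO_NEW_PRIVS` and `PR_GET_NO_NEW_PRIVS` values from `<linux/prctl.h>`. It is not how any particular container runtime does this.

```go
package main

import (
	"fmt"
	"runtime"
	"syscall"
)

// Values from <linux/prctl.h>.
const (
	prSetNoNewPrivs = 38 // PR_SET_NO_NEW_PRIVS
	prGetNoNewPrivs = 39 // PR_GET_NO_NEW_PRIVS
)

// setNoNewPrivs (hypothetical helper) sets no_new_privs on the calling OS
// thread and then reads the flag back. The flag is a per-thread attribute, so
// the goroutine is pinned to its thread while both calls are made.
func setNoNewPrivs() (bool, error) {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	// prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); the flag cannot be unset later.
	if _, _, errno := syscall.Syscall6(syscall.SYS_PRCTL, prSetNoNewPrivs, 1, 0, 0, 0, 0); errno != 0 {
		return false, errno
	}
	// prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0) returns the current value.
	val, _, errno := syscall.Syscall6(syscall.SYS_PRCTL, prGetNoNewPrivs, 0, 0, 0, 0, 0)
	if errno != 0 {
		return false, errno
	}
	return val == 1, nil
}

func main() {
	set, err := setNoNewPrivs()
	if err != nil {
		fmt.Println("prctl failed:", err)
		return
	}
	fmt.Println("no_new_privs set:", set)
}
```

A container runtime would typically set the flag immediately before calling `execve` for the container process, so that the process and all of its descendants inherit it.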
|
||
### Interactions with other Linux primitives | ||
|
||
##Current implementation | ||
- suid binaries: will break when `no_new_privs` is enabled | ||
- seccomp2 as a non root user: requires `no_new_privs` | ||
- seccomp2 with dropped `CAP_SYS_ADMIN`: requires `no_new_privs` | ||
- ambient capabilities: requires `no_new_privs` | ||
- SELinux transitions: bugs that were fixed are documented [here](https://github.com/moby/moby/issues/23981#issuecomment-233121969) | ||
|
||
###Support in Docker | ||
|
||
Since Docker 1.11, user can specify `--security-opt` to enable `no_new_privs` while creating containers, e.g. `docker run --security-opt=no-new-privileges busybox` | ||
## Current Implementations | ||
|
||
For program client, Docker provides an object named `ContainerCreateConfig` defined in package `github.com/docker/engine-api/types` to config container creation parameters. In this object, there is a string array `HostConfig.SecurityOpt` to specify the security options. Client can utilize this field to specify the arguments for security options while creating new containers. | ||
### Support in Docker | ||
|
||
###Support in OCI runtimes | ||
Since Docker 1.11, a user can specify `--security-opt` to enable `no_new_privs` | ||
while creating containers, for example | ||
`docker run --security-opt=no-new-privileges busybox`. | ||
|
||
Since version 0.3.0 of the OCI runtime specification, a user can specify the `noNewPrivs` boolean flag in the configuration file. | ||
Docker's Go API provides an object named `ContainerCreateConfig` to | ||
configure container creation parameters. This object has a string array, | ||
`HostConfig.SecurityOpt`, that holds the security options. Clients can | ||
use this field to pass security option arguments when | ||
creating new containers. | ||
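
The following sketch shows one way a client might populate such a string array. The `hostConfig` struct is a hypothetical stand-in that models only the `SecurityOpt` field, so the example does not depend on the Docker packages. The helper `withNoNewPrivileges` is likewise an assumption for illustration. The option string `no-new-privileges` is the one accepted by `docker run --security-opt`.

```go
package main

import "fmt"

// hostConfig is a hypothetical stand-in for Docker's HostConfig type; only
// the SecurityOpt field mentioned in the text is modelled here.
type hostConfig struct {
	SecurityOpt []string
}

// noNewPrivilegesOpt is the option string accepted by `--security-opt`.
const noNewPrivilegesOpt = "no-new-privileges"

// withNoNewPrivileges (hypothetical helper) returns opts with the
// no-new-privileges option present exactly once.
func withNoNewPrivileges(opts []string) []string {
	for _, o := range opts {
		if o == noNewPrivilegesOpt {
			return opts // already requested, nothing to add
		}
	}
	return append(opts, noNewPrivilegesOpt)
}

func main() {
	hc := hostConfig{SecurityOpt: []string{"seccomp=unconfined"}}
	hc.SecurityOpt = withNoNewPrivileges(hc.SecurityOpt)
	fmt.Println(hc.SecurityOpt)
}
```

Because every option travels as a free-form string in one slice, callers have to format and parse these strings themselves.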
|
||
More details of OCI implementation can be checked [here](https://github.com/opencontainers/runtime-spec/pull/290). | ||
This field did not scale well for the Docker client, so it's suggested that | ||
Kubernetes does not follow that design. | ||
|
||
###SecurityContext in Kubernetes | ||
More details of the Docker implementation can be read | ||
[here](https://github.com/moby/moby/pull/20727) as well as the original | ||
discussion [here](https://github.com/moby/moby/issues/20329). | ||
|
||
Kubernetes defines `SecurityContext` for `Container` and `PodSecurityContext` for `PodSpec`. `SecurityContext` objects define the related security options for Kubernetes containers, e.g. selinux options. | ||
### Support in rkt | ||
|
||
While creating a container, kubelet parses the security context object and formats the security option strings for Docker. The security options strings will finally be inserted into `ContainerCreateConfig.HostConfig.SecurityOpt` and passed to Docker. Different Kubernetes runtimes now are using different methods to parse and format the security option strings: | ||
* method `#getSecurityOpts` in `docker_manager_xxxx.go` for Docker runtime | ||
* method `#getContainerSecurityOpts` in `docker_container.go` for CRI | ||
Since rkt v1.26.0, the `NoNewPrivileges` option has been enabled in rkt. | ||
|
||
More details of the rkt implementation can be read | ||
[here](https://github.com/rkt/rkt/pull/2677). | ||
|
||
##Proposal to support "no new privileges" | ||
### Support in OCI runtimes | ||
|
||
To support "no new privileges" options in Kubernetes, it is proposed to make the following changes: | ||
Since version 0.3.0 of the OCI runtime specification, a user can specify the | ||
`noNewPrivs` boolean flag in the configuration file. | ||
|
||
###Changes of SecurityContext objects | ||
More details of the OCI implementation can be read | ||
[here](https://github.com/opencontainers/runtime-spec/pull/290). | ||
|
||
Add a new bool type field named `noNewPrivileges` to both `SecurityContext` definition and `PodSecurityContext` definition: | ||
* `noNewPrivileges=true` in `PodSecurityContext` means that all the containers in the pod should be run with `no-new-privileges` enabled. This should be a pod level control of `no-new-privileges` flag. | ||
* `noNewPrivileges` in `SecurityContext` is a container level control of `no-new-privileges` flag, and can override the pod level `noNewPrivileges` setting. | ||
## Existing SecurityContext objects | ||
|
||
By default, `noNewPrivileges` is `false`. | ||
Kubernetes defines `SecurityContext` for `Container` and `PodSecurityContext` | ||
for `PodSpec`. `SecurityContext` objects define the related security options | ||
for Kubernetes containers, e.g. selinux options. | ||
|
||
Changing the security context API objects requires updating the corresponding Kubernetes documents; another PR needs to be submitted to track this. | ||
To support "no new privileges" options in Kubernetes, it is proposed to make | ||
the following changes: | ||
|
||
###Changes of docker runtime | ||
## Changes of SecurityContext objects | ||
|
||
When parsing the new `SecurityContext` object, kubelet has to take care of `noNewPrivileges` field from security context objects. Once `noNewPrivileges` is `true`, kubelet needs to change `#getSecurityOpts` method in `docker_manager_xxx.go` to add `no-new-privileges` option to `ContainerCreateConfig.HostConfig.SecurityOpt` | ||
Add a new bool type field named `allowPrivilegeEscalation` to the `SecurityContext` | ||
definition. | ||
|
||
###Changes of CRI runtime | ||
By default, `allowPrivilegeEscalation` will be `false` at the kubelet level | ||
with the following exceptions: | ||
|
||
When parsing the new `SecurityContext` object, kubelet has to take care of `noNewPrivileges` field from security context objects. Once `noNewPrivileges` is `true`, kubelet needs to change `#getContainerSecurityOpts` method in `docker_container.go` to add `no-new-privileges` option to `ContainerCreateConfig.HostConfig.SecurityOpt` | ||
- when a container is `privileged` | ||
- when `CAP_SYS_ADMIN` is added to a container | ||
- when a container is not run as root, uid `0` (to prevent breaking suid | ||
binaries) | ||
|
||
###Changes of kubectl | ||
When `allowPrivilegeEscalation` is set to `false`, `no_new_privs` is enabled | ||
for that container. | ||
|
||
This is an additional proposal for kubectl. To improve kubectl user experience, we can add a new flag for kubectl command named `--security-opt`. This flag allows user to create pod with security options configured when using `kubectl run` command. For example, if user issues command like `kubectl run busybox --image=busybox --security-opt=no-new-privileges -- top`, kubernetes shall create a pod with `noNewPrivileges` enabled. | ||
`allowPrivilegeEscalation` in `SecurityContext` provides container level | ||
control of the `no_new_privs` flag and can override the default | ||
`allowPrivilegeEscalation` setting. | ||
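
The defaulting rules above can be sketched as a small decision function. Everything here is an assumption made for illustration, not the kubelet's implementation: the `containerSecurity` struct is a hypothetical, trimmed-down stand-in for `SecurityContext`, the uid is assumed to be already resolved, and the capability is assumed to be spelled `CAP_SYS_ADMIN`.

```go
package main

import "fmt"

// containerSecurity is a hypothetical stand-in for the SecurityContext fields
// that the defaulting rules look at.
type containerSecurity struct {
	Privileged               bool
	AddedCapabilities        []string
	UID                      int64 // effective uid, assumed already resolved
	AllowPrivilegeEscalation *bool // nil means "not set by the user"
}

// defaultAllowPrivilegeEscalation applies the proposed default: false, except
// for privileged containers, containers that add CAP_SYS_ADMIN, and
// containers that do not run as uid 0.
func defaultAllowPrivilegeEscalation(sc containerSecurity) bool {
	if sc.Privileged {
		return true
	}
	for _, c := range sc.AddedCapabilities {
		if c == "CAP_SYS_ADMIN" {
			return true
		}
	}
	return sc.UID != 0
}

// noNewPrivs reports whether no_new_privs should be enabled for the
// container. An explicit allowPrivilegeEscalation overrides the default.
func noNewPrivs(sc containerSecurity) bool {
	allow := defaultAllowPrivilegeEscalation(sc)
	if sc.AllowPrivilegeEscalation != nil {
		allow = *sc.AllowPrivilegeEscalation
	}
	return !allow
}

func main() {
	deny := false
	fmt.Println(noNewPrivs(containerSecurity{UID: 0}))
	fmt.Println(noNewPrivs(containerSecurity{UID: 1000}))
	fmt.Println(noNewPrivs(containerSecurity{UID: 1000, AllowPrivilegeEscalation: &deny}))
}
```

Using a pointer for the explicit setting lets the sketch distinguish "not set" from an explicit `false`, which is what allows a user to opt a non-root container into `no_new_privs`.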
|
||
If the proposal of kubectl changes is accepted, the patch can also be submitted as a separate PR. | ||
This requires changes to the Docker, rkt, and CRI runtime integrations so that | ||
kubelet will add the specific `no_new_privs` option. | ||
|
||
### Thoughts on breaking suid binaries | ||
|
||
Ideally we would not set `allowPrivilegeEscalation` to true for uids that are | ||
not `0`. But since we do not want to break existing deployments, we should add | ||
a way to override the default behavior so users _can_ set `no_new_privs` for | ||
uids that are not `0`. | ||
|
||
TODO: should this be another option? or should `allowPrivilegeEscalation` not | ||
be a boolean? |