update no new privs proposal
Signed-off-by: Jess Frazelle <acidburn@google.com>
jessfraz committed May 19, 2017
1 parent 3bd5f9f commit 93c3a51
Showing 1 changed file with 86 additions and 34 deletions.
120 changes: 86 additions & 34 deletions contributors/design-proposals/no-new-privs.md
@@ -1,65 +1,117 @@
#Support "no new privileges" in Kubernetes
# No New Privileges

##Description
- [Description](#description)
* [Interactions with other Linux primitives](#interactions-with-other-linux-primitives)
- [Current Implementations](#current-implementations)
* [Support in Docker](#support-in-docker)
* [Support in rkt](#support-in-rkt)
* [Support in OCI runtimes](#support-in-oci-runtimes)
- [Existing SecurityContext objects](#existing-securitycontext-objects)
- [Changes of SecurityContext objects](#changes-of-securitycontext-objects)

In Linux, the `execve` system call can grant more privileges to a newly-created process than its parent process. Considering security issues, since Linux kernel v3.5, there is a new flag named `no_new_privs` added to prevent those new privileges from being granted to the processes.
## Description

`no_new_privs` is inherited across `fork`, `clone` and `execve` and can not be unset. With `no_new_privs` set, `execve` promises not to grant the privilege to do anything that could not have been done without the `execve` call.
In Linux, the `execve` system call can grant a newly started program more
privileges than its parent process had. To address the security issues this
raises, Linux kernel v3.5 added a flag named `no_new_privs` that prevents those
new privileges from being granted.

For more details about `no_new_privs`, please check the Linux kernel document [here](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt).
[`no_new_privs`](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt)
is inherited across `fork`, `clone` and `execve` and can not be unset. With
`no_new_privs` set, `execve` promises not to grant the privilege to do anything
that could not have been done without the `execve` call.

Docker started to support `no_new_privs` option since 1.11. Here is the [link](https://github.com/docker/docker/issues/20329) of the ticket in Docker community to support `no_new_privs` option.
For more details about `no_new_privs`, please check the
[Linux kernel documentation](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt).

We want to support the creation of containers with `no_new_privs` enabled in Kubernetes, which will make the Kubernetes cluster more safe. Here is the [link](https://github.com/kubernetes/kubernetes/issues/38417) of the ticket in Kubernetes community to track this proposal.
This is different from `NOSUID` in that `no_new_privs` can give permission to
the container process to further restrict child processes with seccomp. This
permission is one-way: the container process cannot grant more permissions,
it can only restrict them further.

### Interactions with other Linux primitives

##Current implementation
- suid binaries: will break when `no_new_privs` is enabled
- seccomp2 as a non root user: requires `no_new_privs`
- seccomp2 with dropped `CAP_SYS_ADMIN`: requires `no_new_privs`
- ambient capabilities: requires `no_new_privs`
- SELinux transitions: bugs that have since been fixed are documented [here](https://github.com/moby/moby/issues/23981#issuecomment-233121969)

###Support in Docker

Since Docker 1.11, user can specify `--security-opt` to enable `no_new_privs` while creating containers, e.g. `docker run --security-opt=no-new-privileges busybox`
## Current Implementations

For program client, Docker provides an object named `ContainerCreateConfig` defined in package `github.com/docker/engine-api/types` to config container creation parameters. In this object, there is a string array `HostConfig.SecurityOpt` to specify the security options. Client can utilize this field to specify the arguments for security options while creating new containers.
### Support in Docker

###Support in OCI runtimes
Since Docker 1.11, a user can specify `--security-opt` to enable `no_new_privs`
while creating containers, for example
`docker run --security-opt=no-new-privileges busybox`.

Since version 0.3.0 of the OCI runtime specification, a user can specify the `noNewPrivs` boolean flag in the configuration file.
Through its Go API, Docker provides an object named `ContainerCreateConfig` to
configure container creation parameters. This object has a string array,
`HostConfig.SecurityOpt`, that specifies the security options. Clients can
use this field to pass security option arguments when creating new
containers.
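To make the string-array design concrete, here is a minimal Go sketch of how a client might add the option to such a list. The `hostConfig` struct and the `withNoNewPrivileges` helper are hypothetical stand-ins written for this example, not the Docker API types; the only detail taken from Docker is the option string `no-new-privileges` accepted by `--security-opt`.

```go
package main

import "fmt"

// hostConfig is a hypothetical stand-in that mirrors only the SecurityOpt
// field discussed above; it is not the real Docker type.
type hostConfig struct {
	SecurityOpt []string
}

// withNoNewPrivileges (hypothetical helper) returns opts with the
// "no-new-privileges" option present exactly once.
func withNoNewPrivileges(opts []string) []string {
	const opt = "no-new-privileges"
	for _, o := range opts {
		if o == opt {
			return opts // already requested, nothing to add
		}
	}
	return append(opts, opt)
}

func main() {
	cfg := hostConfig{SecurityOpt: []string{"label=disable"}}
	cfg.SecurityOpt = withNoNewPrivileges(cfg.SecurityOpt)
	fmt.Println(cfg.SecurityOpt)
}
```

Every option travels as a free-form string, so callers have to know the exact spelling and check for duplicates themselves.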

More details of OCI implementation can be checked [here](https://github.com/opencontainers/runtime-spec/pull/290).
This field did not scale well for the Docker client, so it's suggested that
Kubernetes does not follow that design.

###SecurityContext in Kubernetes
More details of the Docker implementation can be read
[here](https://github.com/moby/moby/pull/20727) as well as the original
discussion [here](https://github.com/moby/moby/issues/20329).

Kubernetes defines `SecurityContext` for `Container` and `PodSecurityContext` for `PodSpec`. `SecurityContext` objects define the related security options for Kubernetes containers, e.g. selinux options.
### Support in rkt

While creating a container, kubelet parses the security context object and formats the security option strings for Docker. The security options strings will finally be inserted into `ContainerCreateConfig.HostConfig.SecurityOpt` and passed to Docker. Different Kubernetes runtimes now are using different methods to parse and format the security option strings:
* method `#getSecurityOpts` in `docker_manager_xxx.go` for Docker runtime
* method `#getContainerSecurityOpts` in `docker_container.go` for CRI
Since rkt v1.26.0, the `NoNewPrivileges` option has been enabled in rkt.

More details of the rkt implementation can be read
[here](https://github.com/rkt/rkt/pull/2677).

##Proposal to support "no new privileges"
### Support in OCI runtimes

To support "no new privileges" options in Kubernetes, it is proposed to make the following changes:
Since version 0.3.0 of the OCI runtime specification, a user can specify the
`noNewPrivs` boolean flag in the configuration file.

###Changes of SecurityContext objects
More details of the OCI implementation can be read
[here](https://github.com/opencontainers/runtime-spec/pull/290).

Add a new bool type field named `noNewPrivileges` to both `SecurityContext` definition and `PodSecurityContext` definition:
* `noNewPrivileges=true` in `PodSecurityContext` means that all the containers in the pod should be run with `no-new-privileges` enabled. This should be a pod level control of `no-new-privileges` flag.
* `noNewPrivileges` in `SecurityContext` is a container level control of `no-new-privileges` flag, and can override the pod level `noNewPrivileges` setting.
## Existing SecurityContext objects

By default, `noNewPrivileges` is `false`.
Kubernetes defines `SecurityContext` for `Container` and `PodSecurityContext`
for `PodSpec`. `SecurityContext` objects define the related security options
for Kubernetes containers, e.g. selinux options.

The change of security context API objects requires the update of corresponding Kubernetes documents, need to submit another PR to track this.
To support "no new privileges" options in Kubernetes, it is proposed to make
the following changes:

###Changes of docker runtime
## Changes of SecurityContext objects

When parsing the new `SecurityContext` object, kubelet has to take care of `noNewPrivileges` field from security context objects. Once `noNewPrivileges` is `true`, kubelet needs to change `#getSecurityOpts` method in `docker_manager_xxx.go` to add `no-new-privileges` option to `ContainerCreateConfig.HostConfig.SecurityOpt`
Add a new bool type field named `allowPrivilegeEscalation` to the `SecurityContext`
definition.

###Changes of CRI runtime
By default, `allowPrivilegeEscalation` will be `false` at the kubelet level
with the following exceptions:

When parsing the new `SecurityContext` object, kubelet has to take care of `noNewPrivileges` field from security context objects. Once `noNewPrivileges` is `true`, kubelet needs to change `#getContainerSecurityOpts` method in `docker_container.go` to add `no-new-privileges` option to `ContainerCreateConfig.HostConfig.SecurityOpt`
- when a container is `privileged`
- when `CAP_SYS_ADMIN` is added to a container
- when a container is not run as root, uid `0` (to prevent breaking suid
binaries)

###Changes of kubectl
When `allowPrivilegeEscalation` is set to false it will enable `no_new_privs`
for that container.

This is an additional proposal for kubectl. To improve kubectl user experience, we can add a new flag for kubectl command named `--security-opt`. This flag allows user to create pod with security options configured when using `kubectl run` command. For example, if user issues command like `kubectl run busybox --image=busybox --security-opt=no-new-privileges -- top`, kubernetes shall create a pod with `noNewPrivileges` enabled.
`allowPrivilegeEscalation` in `SecurityContext` provides container level
control of the `no_new_privs` flag and can override the default
`allowPrivilegeEscalation` setting.
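The defaulting and override rules above could be sketched in Go roughly as follows. This is only one reading of the proposal, not kubelet code: the `containerSecurity` struct, its field names, and both functions are hypothetical, and `RunAsUser` is simplified to a plain integer where `0` means root.

```go
package main

import "fmt"

// containerSecurity is a hypothetical, trimmed-down stand-in for the
// SecurityContext fields this proposal touches; it is not the API type.
type containerSecurity struct {
	Privileged               bool
	AddedCapabilities        []string
	RunAsUser                int64 // simplified: 0 means root
	AllowPrivilegeEscalation *bool // nil means "not set by the user"
}

// defaultAllowPrivilegeEscalation applies the proposed kubelet default:
// false, except for privileged containers, containers that add
// CAP_SYS_ADMIN, and containers that do not run as uid 0.
func defaultAllowPrivilegeEscalation(c containerSecurity) bool {
	if c.Privileged || c.RunAsUser != 0 {
		return true
	}
	for _, name := range c.AddedCapabilities {
		if name == "CAP_SYS_ADMIN" {
			return true
		}
	}
	return false
}

// noNewPrivs reports whether no_new_privs would be enabled for the
// container. An explicit AllowPrivilegeEscalation overrides the default.
func noNewPrivs(c containerSecurity) bool {
	allow := defaultAllowPrivilegeEscalation(c)
	if c.AllowPrivilegeEscalation != nil {
		allow = *c.AllowPrivilegeEscalation
	}
	return !allow
}

func main() {
	deny := false
	fmt.Println(noNewPrivs(containerSecurity{RunAsUser: 0}))
	fmt.Println(noNewPrivs(containerSecurity{RunAsUser: 1000}))
	fmt.Println(noNewPrivs(containerSecurity{RunAsUser: 1000, AllowPrivilegeEscalation: &deny}))
}
```

Using a pointer for the field keeps "unset" distinguishable from an explicit `false`, which is what lets the container-level value override the default in either direction.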

If the proposal of kubectl changes is accepted, the patch can also be submitted as a separate PR.
This requires changes to the Docker, rkt, and CRI runtime integrations so that
kubelet will add the specific `no_new_privs` option.

### Thoughts on breaking suid binaries

Ideally we would not set `allowPrivilegeEscalation` to true for uids that are
not `0`. But since we do not want to break existing deployments, we should add
a way to override the default behavior so users _can_ set `no_new_privs` for
uids that are not `0`.

TODO: should this be another option? or should `allowPrivilegeEscalation` not
be a boolean?
