- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Open Questions
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Privileged containers are containers that are enabled with similar access to the host as processes that run on the host directly. With privileged containers, users can package and distribute management operations and functionalities that require host access while retaining versioning and deployment methods provided by containers. Linux privileged containers are currently used for a variety of key scenarios in Kubernetes, including kube-proxy (via kubeadm), storage, and networking scenarios. Support for these scenarios in Windows currently requires workarounds via proxies or other implementations. This proposal aims to extend the Windows container model to support privileged containers. This proposal also aims to enable host network mode for privileged networking scenarios. Enabling privileged containers and host network mode for privileged containers would enable users to package and distribute key functionalities requiring host access.
The lack of privileged container support within the Windows container model has resulted in separate workarounds and privileged proxies for Windows workloads that are not required for Linux workloads. These workarounds have provided necessary functionality for key scenarios such as networking, storage, and device access, but have also presented many challenges, including increased available attack surfaces, complex change and update management, and scenario specific solutions. There is significant interest from the community for the Windows container model to support privileged containers and host network mode (which enable pods to be created in the host’s network compartment/namespace, as opposed to getting their own) to transition off such workarounds and align more closely with Linux support and operational models.
Furthermore, since kube-proxy cannot be run as a privileged daemonset, it must either be run with a proxy or directly on the host as a service. In the case that it is run as a service, the admin kubeconfig must be stored on the Windows node, which poses a security concern. This is also true for networking daemons such as Flannel.
- To provide a method to build, launch, and run a Windows-based container with privileged access to host resources, including the host network service, devices, disks (including hostPath volumes), etc.
- To enable access to host network resources for privileged containers and pods with host network mode
- To provide access to host network resources for non-privileged containers and pods. This is a non-goal.
- To provide a privileged mode for Hyper-V containers, or a method to run privileged process containers within a Hyper-V isolation boundary. This is a non-goal as running a Hyper-V container in the root namespace from within the isolation boundary is not supported.
- To enable privileged containers for Docker. This will only be for containerd.
- To align privileged containers with pod namespaces - this functionality may be addressed in a future KEP.
- Enabling the ability to mix privileged and non-privileged containers in the same Pod. (Multiple privileged containers running in the same Pod will be supported.)
Privileged daemon sets are used to deploy networking (CNI), storage (CSI), device plugins, kube-proxy, and other components to Linux nodes. Currently, similar set-up and deployment operations utilize wins or dedicated proxies (e.g. CSI-proxy, HNS-Proxy), or these components are installed as services running on Windows nodes. With Windows privileged containers, many of these components could run inside containers, increasing consistency between how they are deployed and/or managed on Linux and Windows. For networking scenarios, host network mode will enable these privileged deployments to access and configure host network resources.
Some interesting scenario examples:
- Cluster API
- CSI Proxy
- Logging Daemons
Windows privileged containers would also enable a wide variety of administrative tasks without requiring cluster operators to log onto each Windows node. Tasks like installing security patches, collecting specific event logs, etc. could all be done via deployments of privileged containers, as sketched below.
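For illustration, a minimal sketch of such a deployment, assuming the `hostProcess` and `runAsUserName` fields proposed in this KEP (the image name is hypothetical):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: windows-node-agent
spec:
  selector:
    matchLabels:
      app: windows-node-agent
  template:
    metadata:
      labels:
        app: windows-node-agent
    spec:
      hostNetwork: true                # required for hostProcess pods
      securityContext:
        windowsOptions:
          hostProcess: true            # every container in the pod runs as a HostProcess container
          runAsUserName: "NT AUTHORITY\\SYSTEM"
      containers:
      - name: agent
        image: example.com/node-agent:latest   # hypothetical admin/agent image
      nodeSelector:
        kubernetes.io/os: windows
```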
- Host network mode support is only targeted for privileged containers and pods.
- Privileged pods can only consist of privileged containers. Standard Windows Server containers or other non-privileged containers will not be supported. This is because containers in a Kubernetes pod share an IP. For the privileged containers with host network mode, this container IP will be the host IP. As a result, a pod cannot consist of a privileged container with the host IP and an unprivileged Windows Server container(s) sharing a vNic on the host with a different IP, or vice versa.
- We are currently investigating service mesh scenarios where privileged containers in a pod need host networking access but run alongside non-privileged containers. This requires further investigation and is out of scope for this KEP.
Most of the fundamental changes to enable this feature for Windows containers are dependent on changes within hcsshim, which serves as the runtime (container creation and management) coordinator and shim layer for containerd on Windows.
However:
- Several upstream changes are required to support this feature in Kubernetes, including changes to containerd, OCI, CRI, and kubelet. The identified changes include (see CRI and Kubelet Implementation Details below for more details on changes):
  - Containerd: enabling host network mode for privileged containers and pods (working prototype demo). The prototype is done using a containerd runtimehandler, but this proposal is to use the cri-api.
  - OCI spec: https://github.com/opencontainers/runtime-spec
    - Updates pending decisions made in this KEP regarding naming.
  - CRI-api:
    - Adding `WindowsPodSandboxConfig` and `WindowsSandboxSecurityContext` messages
    - Adding a `host_process` flag to `WindowsContainerSecurityContext`
    - Pass security context and flag of runtime spec to podsandbox spec (not currently supported, open issue: kubernetes/kubernetes#92963)
  - Kubelet: Pass the host_process flag and Windows security context options to the runtime spec.
- There are risks that changes at each of these levels may not be supported.
- If containerd changes are not supported, host network mode will not be enabled. This would restrict the scenarios that privileged containers would enable, as CNI plugins, network policy, etc. rely on host network mode to enable access to host network resources.
- If CRI changes to enable a privileged flag are not supported, there would be a less-ideal workaround via annotations in the pod container spec.
- The CRI changes may use an annotation in the OCI spec until the OCI updates are included.
For alpha we will update Pod Security Standards with information on the new `hostProcess` flag.
Additionally, privileged containers may impact other pod security policies (PSPs) outside of allowPrivilegeEscalation. We will provide guidance similar to Pod Security Standards for Windows privileged containers when graduating this feature out of alpha. There is an analysis for non-privileged containers which can be augmented with the details below. The anticipated impacted PSPs include:
| Use case | Field name | Applicable | Scenario | Priority |
| --- | --- | --- | --- | --- |
| Running of privileged containers | privileged | no | Not applicable. Windows privileged containers will be controlled with a new `WindowsSecurityContextOptions.HostProcess` instead of the existing `privileged` field due to fundamental differences in their implementation on Windows. | Alpha |
| Usage of host namespaces | HostPID, hostIPC | no | Windows does not have configurable PID/IPC namespaces (unlike Linux). Windows containers are always assigned their own process namespace. Job objects always run in the host's process namespace. These behaviors are not configurable. Future plans in this area include improvements to enable scheduling pods that can contain both normal and HostProcess/Job Object containers. These fields would not make sense in this scenario because Windows cannot configure PID/IPC namespaces like Linux can. | N/A |
| Usage of host networking and ports | hostNetwork | yes | Will be in host network by default initially. Support to set network to a different compartment may be desirable in the future. | Beta |
| Usage of volume types | Volumes | no | Not applicable. | N/A |
| Usage of the host filesystem | Allowed host paths | no | Job objects have full access to write to the root file system. The current design does not have a way to restrict access to read-only. Instead, privileged/job object containers can be run as users with limited/scoped file system access via RunAsUsername. | N/A |
| Allow specific FlexVolume drivers | Flex volume | no | Not applicable. | N/A |
| Allocating an FSGroup that owns the pod's volumes | Fsgroup (file system group) | no | The privileged container can be tied to run as a particular user that determines access to different fsgroups. | N/A |
| The user and group IDs of the container | Runasuser, runasgroup, supplementalgroup | no | Assigning users to groups would have to occur at node provisioning, or via a privileged container deployment. | N/A |
| Restricting escalation to root privileges | Allowprivilegedescalation, default | no | Privilege via job objects is not granularly configurable. | N/A |
| Linux capabilities | Capabilities | no | The Windows OS has a concept of "capabilities" (referred to as "privileged constants") but they are not supported in the platform today. | N/A |
| Restrictions that could be applied to Windows Privileged Containers | Other restrictions for job objects | TBD | There are restrictions that could be enabled via the job object, i.e. UI restrictions. | N/A |
| Use GMSA with privileged containers | GMSA – would need to implement | yes | Required for auth to domain controller. | GA |
Windows privileged containers will be implemented with Job Objects, a break from the previous container model using server silos. Job objects provide the ability to manage a group of processes as a group, and assign resource constraints to the processes in the job. Job objects have no process or file system isolation, enabling the privileged payload to view and edit the host file system with the correct permissions, among other host resources. The init process, and any processes it launches or that are explicitly launched by the user, are all assigned to the job object of that container. When the init process exits or is signaled to exit, all the processes in the job will be signaled to exit, the job handle will be closed and the storage will be unmounted.
Because Windows privileged containers will work much differently than Linux privileged containers they will be referred to as HostProcess containers in kubernetes specs and user-facing documentation. Hopefully this will encourage users to seek documentation to better understand the capabilities and behaviors of these privileged containers.
- The container will be in the host’s network namespace (default network compartment) so it will have access to all the host’s network interfaces and have the host's IP as well.
- Resource limits (disk, memory, cpu count) will be applied to the job and will be job wide. For example, if a limit of 10 MB is set for the job and the memory allocations of all processes in the job add up to more than 10 MB, the limit would be reached. This is the same behavior as other Windows container types. These limits would be specified the same way they are currently for whatever orchestrator/runtime is being used.
  - Note: HostProcess containers will have access to the node's root filesystem. Disk limits and resource usage will only apply to the scratch volume provisioned for each HostProcess container.
- The container's lifecycle will be managed by the container runtime just like other Windows container types.
- By default `hostProcess` containers can run as one of the following system accounts:
  - `NT AUTHORITY\SYSTEM`
  - `NT AUTHORITY\Local service`
  - `NT AUTHORITY\NetworkService`
- Running privileged containers as non-SYSTEM/admin accounts will be the primary way operators can restrict access to system resources (files, registry, named pipes, WMI, etc).
- To run a `hostProcess` container as a non-SYSTEM/admin account, a local users group must first be created on the host. Permissions to restrict access to system resources should be configured to allow/deny access for the group. When a new `hostProcess` container is created with the name of a local users group set as the `runAsUserName`, a temporary user account will be created as a member of the specified group for the container to run as.
  - More information on Windows resource access can be found at https://docs.microsoft.com/archive/msdn-magazine/2008/november/access-control-understanding-windows-file-and-registry-permissions
  - An example of configuring a non-SYSTEM/admin account can be found at microsoft/hcsshim#1286 (comment)
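As an illustrative sketch, a pod spec running a `hostProcess` container as a local users group might look like the following (the group name `hpc-restricted` and the image are hypothetical; the group would need to be created on the node beforehand):

```yaml
spec:
  hostNetwork: true
  securityContext:
    windowsOptions:
      hostProcess: true
      # Hypothetical local users group created on the host ahead of time;
      # a temporary user account in this group is created for the container to run as.
      runAsUserName: "hpc-restricted"
  containers:
  - name: agent
    image: example.com/node-agent:latest   # hypothetical image
  nodeSelector:
    kubernetes.io/os: windows
```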
There will be two different behaviors for how volume mounts are configured in `hostProcess` containers.
- Bind Mounts
  - With this approach Windows' bind-filter driver will be used to create a view that merges the host OS's filesystem with container-local volumes.
  - When `hostProcess` containers are started a new volume will be created which contains the contents of the container image. This volume will be mounted at `c:\hpc`.
  - Additional volume mounts specified for `hostProcess` containers will be mounted at their requested location and can be accessed the same way as volume mounts in Linux or regular Windows Server containers.
    - ex: a volume with a mountPath of `/var/run/secrets/token` will be mounted at `c:\var\run\secrets\token` for containers.
  - Volume mounts will only be visible to the containers they are mounted into.
  - The default working directory for `hostProcess` containers will also be set to `c:\hpc`.
  - If a volume is mounted over a path that already exists on the host then only the contents of the mounted volume will be visible to the `hostProcess` container. This is the same behavior as regular Windows Server containers.
    - A `warn` message will be written to the containerd logs if a volume is being mounted at a location that already exists on the host.
- Symlinks
  - With this approach container image contents and volume mounts will be mounted at predictable paths on the host's filesystem.
  - When `hostProcess` containers are started a new volume will be created which contains the contents of the container image. This volume will be mounted at `c:\C\{container-id}`.
  - Additional volume mounts specified for `hostProcess` containers will be mounted at `c:\C\{container-id}\{mount-destination}`.
    - ex: a volume with a mountPath of `/var/run/secrets/token` for a container with id `1234` can be accessed at `c:\C\1234\var\run\secrets\token`.
  - An environment variable `$CONTAINER_SANDBOX_MOUNT_POINT` will be set to the path where the container volume is mounted (`c:\C\{container-id}`) to access content.
    - This environment variable can be used inside the Pod manifest / command line / args for containers (see the sketch below).
A recording of the behavior differences from a SIG-Windows community meeting can be found here. Note - In the recording it was mentioned that this functionality might not be supported on WS2019. This functionality will be available in WS2019 but will require an OS patch (ETA: July 2022).
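For example, with the symlink behavior a workload could locate its service account token through the mount-point variable; a sketch (the image name and command are illustrative):

```yaml
spec:
  hostNetwork: true
  securityContext:
    windowsOptions:
      hostProcess: true
      runAsUserName: "NT AUTHORITY\\SYSTEM"
  containers:
  - name: agent
    image: example.com/node-agent:latest   # hypothetical image
    command:
    - powershell.exe
    - -Command
    # Under symlink behavior the token lives below the per-container sandbox
    # mount point rather than at a fixed host path.
    - Get-Content $env:CONTAINER_SANDBOX_MOUNT_POINT\var\run\secrets\kubernetes.io\serviceaccount\token
  nodeSelector:
    kubernetes.io/os: windows
```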
Additionally the following will be true for either volume mount behavior:
- Named pipe mounts will not be supported. Instead named pipes should be accessed via their path on the host (`\\.\pipe\*`). The following error will be returned if `hostProcess` containers attempt to use named pipe mounts - https://github.com/microsoft/hcsshim/blob/358f05d43d423310a265d006700ee81eb85725ed/internal/jobcontainers/mounts.go#L40.
- Unix domain socket mounts will also not be supported for `hostProcess` containers. Unix domain sockets can be accessed via their paths on the host like named pipes.
- Mounting directories from the host OS into `hostProcess` containers will work just like with normal containers but this is not recommended. Instead workloads should access the host OS's file-system as if not being run in a container.
- All other volume types supported for normal containers on Windows will work with `hostProcess` containers.
- `HostProcess` containers will have full access to the host file-system (unless restricted by file-based ACLs and the `run_as_username` used to start the container).
  - There will be no `chroot` equivalent.
During the alpha/beta implementations of this feature only the symlink volume mount behavior was implemented. This implementation did unlock a lot of critical use cases for managing Windows nodes in Kubernetes clusters but had some usability issues (such as https://pkg.go.dev/k8s.io/client-go/rest#InClusterConfig not working as expected).
The bind mount volume mount behavior gives full access to the host OS's filesystem (an explicit goal of this enhancement) and addresses the usability issues with the initial approach. This approach requires the use of Windows OS APIs that were not present in Windows Server 2019 during alpha/beta implementations of this feature. These APIs will be available in WS2019 beginning in July 2022 with the monthly OS security patches. Containerd v1.7+ will be required for this behavior.
- On containerd v1.6 symlink volume mount behavior will always be used.
- On containerd v1.7 bind volume mount behavior will always be used.
  - Backwards compatibility with volume mount paths has been added in containerd v1.7. This means that existing workloads that used `$CONTAINER_SANDBOX_MOUNT_POINT` to access volume mounts will work without updates.
- `HostProcess` containers can be built on top of existing Windows base images (nanoserver, servercore, etc).
- A new Windows container base image has been introduced for `hostProcess` containers. More info is available at https://github.com/microsoft/windows-host-process-containers-base-image.
  - Note: `HostProcess` containers do not inherit the same compatibility requirements as process-isolated containers from an OS perspective, but individual container runtimes may have different image pulling / platform matching behavior.
- `HostProcess` container images based on nanoserver can be built with Docker. `HostProcess` container images based on the new base image must be built with buildkit.
  - Only a subset of dockerfile operations will be supported (ADD, COPY, PATH, ENTRYPOINT, etc).
    - Note: The subset of dockerfile operations supported for `HostProcess` containers is very close to the subset of operations supported when building images for other OS's with buildkit (similar to how the pause image is built in kubernetes/kubernetes).
- Documentation on building `HostProcess` containers will be added at either docs.microsoft.com or a new GitHub repository.
We will need to add a `hostProcess` field to the runtime spec. We can model this after the Linux pod security context and container security context, which have a boolean that is set to `true` for privileged containers. References:
For Windows we are proposing the following updates to CRI-API:
Add WindowsPodSandboxConfig (and add it to PodSandboxConfig):
message WindowsPodSandboxConfig {
WindowsSandboxSecurityContext security_context = 1;
}
Add WindowsSandboxSecurityContext:
message WindowsSandboxSecurityContext {
string run_as_username = 1;
string credential_spec = 2;
bool host_process = 3;
}
Update WindowsContainerSecurityContext by adding host_process field:
message WindowsContainerSecurityContext {
string run_as_username = 1;
string credential_spec = 2;
bool host_process = 3;
}
Note: For alpha annotations on RunPodSandbox and CreateContainer CRI calls may be used until a version of containerd with Windows privileged container support is released.
A new `*bool` field named `hostProcess` will be added to `WindowsSecurityContextOptions`.
On Windows, all containers in a pod must be privileged. Because of this behavior, and because `WindowsSecurityContextOptions` already exists on both `PodSecurityContext` and `Container.SecurityContext`, Windows containers will use this new field instead of re-using the existing `privileged` field, which only exists on `SecurityContext`.
Additionally, the existing `privileged` field does not clearly describe what capabilities the container has (see kubernetes/kubernetes#44503). Documentation will be added to clearly describe what capabilities these new "hostProcess" containers have.
Current behavior applies `PodSecurityContext.WindowsSecurityContextOptions` settings to all `Container.SecurityContext.WindowsSecurityContextOptions` unless those settings are already specified on the container. To address this the following API validation will be added:
- If `PodSecurityContext.WindowsSecurityContextOptions.HostProcess` is set to true then no container in the pod may set `Container.SecurityContext.WindowsSecurityContextOptions.HostProcess = false`.
- If `PodSecurityContext.WindowsSecurityContextOptions.HostProcess` is not set then all containers in a pod must set `Container.SecurityContext.WindowsSecurityContextOptions.HostProcess = true`.
- If `PodSecurityContext.WindowsSecurityContextOptions.HostProcess = false` then no containers may set `Container.SecurityContext.WindowsSecurityContextOptions.HostProcess = true`.
- `hostNetwork = true` must be set explicitly if the pod contains all hostProcess containers (this value will not be inferred and/or defaulted).
Additionally kube-apiserver will disallow `hostProcess` containers from being scheduled if `--allow-privileged=false` is passed as an argument.
https://github.com/kubernetes/kubernetes/blob/release-1.20/pkg/apis/core/validation/validation.go#L5767-L5771 for reference.
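To make these rules concrete, here is a sketch of a pod spec that the added validation would reject, because a container contradicts the pod-level `hostProcess` setting:

```yaml
spec:
  hostNetwork: true
  securityContext:
    windowsOptions:
      hostProcess: true          # pod-level: all containers are hostProcess
  containers:
  - name: foo
    image: image1:latest
  - name: bar
    image: image2:latest
    securityContext:
      windowsOptions:
        hostProcess: false       # contradicts the pod-level setting -> rejected
  nodeSelector:
    "kubernetes.io/os": windows
```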
Option 1: Re-use the `SecurityContext.Privileged` field.
Re-using the existing `SecurityContext.Privileged` field was considered; here are the pros/cons:
Pros
- The field already exists and many policy tools already leverage it.
Cons
- Privileged containers on Windows will operate very differently than privileged containers on Linux. Having a new field should help avoid confusion around the differences between the two.
- The privileged field does not have clear meaning for Linux containers today (see comments above).
- `WindowsSecurityContextOptions.RunAsUserName` will be the primary way of restricting access to host/node resources (see Container users). It is desirable that the `RunAsUserName` and `HostProcess` fields live on the same property.
- API validation to ensure all containers are either privileged or not will be difficult because there is no way of definitively knowing that a pod is intended for a Windows node.
Host Network mode for privileged Windows containers will always be enabled, as the pod will automatically get the host IP.
Privileged Windows containers will be unable to align to pod namespaces due to limitations in the Windows OS. This functionality will likely be enabled in the future through a new KEP.
Because of this we will require that `hostNetwork` is set to `true` when scheduling privileged pods. This will allow existing policy tools to detect and act on privileged Windows containers without any updates. In the future, if/when functionality is added to support joining privileged containers to pod networks, this validation will be revisited.
Here are two examples of valid specs each containing two privileged Windows containers:
spec:
hostNetwork: true
securityContext:
windowsOptions:
hostProcess: true
containers:
- name: foo
image: image1:latest
- name: bar
image: image2:latest
nodeSelector:
"kubernetes.io/os": windows
spec:
hostNetwork: true
containers:
- name: foo
image: image1:latest
securityContext:
windowsOptions:
hostProcess: true
- name: bar
image: image2:latest
securityContext:
windowsOptions:
hostProcess: true
nodeSelector:
"kubernetes.io/os": windows
Kubelet will pass the privileged flag from `WindowsSecurityContextOptions` to the appropriate CRI layer calls.
Note: For alpha kubelet may add well-known annotations to CRI calls if privileged flags are set.
Add functionality to Kuberuntime_sandbox to:
- Split out the linux sandbox creation and add windows sandbox creation
- Configure all privileged Windows pods to join the host network
The following extra validation will be added to the kubelet for Windows. These checks will ensure privileged pods work correctly on Windows if these are not validated by apiserver.
- Ensure all containers in a pod are privileged, if any are.
- Ensure `hostNetwork = true` is set if the pod contains privileged containers.
There are no plans to update Docker and/or dockershim to support privileged containers due to requirements on HCSv2. Currently containerd is the only container runtime with a Windows HCSv2 implementation, so containerd will be required.
Validation will be added in the kubelet to fail to schedule a pod if the node is configured to use dockershim and the pod contains privileged Windows containers.
Privileged container functionality on Windows will be gated behind a new `WindowsHostProcessContainers` feature gate.
https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/#feature-stages
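As a sketch, the gate can be enabled on a kubelet through its configuration file while the feature is not yet on by default (kube-apiserver takes the equivalent `--feature-gates=WindowsHostProcessContainers=true` flag):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # Alpha (off by default) in v1.22; must be enabled explicitly.
  WindowsHostProcessContainers: true
```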
[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Alpha
- Unit tests around validation logic for new API fields.
- Add e2e test validating basic privileged container functionality (pod starts and run in a privileged context on the node)
- Update Pod Security Standards doc to disallow `hostProcess` containers in the baseline/default and restricted policies.
Beta
- Validate running kube-proxy as a daemon set
- Validate CSI-proxy running as a daemon set
- Validate running a CNI implementation as a daemon set
- Validate behaviors of various volume mount types as described in Container Mounts with e2e tests
- Add e2e tests to test different ways to construct paths for container command, args, and workingDir fields for both `hostProcess` and non-hostProcess containers. These tests will include constructing paths with and without `$CONTAINER_SANDBOX_MOUNT_POINT` set and with different combinations of forward and backward slashes.
Graduation
- Add e2e tests to validate running `hostProcess` containers as non-SYSTEM/admin accounts
- Update e2e tests for new volume mount behavior as described in Container Mounts
No additional tests have been identified that would be required prior to implementing this enhancement.
- `k8s.io/kubernetes/pkg/api/pod`: 2022-05-27 - 66.7%
- `k8s.io/kubernetes/pkg/apis/core`: 2022-05-27 - 78.9%
- `k8s.io/kubernetes/pkg/kubelet/container`: 2022-05-27 - 52.1%
- `k8s.io/kubernetes/pkg/kubelet/kuberuntime`: 2022-05-27 - 66.7%
- `k8s.io/kubernetes/pkg/securitycontext`: 2022-05-27 - 66.8%
- `k8s.io/cri-api/pkg/apis/runtime/v1`: 2022-05-27 - No unit test coverage (protobuf definition)
- `k8s.io/test/e2e/windows`: 2022-05-27 - No unit test coverage (this package contains e2e test code)
It is not currently possible to test Windows-specific code through the existing integration test frameworks. For this enhancement unit and e2e tests will be used for validation.
- [sig-windows] [Feature:WindowsHostProcessContainers] [MinimumKubeletVersion:1.22] HostProcess containers should run as a process on the host/node: source
- [sig-windows] [Feature:WindowsHostProcessContainers] [MinimumKubeletVersion:1.22] HostProcess containers should support init containers: source
- [sig-windows] [Feature:WindowsHostProcessContainers] [MinimumKubeletVersion:1.22] HostProcess containers container command path validation: source
- [sig-windows] [Feature:WindowsHostProcessContainers] [MinimumKubeletVersion:1.22] HostProcess containers metrics should report count of started and failed to start HostProcess containers: source
- [sig-windows] [Feature:WindowsHostProcessContainers] [MinimumKubeletVersion:1.22] HostProcess containers should support various volume mount types: source
TestGrid job for Kubernetes@master - (https://testgrid.k8s.io/sig-windows-signal#capz-windows-containerd-master&include-filter-by-regex=WindowsHostProcessContainers)
k8s-triage link - (https://storage.googleapis.com/k8s-triage/index.html?job=capz&test=Feature%3AWindowsHostProcessContainers)
Alpha plan
- Version of containerd: Target v1.5
- Version of Kubernetes: Target 1.22
- OS support: Windows 2019 LTSC and all future versions of Windows Server
- Alpha Feature Gate for passing privileged flag or annotations to CRI calls.
Graduation to Beta
- Kubernetes Target 1.23
- Set `WindowsHostProcessContainers` feature gate to `beta`
- Go through PSP Linux tests (e2e: validation & conformance) and make them relevant for Windows (which apply, which don't, and where we need to write new tests).
- Provide guidance similar to Pod Security Standards for Windows privileged containers.
- CRI Support for HostProcess containers.
- Containerd release is available with HostProcess support (Either v1.6 OR changes backported to a v1.5 patch) - (containerd/containerd#5131)
- Windows Host Process annotations removed from CRI. (Discussed at (kubernetes/kubernetes#99576 (comment)))
- OS support: Windows 2019 LTSC and all future versions of Windows Server.
- Documentation for `HostProcess` containers on https://kubernetes.io/.
  - Includes clarification around disk limits mentioned in Resource Limits.
- Documentation on docs.microsoft.com for building `HostProcess` container images.
- Update validation logic for `HostProcess` containers in api-server to handle ephemeral containers.
  - Note: If an ephemeral container is also a `HostProcess` container then all containers in the pod must also be `HostProcess` containers (and vice versa).
Graduation to GA:
- Add documentation for running as a non-SYSTEM/admin account to k8s.io
- Update documentation on how volume mounts are set up for `hostProcess` containers on k8s.io
- Set `WindowsHostProcessContainers` feature gate to `GA`
- Provide reference images/workloads using the `bind` volume mounting behavior in Cluster-API-Provider-Azure (which is used to run the majority of Windows e2e test passes).
- Migrate all deployments using `hostProcess` containers in the sig-windows-tools repo to be compatible with the `bind` volume behavior.
- Windows: This implementation requires no backports for OS components.
- Kubernetes: No changes required outside of ensuring feature gates are set while feature is in development.
- Containerd: Must run a version of containerd with privileged container support (targeting v1.5+).
N/A
This section must be completed when targeting alpha to a release.
-
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: WindowsHostProcessContainers
  - Components depending on the feature gate: Kubelet, kube-apiserver
- Other
  - Describe the mechanism:
  - Will enabling / disabling the feature require downtime of the control plane?
  - Will enabling / disabling the feature require downtime or reprovisioning of a node?
-
Does enabling the feature change any default behavior? No
-
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? This feature can be disabled. If this feature flag is disabled in kube-apiserver then new pods which try to schedule `hostProcess` containers will be rejected by kube-apiserver. If this flag is disabled in the kubelet then new `hostProcess` containers will not be started and an appropriate event will be emitted.
What happens if we reenable the feature if it was previously rolled back? Newly created privileged Windows containers will run as expected.
-
Are there any tests for feature enablement/disablement? No
This section must be completed when targeting beta graduation to a release.
-
How can a rollout fail? Can it impact already running workloads? Try to be as paranoid as possible - e.g., what if some components will restart mid-rollout?
-
What specific metrics should inform a rollback?
-
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? Describe manual testing that was done and the outcomes. Longer term, we may want to require automated upgrade/rollback tests, but we are missing a bunch of machinery and tooling and can't do that now.
-
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? Even if applying deprecation policies, they may still surprise some users.
This section must be completed when targeting beta graduation to a release.
-
How can an operator determine if the feature is in use by workloads?
Kubelet metrics will be updated to report the number of HostProcess containers started and the number of HostProcess containers that failed to start.
TBD: Confirm whether the best way to accomplish this is to add new values/metric labels to the `StartedContainersTotal` and `StartedContainersError` counters. Otherwise we could add new counters.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name: started_host_process_containers_total - reports the total number of host-process containers started on a given node
- Metric name: started_host_process_containers_errors_total - reports the total number of host-process containers that have failed to start on a given node.
- [Optional] Aggregation method:
- Components exposing the metric: Kubelet
- Notes: Both metrics were added in v1.23 and are validated with e2e tests
- Other (treat as last resort)
- Details:
- Metrics
-
What are the reasonable SLOs (Service Level Objectives) for the above SLIs? The same SLOs for starting/stopping non-hostprocess containers would apply here.
-
Are there any missing metrics that would be useful to have to improve observability of this feature? N/A
This section must be completed when targeting beta graduation to a release.
-
Does this feature depend on any specific services running in the cluster?
  - [ContainerD]
    - Usage description: `HostProcess` containers support will not be added to dockershim.
      - Containerd v1.6.x is required for `symlink` volume mount behavior.
      - Containerd v1.7+ is required for `bind` volume mount behavior.
    - Impact of its outage on the feature: Containers will fail to start.
    - Impact of its degraded performance or high-error rates on the feature: Containers may behave unexpectedly and the node may go into the NotReady state.
For alpha, this section is encouraged: reviewers should consider these questions and attempt to answer them.
For beta, this section is required: reviewers must answer these questions.
For GA, this section is required: approvers should be able to confirm the previous answers based on experience in the field.
-
Will enabling / using this feature result in any new API calls? No
-
Will enabling / using this feature result in introducing new API types? No
-
Will enabling / using this feature result in any new calls to the cloud provider? No
-
Will enabling / using this feature result in increasing size or count of the existing API objects? A new field is being added so API object size will grow slightly larger.
-
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? No
-
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? No - `HostProcess` containers will honor limits/reserves specified in the specs and will count against node quota just like unprivileged containers.
The Troubleshooting section currently serves the `Playbook` role. We may consider splitting it into a dedicated `Playbook` document (potentially with some monitoring details). For now, we leave it here.
This section must be completed when targeting beta graduation to a release.
-
How does this feature react if the API server and/or etcd is unavailable? This feature will not change any behaviors around Pod scheduling if API server and/or etcd is unavailable.
-
What are other known failure modes? For each of them, fill in the following information by copying the below template:
- [InClusterConfig() fails inside HostProcessContainers]
- Causes:
- Due to how volume mounts for HostProcess containers are configured in containerd v1.6.X, service account tokens in the container are not present at the location expected by the golang API rest client.
- Mitigations:
- If containers are using `symlink` mount behavior as described at [Compatibility](#compatibility) you can construct a kubeconfig file with the container's assigned service account token and use that to authenticate.
Example: https://github.com/kubernetes-sigs/sig-windows-tools/blob/fbe00b42e2a5cca06bc182e1b6ee579bd65ed1b5/hostprocess/calico/install/calico-install.ps1#L8-L11
- Switch to a container runtime/version that supports `bind` mount behavior as described at [Compatibility](#compatibility)
- Diagnostics:
- Calls to rest.InClusterConfig will fail in workloads.
- Testing:
- No - known limitation
- [Containers running as non-HostProcessContainers]
- Causes:
- Container runtime does not support HostProcessContainers
- Bug in kubelet in some v1.23/v1.24 patch versions [#110140](https://github.com/kubernetes/kubernetes/pull/110140)
- Detection:
- Varies based on cause
- Likely result will be an error in the app/workload running inside containers
- Mitigations:
- If the error is caused by [#110140](https://github.com/kubernetes/kubernetes/pull/110140) then either specify `PodSecurityContext.WindowsSecurityContextOptions.HostProcess=true` (instead of setting `HostProcess=true` on each container's `SecurityContext.WindowsSecurityContextOptions`) or upgrade the kubelet to a version with the fix for this issue.
- Provision nodes with a containerd v1.6+
- Diagnostics:
- Exec into a container and run `whoami` and ensure running user is as expected (ex: not ContainerUser or ContainerAdministrator for HostProcessContainers)
- Run `kubectl get nodes -o wide` to check the container runtime and version for nodes
- Examine container logs
- On the node run `crictl inspectp [podid]` and ensure pod has "microsoft.com/hostprocess-container": "true" in annotation list (to detect [#110140](https://github.com/kubernetes/kubernetes/pull/110140))
- Inspect container `trace` log messages and ensure `hostProcess=true` is set for `RunPodSandbox` calls.
- Testing:
- Yes - tests have been added to [#110140](https://github.com/kubernetes/kubernetes/pull/110140) to catch issues caused by this bug.
- [HostProcess containers fail to start with `failed to create user process token: failed to logon user: Access is denied.: unknown`]
- Causes:
- Containerd is running as a user account.
On Windows user accounts (even Administrator accounts) cannot create logon tokens for system (which can be used by HostProcessContainers).
- Detection:
- Metrics: **started_host_process_containers_errors_total** count increasing
- Events: ContainerCreate failure events with reason of `failed to create user process token: failed to logon user: Access is denied.: unknown`
- Mitigations:
- Run containerd as `LocalSystem` (default) or `LocalService` service accounts
- Diagnostics:
- On the node run `Get-Process containerd -IncludeUserName` to see which account containerd is running as.
- Testing:
- No - It is not feasible to restart the container runtime as a different user during tests passes.
- What steps should be taken if SLOs are not being met to determine the problem?
Kubelet and/or containerd logs will need to be inspected if problems are encountered creating HostProcess containers on Windows nodes.
- 2020-09-11: Issue #1981 created.
- 2020-12-17: Initial KEP draft merged - #2037.
- 2021-02-17: KEP approved for alpha release - #2288.
- 2021-05-20: Alpha implementation PR merged - kubernetes/kubernetes#99576.
- 2021-08-05: K8s 1.22 released with alpha support for the `WindowsHostProcessContainers` feature.
- 2021-08-21: HostProcessContainers (via CRI) support added to containerd - containerd/containerd#5131.
- 2021-12-07: K8s 1.23 released with beta support for the `WindowsHostProcessContainers` feature.
- 2022-02-15: Containerd 1.6.0 released with support for HostProcessContainers.
- 2022-12-08: K8s 1.26 released with the `WindowsHostProcessContainers` feature stable.
- Use containerd RuntimeHandlers and K8s RuntimeClasses - RuntimeHandlers are used in the prototype. Adding the fields to the CRI instead gives the kubelet more control over the security context and the fields that it allows through, enabling additional checks (such as runAsNonRoot).
- Use annotations on CRI to pass the privileged flag to containerd - Adding the field to the CRI spec allows the existing CRI calls to work as-is. The resulting code is cleaner and doesn't rely on magic strings. There is currently a PR adding the SecurityFields to the CRI API, adding sandbox-level security support for Windows containers. The RunAsUsername will be required for privileged containers to make sure every container (including pause) runs as the correct user to limit access to the file system.
- What’s the future of plug-ins that will be impacted
- What will be the future of CSI-proxy and other plug-ins that will be impacted?
- CSI-proxy and HNS-proxy are likely to be impacted
- Container base image support
- Is “from scratch” required
- Would a slimmer “privileged base image” be more desirable than using standard server core?
- Container image build differences with traditional windows server and impacts on image use and distribution
- Should PSP be updated with latest checks or should out-of-tree enforcement tool be used?
- PSP will be deprecated, and documentation and guidance should be produced for Pod Security Standards. Implementations in out-of-tree enforcement tools should be favored, and a POC/implementation in Gatekeeper would be a great way to demonstrate this.
- Scheduling checks
- Privileged containers in the same network compartment as a non-privileged pod, or privileged init containers in otherwise non-privileged pods, may still be able to access the host network.