
KEP-1981: Windows Privileged Containers and Host Networking Mode

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Privileged containers are containers that are enabled with similar access to the host as processes that run on the host directly. With privileged containers, users can package and distribute management operations and functionalities that require host access while retaining versioning and deployment methods provided by containers. Linux privileged containers are currently used for a variety of key scenarios in Kubernetes, including kube-proxy (via kubeadm), storage, and networking scenarios. Support for these scenarios in Windows currently requires workarounds via proxies or other implementations. This proposal aims to extend the Windows container model to support privileged containers. This proposal also aims to enable host network mode for privileged networking scenarios. Enabling privileged containers and host network mode for privileged containers would enable users to package and distribute key functionalities requiring host access.

Motivation

The lack of privileged container support within the Windows container model has resulted in separate workarounds and privileged proxies for Windows workloads that are not required for Linux workloads. These workarounds have provided necessary functionality for key scenarios such as networking, storage, and device access, but have also presented many challenges, including increased available attack surfaces, complex change and update management, and scenario specific solutions. There is significant interest from the community for the Windows container model to support privileged containers and host network mode (which enable pods to be created in the host’s network compartment/namespace, as opposed to getting their own) to transition off such workarounds and align more closely with Linux support and operational models.

Furthermore, since kube-proxy cannot be run as a privileged daemonset, it must either be run with a proxy or directly on the host as a service. In the case that it is run as a service, the admin kubeconfig must be stored on the Windows node, which poses a security concern. This is also true for networking daemons such as Flannel.

Goals

  • To provide a method to build, launch, and run a Windows-based container with privileged access to host resources, including the host network service, devices, disks (including hostPath volumes), etc.
  • To enable access to host network resources for privileged containers and pods with host network mode

Non-Goals

  • To provide access to host network resources for non-privileged containers and pods.
  • To provide a privileged mode for Hyper-V containers, or a method to run privileged process containers within a Hyper-V isolation boundary. This is a non-goal as running a Hyper-V container in the root namespace from within the isolation boundary is not supported.
  • To enable privileged containers for Docker. This will only be for containerd.
  • To align privileged containers with pod namespaces - this functionality may be addressed in a future KEP.
  • Enabling the ability to mix privileged and non-privileged containers in the same Pod. (Multiple privileged containers running in the same Pod will be supported.)

Proposal

Use case 1: Privileged Daemon Sets

Privileged daemon sets are used to deploy networking (CNI), storage (CSI), device plugins, kube-proxy, and other components to Linux nodes. Currently, similar set-up and deployment operations utilize wins or dedicated proxies (e.g. CSI-proxy, HNS-proxy), or these components are installed as services running on Windows nodes. With Windows privileged containers, many of these components could run inside containers, increasing consistency between how they are deployed and/or managed on Linux and Windows. For networking scenarios, host network mode will enable these privileged deployments to access and configure host network resources. A sketch of such a daemon set follows the scenario examples below.

Some interesting scenario examples:

  • Cluster API
  • CSI Proxy
  • Logging Daemons
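
As a concrete illustration, here is a minimal sketch of what a privileged daemon set could look like using the hostProcess field proposed later in this KEP. The name, namespace, and image are illustrative assumptions, not part of this proposal:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: windows-node-agent   # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: windows-node-agent
  template:
    metadata:
      labels:
        app: windows-node-agent
    spec:
      hostNetwork: true
      securityContext:
        windowsOptions:
          hostProcess: true
          runAsUserName: "NT AUTHORITY\\SYSTEM"
      containers:
      - name: agent
        image: example.com/windows-node-agent:latest   # hypothetical image
      nodeSelector:
        "kubernetes.io/os": windows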

Use case 2: Administrative tasks

Windows privileged containers would also enable a wide variety of administrative tasks without requiring cluster operators to log onto each Windows node. Tasks like installing security patches, collecting specific event logs, etc. could all be done via deployments of privileged containers.

Notes/Constraints/Caveats (Optional)

  • Host network mode support is only targeted for privileged containers and pods.
  • Privileged pods can only consist of privileged containers. Standard Windows Server containers or other non-privileged containers will not be supported. This is because containers in a Kubernetes pod share an IP. For privileged containers with host network mode, this container IP will be the host IP. As a result, a pod cannot consist of a privileged container with the host IP and unprivileged Windows Server container(s) sharing a vNIC on the host with a different IP, or vice versa.
  • We are currently investigating service mesh scenarios where privileged containers in a pod will need host networking access but run alongside non-privileged containers in a pod. This would require further investigation and is out of scope for this KEP.

Risks and Mitigations

Most of the fundamental changes to enable this feature for Windows containers are dependent on changes within hcsshim, which serves as the runtime (container creation and management) coordinator and shim layer for containerd on Windows.

However:

  • Several upstream changes are required to support this feature in Kubernetes, including changes to containerd, OCI, CRI, and kubelet. The identified changes include (see CRI and Kubelet Implementation Details below for more details on changes):
    • Containerd: enabling host network mode for privileged containers and pods (working prototype demo). The prototype was done using a containerd runtime handler, but this proposal is to use the CRI API.
      • OCI spec: https://github.com/opencontainers/runtime-spec
        • Updates pending decisions made in this KEP regarding naming.
      • CRI-api:
        • Adding WindowsPodSandboxConfig and WindowsSandboxSecurityContext message
        • Adding host_process flag to WindowsContainerSecurityContext
        • Pass security context and flag of runtime spec to podsandbox spec (not currently supported, open issue: kubernetes/kubernetes#92963)
    • Kubelet: Pass host_process flag and windows security context options to runtime spec.
  • There are risks that changes at each of these levels may not be supported.
    • If containerd changes are not supported, host network mode will not be enabled. This would restrict the scenarios that privileged containers would enable, as CNI plugins, network policy, etc. rely on host network mode to enable access to host network resources.
    • If CRI changes to enable a privileged flag are not supported, there would be a less-ideal workaround via annotations in the pod container spec.
    • The CRI changes may be expressed as an annotation in the OCI spec until the OCI updates are included.

Pod Security Implications

For alpha we will update Pod Security Standards with information on the new hostProcess flag.

Additionally, privileged containers may impact other pod security policies (PSPs) outside of allowPrivilegeEscalation. We will provide guidance similar to Pod Security Standards for Windows privileged containers when graduating this feature out of alpha. There is an analysis for non-privileged containers which can be augmented with the details below. The anticipated impacted PSPs include:

| Use case | Field name | Applicable | Scenario | Priority |
| --- | --- | --- | --- | --- |
| Running of privileged containers | privileged | no | Not applicable. Windows privileged containers will be controlled with a new `WindowsSecurityContextOptions.HostProcess` field instead of the existing `privileged` field due to fundamental differences in their implementation on Windows. | Alpha |
| Usage of host namespaces | hostPID, hostIPC | no | Windows does not have configurable PID/IPC namespaces (unlike Linux). Windows containers are always assigned their own process namespace. Job objects always run in the host's process namespace. These behaviors are not configurable. Future plans in this area include improvements to enable scheduling pods that can contain both normal and HostProcess/Job Object containers. These fields would not make sense in this scenario because Windows cannot configure PID/IPC namespaces like Linux can. | N/A |
| Usage of host networking and ports | hostNetwork | yes | Will be in host network by default initially. Support to set network to a different compartment may be desirable in the future. | Beta |
| Usage of volume types | volumes | no | Not applicable. | N/A |
| Usage of the host filesystem | allowedHostPaths | no | Job objects have full access to write to the root file system. The current design does not have a way to restrict access to read-only. Instead, privileged/job object containers can be run as users with limited/scoped file system access via runAsUserName. | N/A |
| Allow specific FlexVolume drivers | allowedFlexVolumes | no | Not applicable. | N/A |
| Allocating an FSGroup that owns the pod's volumes | fsGroup (file system group) | no | The privileged container can be tied to run as a particular user that determines access to different fsGroups. | N/A |
| The user and group IDs of the container | runAsUser, runAsGroup, supplementalGroups | no | Assigning users to groups would have to occur at node provisioning, or via a privileged container deployment. | N/A |
| Restricting escalation to root privileges | allowPrivilegeEscalation, defaultAllowPrivilegeEscalation | no | Privilege via job objects is not granularly configurable. | N/A |
| Linux capabilities | capabilities | no | The Windows OS has a concept of “capabilities” (referred to as “privileged constants”) but they are not supported in the platform today. | N/A |
| Restrictions that could be applied to Windows Privileged Containers | Other restrictions for job objects | TBD | There are restrictions that could be enabled via the job object, e.g. UI restrictions. | N/A |
| Use GMSA with privileged containers | GMSA – would need to implement | yes | Required for auth to domain controller. | GA |

Design Details

Overview

Windows privileged containers will be implemented with Job Objects, a break from the previous container model using server silos. Job objects provide the ability to manage multiple processes as a group and to assign resource constraints to the processes in the job. Job objects have no process or file system isolation, enabling the privileged payload to view and edit the host file system with the correct permissions, among other host resources. The init process, and any processes it launches or that are explicitly launched by the user, are all assigned to the job object of that container. When the init process exits or is signaled to exit, all the processes in the job will be signaled to exit, the job handle will be closed and the storage will be unmounted.

Because Windows privileged containers work very differently from Linux privileged containers, they will be referred to as HostProcess containers in Kubernetes specs and user-facing documentation. Hopefully this will encourage users to seek documentation to better understand the capabilities and behaviors of these privileged containers.

Privileged Container Diagram

Networking

  • The container will be in the host’s network namespace (default network compartment) so it will have access to all the host’s network interfaces and have the host's IP as well.

Resource Limits

  • Resource limits (disk, memory, cpu count) will be applied to the job and will be job wide. For example, if a memory limit of 10 MB is set for the job, the limit is reached when the memory allocations of every process in the job add up to more than 10 MB. This is the same behavior as other Windows container types. These limits would be specified the same way they are currently for whatever orchestrator/runtime is being used (see the snippet after this list).
  • Note: HostProcess containers will have access to the node's root filesystem. Disk limits and resource usage will only apply to the scratch volume provisioned for each HostProcess container.
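
For illustration, limits would be declared as for any other Windows container; per the job-wide semantics above, they bound the aggregate usage of all processes in the job (the values here are arbitrary examples):

containers:
- name: agent
  image: image1:latest
  resources:
    limits:
      cpu: "1"        # bounds cpu usage across all processes in the job
      memory: 256Mi   # bounds the sum of all memory allocations in the job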

Container Lifecycle

  • The container's lifecycle will be managed by the container runtime just like other Windows container types.

Container users

  • By default hostProcess containers can run as one of the following system accounts:

    • NT AUTHORITY\SYSTEM
    • NT AUTHORITY\Local service
    • NT AUTHORITY\NetworkService
  • Running privileged containers as non SYSTEM/admin accounts will be the primary way operators can restrict access to system resources (files, registry, named pipes, WMI, etc).

  • To run a hostProcess container as a non SYSTEM/admin account, a local users Group must first be created on the host. Permissions that restrict access to system resources should be configured to allow/deny access for the Group. When a new hostProcess container is created with the name of a local users Group set as the runAsUserName, a temporary user account will be created as a member of the specified group for the container to run as.
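
A minimal sketch of this flow, assuming a local users Group named hpc-agents has already been created on each node with scoped file/registry ACLs (the group name and image are hypothetical examples):

spec:
  hostNetwork: true
  securityContext:
    windowsOptions:
      hostProcess: true
      runAsUserName: hpc-agents   # the container runs as a temporary member of this local Group
  containers:
  - name: agent
    image: image1:latest
  nodeSelector:
    "kubernetes.io/os": windows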

Container Mounts

There will be two different behaviors for how volume mounts are configured in hostProcess containers.

  • Bind Mounts

    • With this approach, Windows' bind-filter driver will be used to create a view that merges the host's OS filesystem with container-local volumes.
    • When hostProcess containers are started a new volume will be created which contains the contents of the container image. This volume will be mounted to c:\hpc.
    • Additional volume mounts specified for hostProcess containers will be mounted at their requested locations and can be accessed the same way as volume mounts in Linux or regular Windows Server containers.
      • ex: a volume with a mountPath of /var/run/secrets/token will be mounted at c:\var\run\secrets\token in the container.
    • Volume mounts will only be visible to the containers they are mounted into.
    • The default working directory for hostProcess containers will also be set to c:\hpc.
    • If a volume is mounted over a path that already exists on the host, the contents of the host directory will be hidden and only the contents of the mounted volume will be visible to the hostProcess container. This is the same behavior as regular Windows Server containers.
      • A warning message will be written to the containerd logs if a volume is being mounted at a location that already exists on the host.
  • Symlinks

    • With this approach, container image contents and volume mounts will be mounted at predictable paths on the host's filesystem.
    • When hostProcess containers are started a new volume will be created which contains the contents of the container image. This volume will be mounted to c:\C\{container-id}.
    • Additional volume mounts specified for hostProcess containers will be mounted to c:\C\{container-id}\{mount-destination}
      • ex: a volume with a mountPath of /var/run/secrets/token for a container with id 1234 can be accessed at c:\C\1234\var\run\secrets\token
    • An environment variable $CONTAINER_SANDBOX_MOUNT_POINT will be set to the path where the container volume is mounted (c:\c\{container-id}) to access content.
      • This environment variable can be used inside the Pod manifest / command line / args for containers (a sketch follows this list).
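
A minimal sketch of the symlink behavior in a pod manifest, where setup.ps1 is a hypothetical script shipped in the container image. Because the image contents are mounted under the sandbox volume rather than merged with the host filesystem, paths are prefixed with the environment variable:

spec:
  hostNetwork: true
  securityContext:
    windowsOptions:
      hostProcess: true
  containers:
  - name: foo
    image: image1:latest
    command:
    - powershell.exe
    - -File
    # resolves to c:\C\{container-id}\setup.ps1 on the host
    - $env:CONTAINER_SANDBOX_MOUNT_POINT\setup.ps1
  nodeSelector:
    "kubernetes.io/os": windows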

A recording of the behavior differences from a SIG-Windows community meeting can be found here. Note - In the recording it was mentioned that this functionality might not be supported on WS2019. This functionality will be available in WS2019 but will require an OS patch (ETA: July 2022).

Additionally the following will be true for either volume mount behavior:

  • Named Pipe mounts will not be supported. Instead named pipes should be accessed via their path on the host (\\.\pipe\*). The following error will be returned if hostProcess containers attempt to use named pipe mounts - https://github.com/microsoft/hcsshim/blob/358f05d43d423310a265d006700ee81eb85725ed/internal/jobcontainers/mounts.go#L40.
  • Unix domain socket mounts will also not be supported for hostProcess containers. Unix domain sockets can be accessed via their paths on the host like named pipes.
  • Mounting directories from the host OS into hostProcess containers will work just like with normal containers but this is not recommended. Instead workloads should access the host OS's file-system as if they were not running in a container.
    • All other volume types supported for normal containers on Windows will work with hostProcess containers.
  • HostProcess containers will have full access to the host file-system (unless restricted by file-based ACLs and the run_as_username used to start the container).
  • There will be no chroot equivalent.

Compatibility

During the alpha/beta implementations of this feature only the symlink volume mount behavior was implemented. This implementation did unlock a lot of critical use cases for managing Windows nodes in Kubernetes clusters but did have some usability issues (such as https://pkg.go.dev/k8s.io/client-go/rest#InClusterConfig not working as expected).

The bind mount volume mount behavior gives full access to the host OS's filesystem (an explicit goal of this enhancement) and addresses the usability issues with the initial approach. This approach requires the use of Windows OS APIs that were not present in Windows Server 2019 during alpha/beta implementations of this feature. These APIs will be available in WS2019 beginning in July 2022 with the monthly OS security patches. Containerd v1.7+ will be required for this behavior.

  • On containerd v1.6 symlink volume mount behavior will always be used.
  • On containerd v1.7 bind volume mount behavior will always be used.
    • Backwards compatibility with volume mount paths has been added into containerd v1.7. This means that existing workloads that used $CONTAINER_SANDBOX_MOUNT_POINT to access volume mounts will work without updates.

Container Images

  • HostProcess containers can be built on top of existing Windows base images (nanoserver, servercore, etc).
  • A new Windows container base image has been introduced for hostProcess containers. More info is available at (https://github.com/microsoft/windows-host-process-containers-base-image)
    • Note: HostProcess containers do not inherit the same compatibility requirements as process isolated containers from an OS perspective, but individual container runtimes may have different image pulling/platform matching behavior.

Container Image Build/Definition

  • HostProcess container images based on nanoserver can be built with Docker.
  • HostProcess container images based on the new base image must be built with buildkit.
  • Only a subset of dockerfile operations will be supported (ADD, COPY, PATH, ENTRYPOINT, etc).
    • Note: The subset of dockerfile operations supported for HostProcess containers is very close to the subset of operations supported when building images for other OS's with buildkit (similar to how the pause image is built in kubernetes/kubernetes)
  • Documentation on building HostProcess containers will be added at either docs.microsoft.com or a new github repository.

CRI Implementation Details

We will need to add a hostProcess field to the runtime spec. We can model this after the Linux pod security context and container security context, where a boolean is set to true for privileged containers.

For Windows we are proposing the following updates to CRI-API:

Add WindowsPodSandboxConfig (and add it to PodSandboxConfig)

message WindowsPodSandboxConfig {
  WindowsSandboxSecurityContext security_context = 1;
}

Add WindowsSandboxSecurityContext:

message WindowsSandboxSecurityContext {
  string run_as_username = 1;
  string credential_spec = 2;
  bool host_process = 3;
}

Update WindowsContainerSecurityContext by adding host_process field:

message WindowsContainerSecurityContext {
  string run_as_username = 1;
  string credential_spec = 2;
  bool host_process = 3;
}

Note: For alpha, annotations on RunPodSandbox and CreateContainer CRI calls may be used until a version of containerd with Windows privileged container support is released.

Kubernetes API updates

WindowsSecurityContextOptions.HostProcess Flag

A new *bool field named hostProcess will be added to WindowsSecurityContextOptions.

On Windows, all containers in a pod must be privileged. Because of this behavior, and because WindowsSecurityContextOptions already exists on both PodSecurityContext and Container.SecurityContext, Windows containers will use this new field instead of re-using the existing privileged field, which only exists on SecurityContext. Additionally, the existing privileged field does not clearly describe what capabilities the container has (see kubernetes/kubernetes#44503). Documentation will be added to clearly describe what capabilities these new "hostProcess" containers have.

Current behavior applies PodSecurityContext.WindowsSecurityContextOptions settings to all Container.SecurityContext.WindowsSecurityContextOptions unless those settings are already specified on the container. To address this the following API validation will be added:

  • If PodSecurityContext.WindowsSecurityContextOptions.HostProcess is set to true then no container in the pod may set Container.SecurityContext.WindowsSecurityContextOptions.HostProcess = false
  • If PodSecurityContext.WindowsSecurityContextOptions.HostProcess is not set then all containers in a pod must set Container.SecurityContext.WindowsSecurityContextOptions.HostProcess = true
  • If PodSecurityContext.WindowsSecurityContextOptions.HostProcess = false then no containers may set Container.SecurityContext.WindowsSecurityContextOptions.HostProcess = true
  • hostNetwork = true must be set explicitly if the pod contains hostProcess containers (this value will not be inferred and/or defaulted). A spec that violates these rules is shown below.
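
For example, the following spec would be rejected by the first rule, since the pod-level hostProcess: true conflicts with container bar opting out:

spec:
  hostNetwork: true
  securityContext:
    windowsOptions:
      hostProcess: true
  containers:
  - name: foo
    image: image1:latest
  - name: bar
    image: image2:latest
    securityContext:
      windowsOptions:
        hostProcess: false   # invalid: containers cannot opt out of a hostProcess pod
  nodeSelector:
    "kubernetes.io/os": windows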

Additionally, kube-apiserver will disallow hostProcess containers from being scheduled if --allow-privileged=false is passed as an argument. See https://github.com/kubernetes/kubernetes/blob/release-1.20/pkg/apis/core/validation/validation.go#L5767-L5771 for reference.

Alternatives

Option 1: Re-use SecurityContext.Privileged field.

Re-using the existing SecurityContext.Privileged field was considered; here are the pros/cons that were weighed:

Pros

  • The field already exists and many policy tools already leverage it.

Cons

  • Privileged containers on Windows will operate very differently than privileged containers on Linux. Having a new field should help avoid confusion around the differences between the two.
  • The privileged field does not have clear meaning for Linux containers today (see comments above).
  • WindowsSecurityContextOptions.RunAsUserName will be the primary way of restricting access to host/node resources (see Container users). It is desirable that the RunAsUserName and HostProcess fields live on the same property.
  • API validation to ensure all containers are either privileged or not will be difficult because there is no way of definitively knowing that a pod is intended for a Windows node.

Host Network Mode

Host Network mode for privileged Windows containers will always be enabled, as the pod will automatically get the host IP.

Privileged Windows containers will be unable to align to pod namespaces due to limitations in the Windows OS. This functionality will likely be enabled in the future through a new KEP.

Because of this we will require that hostNetwork is set to true when scheduling privileged pods. This will allow existing policy tools to detect and act on privileged Windows containers without any updates. In the future if/when functionality is added to support joining privileged containers to pod networks this validation will be revisited.

Example deployment spec

Here are two examples of valid specs, each containing two privileged Windows containers:

spec:
  hostNetwork: true
  securityContext:
    windowsOptions:
      hostProcess: true
  containers:
  - name: foo
    image: image1:latest
  - name: bar
    image: image2:latest
  nodeSelector:
    "kubernetes.io/os": windows
spec:
  hostNetwork: true
  containers:
  - name: foo
    image: image1:latest
    securityContext:
      windowsOptions:
        hostProcess: true
  - name: bar
    image: image2:latest
    securityContext:
      windowsOptions:
        hostProcess: true
  nodeSelector:
    "kubernetes.io/os": windows

Kubelet Implementation Details

Kubelet will pass the privileged flag from WindowsSecurityContextOptions to the appropriate CRI layer calls.

Note: For alpha, the kubelet may add well-known annotations to CRI calls if privileged flags are set.

Add functionality to kuberuntime_sandbox to perform the following extra validation for Windows. These checks will ensure privileged pods work correctly on Windows if they are not validated by the apiserver:

  • Ensure all containers in a pod are privileged, if any are.
  • Ensure hostNetwork = true is set if the pod contains privileged containers.

CRI Support Only

There are no plans to update Docker and/or dockershim to support privileged containers due to requirements on HCSv2. Currently containerd is the only other container runtime with a Windows implementation, so containerd will be required.

Validation will be added in the kubelet to fail to schedule a pod if the node is configured to use dockershim and the pod contains privileged Windows containers.

Feature Gates

Privileged container functionality on Windows will be gated behind a new WindowsHostProcessContainers feature gate.

https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/#feature-stages
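
As a sketch, while the feature is in development the gate could be enabled on Windows nodes through the kubelet configuration file (it must also be enabled on kube-apiserver, e.g. via --feature-gates=WindowsHostProcessContainers=true):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  WindowsHostProcessContainers: true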

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Alpha

  • Unit tests around validation logic for new API fields.
  • Add e2e test validating basic privileged container functionality (pod starts and run in a privileged context on the node)
  • Update the Pod Security Standards doc to disallow hostProcess containers in the baseline/default and restricted policies.

Beta

  • Validate running kube-proxy as a daemon set
  • Validate CSI-proxy running as a daemon set
  • Validate running a CNI implementation as a daemon set
  • Validate behaviors of various volume mount types as described in Container Mounts with e2e tests
  • Add e2e tests to test different ways to construct paths for container command, args, and workingDir fields for both hostProcess and non-hostProcess containers. These tests will include constructing paths with and without $CONTAINER_SANDBOX_MOUNT_POINT set and with different combinations of forward and backward slashes.

Graduation

  • Add e2e tests to validate running hostProcess containers as non SYSTEM/admin accounts
  • Update e2e tests for new volume mount behavior as described in Container Mounts

Prerequisite testing updates

No additional tests have been identified that would be required prior to implementing this enhancement.

Unit tests

  • k8s.io/kubernetes/pkg/api/pod: 2022-05-27 - 66.7%
  • k8s.io/kubernetes/pkg/apis/core: 2022-05-27 - 78.9%
  • k8s.io/kubernetes/pkg/kubelet/container: 2022-05-27 - 52.1%
  • k8s.io/kubernetes/pkg/kubelet/kuberuntime: 2022-05-27 - 66.7%
  • k8s.io/kubernetes/pkg/securitycontext: 2022-05-27 - 66.8%
  • k8s.io/cri-api/pkg/apis/runtime/v1: 2022-05-27 - No unit test coverage (protobuf definition)
  • k8s.io/test/e2e/windows: 2022-05-27 - No unit test coverage (this package contains e2e test code)

Integration tests

It is not currently possible to test Windows-specific code through the existing integration test frameworks. For this enhancement unit and e2e tests will be used for validation.

e2e tests

  • [sig-windows] [Feature:WindowsHostProcessContainers] [MinimumKubeletVersion:1.22] HostProcess containers should run as a process on the host/node: source
  • [sig-windows] [Feature:WindowsHostProcessContainers] [MinimumKubeletVersion:1.22] HostProcess containers should support init containers: source
  • [sig-windows] [Feature:WindowsHostProcessContainers] [MinimumKubeletVersion:1.22] HostProcess containers container command path validation: source
  • [sig-windows] [Feature:WindowsHostProcessContainers] [MinimumKubeletVersion:1.22] HostProcess containers metrics should report count of started and failed to start HostProcess containers: source
  • [sig-windows] [Feature:WindowsHostProcessContainers] [MinimumKubeletVersion:1.22] HostProcess containers should support various volume mount types: source

TestGrid job for Kubernetes@master - (https://testgrid.k8s.io/sig-windows-signal#capz-windows-containerd-master&include-filter-by-regex=WindowsHostProcessContainers)

k8s-triage link - (https://storage.googleapis.com/k8s-triage/index.html?job=capz&test=Feature%3AWindowsHostProcessContainers)

Graduation Criteria

Alpha plan

  • Version of containerd: Target v1.5
  • Version of Kubernetes: Target 1.22
  • OS support: Windows 2019 LTSC and all future versions of Windows Server
  • Alpha Feature Gate for passing privileged flag or annotations to CRI calls.

Graduation to Beta

  • Kubernetes Target 1.23
  • Set WindowsHostProcessContainers feature gate to beta
  • Go through PSP Linux tests (e2e: validation & conformance) and make them relevant for Windows (which apply, which don't, and where we need to write new tests).
  • Provide guidance similar to Pod Security Standards for Windows privileged containers.
  • CRI Support for HostProcess containers.
  • OS support: Windows 2019 LTSC and all future versions of Windows Server.
  • Documentation for HostProcess containers on https://kubernetes.io/.
    • Includes clarification around disk limits mentioned in Resource Limits.
    • Documentation on docs.microsoft.com for building HostProcess container images.
  • Update validation logic for HostProcess containers in api-server to handle ephemeral containers
    • Note: If an ephemeral container is also a HostProcess container then all containers in the pod must also be HostProcess containers (and vice versa).

Graduation to GA:

  • Add documentation for running as a non-SYSTEM/admin account to k8s.io
  • Update documentation on how volume mounts are set up for hostProcess containers on k8s.io
  • Set WindowsHostProcessContainers feature gate to GA
  • Provide reference images/workloads using the bind volume mounting behavior in Cluster-API-Provider-azure (which is used to run the majority of Windows e2e test passes).
  • Migrate all deployments using hostProcess containers in the sig-windows-tools repo to be compatible with the bind volume behavior.

Upgrade / Downgrade Strategy

  • Windows: This implementation requires no backports for OS components.
  • Kubernetes: No changes required outside of ensuring feature gates are set while feature is in development.
  • Containerd: Must run a version of containerd with privileged container support (targeting v1.5+).

Version Skew Strategy

N/A

Production Readiness Review Questionnaire

Feature Enablement and Rollback

This section must be completed when targeting alpha to a release.

  • How can this feature be enabled / disabled in a live cluster?

    • Feature gate (also fill in values in kep.yaml)
      • Feature gate name: WindowsHostProcessContainers
      • Components depending on the feature gate: Kubelet, kube-apiserver
  • Does enabling the feature change any default behavior? No

  • Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? This feature can be disabled. If this feature flag is disabled in kube-apiserver then new pods which try to schedule hostProcess containers will be rejected by kube-apiserver. If this flag is disabled in the kubelet then new hostProcess containers will not be started and an appropriate event will be emitted.

  • What happens if we reenable the feature if it was previously rolled back? Newly created privileged Windows containers will run as expected.

  • Are there any tests for feature enablement/disablement? No

Rollout, Upgrade and Rollback Planning

This section must be completed when targeting beta graduation to a release.

  • How can a rollout fail? Can it impact already running workloads? Try to be as paranoid as possible - e.g., what if some components will restart mid-rollout?

  • What specific metrics should inform a rollback?

  • Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? Describe manual testing that was done and the outcomes. Longer term, we may want to require automated upgrade/rollback tests, but we are missing a bunch of machinery and tooling and can't do that now.

  • Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? Even if applying deprecation policies, they may still surprise some users.

Monitoring Requirements

This section must be completed when targeting beta graduation to a release.

  • How can an operator determine if the feature is in use by workloads?

    Kubelet metrics will be updated to report the number of HostProcess containers started and the number of HostProcess containers that failed to start.

    TBD: Confirm the best way to accomplish this is to add new values/metric labels for StartedContainersTotal and StartedContainersError counters. Otherwise we could add new counters.

  • What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

    • Metrics
      • Metric name: started_host_process_containers_total - reports the total number of host-process containers started on a given node
      • Metric name: started_host_process_containers_errors_total - reports the total number of host-process containers that have failed to start on a given node.
      • [Optional] Aggregation method:
      • Components exposing the metric: Kubelet
      • Notes: Both metrics were added in v1.23 and are validated with e2e tests
    • Other (treat as last resort)
      • Details:
  • What are the reasonable SLOs (Service Level Objectives) for the above SLIs? The same SLOs for starting/stopping non-hostprocess containers would apply here.

  • Are there any missing metrics that would be useful to have to improve observability of this feature? N/A

Dependencies

This section must be completed when targeting beta graduation to a release.

  • Does this feature depend on any specific services running in the cluster?

    • [ContainerD]
      • Usage description:
        • HostProcess containers support will not be added to dockershim.
        • Containerd v1.6.x is required for symlink volume mount behavior
        • Containerd v1.7+ is required for bind volume mount behavior.
        • Impact of its outage on the feature: Containers will fail to start.
        • Impact of its degraded performance or high-error rates on the feature: Containers may behave unexpectedly and the node may go into the NotReady state.

Scalability

For alpha, this section is encouraged: reviewers should consider these questions and attempt to answer them.

For beta, this section is required: reviewers must answer these questions.

For GA, this section is required: approvers should be able to confirm the previous answers based on experience in the field.

  • Will enabling / using this feature result in any new API calls? No

  • Will enabling / using this feature result in introducing new API types? No

  • Will enabling / using this feature result in any new calls to the cloud provider? No

  • Will enabling / using this feature result in increasing size or count of the existing API objects? A new field is being added so API object size will grow slightly larger.

  • Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? No

  • Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? No - HostProcess containers will honor limits/reserves specified in the specs and will count against node quota just like unprivileged containers.

Troubleshooting

The Troubleshooting section currently serves the Playbook role. We may consider splitting it into a dedicated Playbook document (potentially with some monitoring details). For now, we leave it here.

This section must be completed when targeting beta graduation to a release.

  • How does this feature react if the API server and/or etcd is unavailable? This feature will not change any behaviors around Pod scheduling if API server and/or etcd is unavailable.

  • What are other known failure modes? For each of them, fill in the following information by copying the below template:

- [InClusterConfig() fails inside HostProcessContainers]
  - Causes: 
    - Due to how volume mounts for HostProcess containers are configured in containerd v1.6.X, service account tokens in the container are not present at the location expected by the golang API rest client.
  - Mitigations: 
    - If containers are using `symlink` mount behavior as described at [Compatibility](#compatibility) you can construct a kubeconfig
      file with the containers assigned service account token and use that to authenticate.
      Example: https://github.com/kubernetes-sigs/sig-windows-tools/blob/fbe00b42e2a5cca06bc182e1b6ee579bd65ed1b5/hostprocess/calico/install/calico-install.ps1#L8-L11
    - Switch to a container runtime/version that supports `bind` mount behavior as described at [Compatibility](#compatibility)
  - Diagnostics:
    - Calls to rest.InClusterConfig will fail in workloads.
  - Testing:
    - No - known limitation

- [Containers running as non-HostProcessContainers]
  - Causes:
    - Container runtime does not support HostProcessContainers
    - Bug in kubelet in some v1.23/v1.24 patch versions [#110140](https://github.com/kubernetes/kubernetes/pull/110140)
  - Detection:
    - Varies based on cause
    - Likely result will be an error in the app/workload running inside containers
  - Mitigations: 
    - If error is caused by [#110140](https://github.com/kubernetes/kubernetes/pull/110140) then either
      specify `PodSecurityContext.WindowsSecurityContextOptions.HostProcess=true` (instead of setting `HostProcess=true` on each container's `SecurityContext.WindowsSecurityContextOptions`) or upgrade the kubelet to a version with a fix for the issue.
    - Provision nodes with containerd v1.6+
  - Diagnostics:
    - Exec into a container and run `whoami` and ensure running user is as expected (ex: not ContainerUser or ContainerAdministrator for HostProcessContainers)
    - Run `kubectl get nodes -o wide` to check the container runtime and version for nodes
    - Examine container logs
    - On the node run `crictl inspectp [podid]` and ensure pod has "microsoft.com/hostprocess-container": "true" in annotation list (to detect [#110140](https://github.com/kubernetes/kubernetes/pull/110140))
    - Inspect container `trace` log messages and ensure `hostProcess=true` is set for `RunPodSandbox` calls. 
  - Testing: 
    - Yes - tests have been added to [#110140](https://github.com/kubernetes/kubernetes/pull/110140) to catch issues caused by this bug.

- [HostProcess containers fail to start with `failed to create user process token: failed to logon user: Access is denied.: unknown`]
  - Causes: 
    - Containerd is running as a user account.
      On Windows user accounts (even Administrator accounts) cannot create logon tokens for system (which can be used by HostProcessContainers).
  - Detection:
    - Metrics: **started_host_process_containers_errors_total** count increasing
    - Events: ContainerCreate failure events with reason of `failed to create user process token: failed to logon user: Access is denied.: unknown`
  - Mitigations: 
    - Run containerd as `LocalSystem` (default) or `LocalService` service accounts
  - Diagnostics: 
    - On the node run `Get-Process containerd -IncludeUserName` to see which account containerd is running as.
  - Testing: 
    - No - It is not feasible to restart the container runtime as a different user during test passes.

  • What steps should be taken if SLOs are not being met to determine the problem?

Kubelet and/or containerd logs will need to be inspected if problems are encountered creating HostProcess containers on Windows nodes.

Implementation History

  • 2020-09-11: Issue #1981 created.
  • 2020-12-17: Initial KEP draft merged - #2037.
  • 2021-02-17: KEP approved for alpha release - #2288.
  • 2021-05-20: Alpha implementation PR merged - kubernetes/kubernetes#99576.
  • 2021-08-05: K8s 1.22 released with alpha support for WindowsHostProcessContainers feature.
  • 2021-08-21: HostProcessContainers (via CRI) support added to containerd - containerd/containerd#5131.
  • 2021-12-07: K8s 1.23 released with beta support for WindowsHostProcessContainers feature.
  • 2022-02-15: Containerd 1.6.0 released with support for HostProcessContainers.
  • 2022-12-08: K8s 1.26 released with WindowsHostProcessContainers feature stable.

Drawbacks

Alternatives

  • Use containerd RuntimeHandlers and K8s RuntimeClasses - Runtime handlers were used in the prototype. Adding the field to the CRI instead allows the kubelet to have more control over the security context and the fields that it allows through, enabling additional checks (such as runAsNonRoot).

  • Use annotations on CRI to pass the privileged flag to containerd - Adding the field to the CRI spec allows the existing CRI calls to work as is. The resulting code is cleaner and doesn’t rely on magic strings. There is currently a PR for adding the security fields to the CRI API, adding sandbox-level security support for Windows containers. RunAsUsername will be required for privileged containers to make sure every container (including pause) runs as the correct user to limit access to the file system.

Open Questions

  • What will be the future of CSI-proxy and other plug-ins that will be impacted?
    • CSI-proxy and HNS-proxy are likely to be impacted
  • Container base image support
    • Is “from scratch” required
    • Would a slimmer “privileged base image” be more desirable than using standard servercore
  • Container image build differences with traditional Windows Server and impacts on image use and distribution
  • Should PSP be updated with latest checks or should out-of-tree enforcement tool be used?
    • PSP will be deprecated, and documentation and guidance should be produced for Pod Security Standards. Implementations in out-of-tree enforcement tools should be favored, and a POC/implementation in Gatekeeper would be a great way to demonstrate this.
  • Scheduling checks
  • Privileged containers in the same network compartment as a non-privileged pod; privileged init containers may otherwise still be able to access the host network