Design proposal for Windows Container Configuration in CRI #1510
Merged (6 commits) on Feb 2, 2018
File: `contributors/design-proposals/node/cri-windows.md` (91 additions)
# CRI: Windows Container Configuration

**Authors**: Jiangtian Li (@JiangtianLi), Pengfei Ni (@feiskyer), Patrick Lang (@PatrickLang)

**Status**: Proposed

## Background
Container Runtime Interface (CRI) defines [APIs and configuration types](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/cri/v1alpha1/runtime/api.proto) for kubelet to integrate various container runtimes. The Open Container Initiative (OCI) Runtime Specification defines [platform-specific configuration](https://github.com/opencontainers/runtime-spec/blob/master/config.md#platform-specific-configuration), including Linux, Windows, and Solaris. Currently CRI only supports Linux container configuration. This proposal brings the memory and CPU resource restrictions already specified in OCI for Windows to CRI.

The Linux and Windows schedulers differ in design and in the units they use, but both can limit the resource consumption of individual containers.

For example, on the Linux platform, CPU quota and CPU period represent the CPU resources allocated to tasks in a cgroup and are enforced by the [Linux kernel CFS scheduler](https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt). Containers created in the cgroup are subject to those limits, and additional processes forked or created inherit the same cgroup.
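For comparison with the Windows mapping proposed below, here is a minimal sketch of the Linux-side conversion from Kubernetes milliCPU to a CFS quota/period pair. It follows kubelet's convention of a 100ms enforcement period; the helper name `milliCPUToQuota` and the constants are illustrative, not verbatim kubelet source:

```go
package main

import "fmt"

const (
    quotaPeriod = 100000 // 100ms CFS period, in microseconds
    minQuota    = 1000   // smallest quota CFS accepts, in microseconds
)

// milliCPUToQuota converts a milliCPU limit into a CFS quota over the
// standard period, mirroring kubelet's convention (sketch, not kubelet source).
func milliCPUToQuota(milliCPU int64) (quota int64, period int64) {
    if milliCPU == 0 {
        return 0, quotaPeriod // no CPU limit requested
    }
    quota = (milliCPU * quotaPeriod) / 1000 // e.g. 500m -> 50000us of every 100000us
    if quota < minQuota {
        quota = minQuota
    }
    return quota, quotaPeriod
}

func main() {
    q, p := milliCPUToQuota(500) // a 500m CPU limit
    fmt.Printf("cpu.cfs_quota_us=%d cpu.cfs_period_us=%d\n", q, p)
}
```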

On the Windows platform, processes may be assigned to a job object, which can have [CPU rate control information](https://msdn.microsoft.com/en-us/library/windows/desktop/hh448384(v=vs.85).aspx), memory, and storage resource constraints enforced by the Windows kernel scheduler. A job object is created at container creation time, so all processes in the container are aggregated and bound to the resource constraints.
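For reference, the per-container knobs that hcsshim (referenced later in this proposal) exposes over these job-object controls look roughly as follows. This is an abridged sketch; exact field names and types should be verified against the hcsshim version linked in the mapping section:

```go
// Abridged sketch of hcsshim's ContainerConfig resource fields
// (https://github.com/Microsoft/hcsshim/blob/master/interface.go).
// Field names and types are as understood at the time of writing.
type ContainerConfig struct {
    ProcessorCount    uint32 // number of processors to assign to the container
    ProcessorWeight   uint64 // relative weight vs. other containers, 1-10000
    ProcessorMaximum  int64  // portion of processor cycles, percentage times 100
    MemoryMaximumInMB int64  // memory limit, in megabytes
    // Storage IOPS and bandwidth limits omitted; out of scope for this proposal.
}
```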

## Umbrella Issue
[#56734](https://github.com/kubernetes/kubernetes/issues/56734)

## Motivation
The goal is to start filling the gap in platform support in CRI, specifically for the Windows platform. For example, currently in dockershim, Windows containers are scheduled with the default resource constraints and do not respect the resource requests and limits specified in the pod spec. With this proposal, Windows containers will be able to leverage the pod spec and CRI to allocate compute resources and enforce restrictions.

## Proposed design

The design is fairly straightforward: align the CRI container configuration for Windows with the [OCI runtime specification](https://github.com/opencontainers/runtime-spec/blob/master/specs-go/config.go):
```
// WindowsResources has container runtime resource constraints for containers running on Windows.
type WindowsResources struct {
    // Memory restriction configuration.
    Memory *WindowsMemoryResources `json:"memory,omitempty"`
    // CPU resource restriction configuration.
    CPU *WindowsCPUResources `json:"cpu,omitempty"`
}
```
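The nested `WindowsMemoryResources` and `WindowsCPUResources` types referenced above are defined in the same runtime-spec file; they are reproduced here for context (copied from `specs-go/config.go` as of the time of writing, see the spec for the authoritative version):

```go
// WindowsMemoryResources contains memory resource management settings.
type WindowsMemoryResources struct {
    // Memory limit in bytes.
    Limit *uint64 `json:"limit,omitempty"`
}

// WindowsCPUResources contains CPU resource management settings.
type WindowsCPUResources struct {
    // Number of CPUs available to the container.
    Count *uint64 `json:"count,omitempty"`
    // CPU shares (relative weight to other containers with cpu shares).
    Shares *uint16 `json:"shares,omitempty"`
    // Specifies the portion of processor cycles that this container can use as a percentage times 100.
    Maximum *uint16 `json:"maximum,omitempty"`
}
```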

Since storage and IOPS limits for Windows containers are optional, they can be postponed to keep alignment with the Linux container configuration in CRI. We therefore propose adding the following to CRI for Windows containers (PR [here](https://github.com/kubernetes/kubernetes/pull/57076)).

### API definition
```
// WindowsContainerConfig contains platform-specific configuration for
// Windows-based containers.
message WindowsContainerConfig {
    // Resources specification for the container.
    WindowsContainerResources resources = 1;
}

// WindowsContainerResources specifies Windows specific configuration for
// resources.
message WindowsContainerResources {
    // CPU shares (relative weight vs. other containers). Default: 0 (not specified).
    int64 cpu_shares = 1;
    // Number of CPUs available to the container. Default: 0 (not specified).
    int64 cpu_count = 2;
    // Specifies the portion of processor cycles that this container can use as a percentage times 100.
    int64 cpu_maximum = 3;
    // Memory limit in bytes. Default: 0 (not specified).
    int64 memory_limit_in_bytes = 4;
}
```

**Review discussion on `WindowsContainerResources`:**

> **Member:** I must be missing something here. What is the difference between this `WindowsContainerResources` and the `WindowsResources` API above? From the explanation given below, are the values defined here the ones passed to the Windows runtime?

> **Author:** `WindowsResources` is from the OCI spec (https://github.com/opencontainers/runtime-spec/blob/master/specs-go/config.go). It is essentially the same as `WindowsContainerResources` in CRI here.

> **Contributor:** Could you give an example of how Kubernetes cpu/memory requests/limits would map to these fields? Also, are there any significant differences between Linux and Windows in these configurations and in how the kernel reacts to them?

> **@feiskyer (Jan 24, 2018):** Example of how Kubernetes cpu/memory requests/limits would map:
>
> ```
> // Refer to https://github.com/moby/moby/blob/master/daemon/oci_windows.go#L265
> // and https://github.com/Microsoft/hcsshim/blob/master/interface.go#L77.
> CpuCount = int((container.Resources.Limits.Cpu().MilliValue() + 1000)/1000) // 0 if not set
> // milliCPUToShares converts milliCPU to 0-10000
> CpuShares = milliCPUToShares(container.Resources.Limits.Cpu().MilliValue())
> if CpuShares == 0 {
>     CpuShares = milliCPUToShares(container.Resources.Requests.Cpu().MilliValue())
> }
> CpuMaximum = container.Resources.Limits.Cpu().MilliValue()/sysinfo.NumCPU()/1000*10000
> if isHyperV {
>     CpuMaximum = container.Resources.Limits.Cpu().MilliValue()/CpuCount/1000*10000
> }
> MemoryLimitInBytes = container.Resources.Limits.Memory().Value()
> ```
>
> Those CPU parameters are different from Linux, and there is no CFS on Windows. [This MSDN documentation](https://docs.microsoft.com/en-us/virtualization/windowscontainers/manage-containers/resource-controls) explains how cpu/memory resources are controlled for Windows containers.

> **Contributor:** Also to clarify: `CpuMaximum` is a limit that can't be exceeded. If the node is overprovisioned and under contention from multiple containers operating below their limits, cycles will be weighted based on shares. If the system isn't overprovisioned, shares have no effect.

> **Contributor:** I see, thanks for explaining. Those are the things that I wanted to see in this proposal. How does `CpuCount` interact with `CpuMaximum`?

> **Author:** Thanks for reviewing. I'll add this to the proposal. From https://github.com/docker/docker-ce/blob/master/components/cli/man/docker-run.1.md: on Windows Server containers, the processor resource controls are mutually exclusive; the order of precedence is `CPUCount` first, then `CPUShares`, and `CPUPercent` last. `CPUMaximum` is derived from `CPUCount` for Hyper-V containers, and from `sysinfo.NumCPU()` for process containers.

> **@darstahl (Jan 25, 2018):** Note that for Hyper-V containers these controls are not mutually exclusive. `CpuCount` controls the number of virtual processors the container has access to (and by extension, the host's logical processors available for execution). `CpuMaximum` applies to each of these virtual processors independently; for example, `CpuCount=2, CpuMaximum=5000` (50%) would limit each CPU to 50%. A single-threaded application that uses as much CPU as available would, under this configuration, be able to use at most 50% of a single host core.
>
> As mentioned above, for Windows Server containers (process isolation) `CpuCount` takes precedence (the CPU limit is simulated from the provided value relative to the number of host CPUs using `JOB_OBJECT_CPU_RATE_CONTROL_HARD_CAP`), and `CpuMaximum` is ignored.

### Mapping from Kubernetes API ResourceRequirements to Windows Container Resources
[Kubernetes API ResourceRequirements](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.9/#resourcerequirements-v1-core) contains two fields: limits and requests. Limits describes the maximum amount of compute resources allowed. Requests describes the minimum amount of compute resources required. If Requests is omitted for a container, it defaults to Limits if that is explicitly specified, otherwise to an implementation-defined value.

Windows Container Resources defines [resource controls for Windows containers](https://docs.microsoft.com/en-us/virtualization/windowscontainers/manage-containers/resource-controls). Note that resource control differs between Hyper-V containers (Hyper-V isolation) and Windows Server containers (process isolation). Windows containers use job objects to group and track the processes associated with each container; resource controls are implemented on the parent job object associated with the container. In the case of Hyper-V isolation, resource controls are automatically applied both to the virtual machine and to the job object of the container running inside the virtual machine. This ensures that even if a process running in the container bypassed or escaped the job object's controls, the virtual machine would keep it from exceeding the defined resource controls.

[CPUCount](https://github.com/Microsoft/hcsshim/blob/master/interface.go#L76) specifies the number of processors to assign to the container. [CPUShares](https://github.com/Microsoft/hcsshim/blob/master/interface.go#L77) specifies the relative weight compared to other containers with CPU shares; the range is 1 to 10000. [CPUMaximum (also called CPUPercent)](https://github.com/Microsoft/hcsshim/blob/master/interface.go#L78) specifies the portion of processor cycles that the container can use, expressed as a percentage times 100; the range is 1 to 10000. On Windows Server containers, the processor resource controls are mutually exclusive; the order of precedence is CPUCount first, then CPUShares, and CPUPercent last (refer to the [Docker User Manuals](https://github.com/docker/docker-ce/blob/master/components/cli/man/docker-run.1.md)). On Hyper-V containers, CPUMaximum applies to each virtual processor independently; for example, CPUCount=2, CPUMaximum=5000 (50%) would limit each CPU to 50%.
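As a worked example (evaluating the formulas in the table below as real-number arithmetic; the 4-CPU host is an assumed figure): a container with a CPU limit of 1500m on a 4-CPU host maps, as a Windows Server container, to cpu_maximum = 1500 × 10000 / (4 × 1000) = 3750, i.e. a hard cap of 37.5% of total host cycles (1.5 of 4 CPUs). As a Hyper-V container, it maps to cpu_count = (1500 + 1000) / 1000 = 2 virtual processors with cpu_maximum = 1500 × 10000 / (2 × 1000) = 7500, i.e. each virtual processor capped at 75%, again totaling 1.5 CPUs.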

The mapping of resource limits/requests to Windows Container Resources is in the following table (refer to [Docker's conversion to OCI spec](https://github.com/moby/moby/blob/master/daemon/oci_windows.go#L265-L289)):

| Field | Windows Server Container | Hyper-V Container |
| --------------------- | ------------------------ | ----------------- |
| cpu_count | `cpu_count = int((container.Resources.Limits.Cpu().MilliValue() + 1000)/1000)` <br> `// 0 if not set` | Same |
| cpu_shares | `// milliCPUToShares converts milliCPU to 0-10000` <br> `cpu_shares = milliCPUToShares(container.Resources.Limits.Cpu().MilliValue())` <br> `if cpu_shares == 0 {` <br>&nbsp;&nbsp;&nbsp;&nbsp;`cpu_shares = milliCPUToShares(container.Resources.Requests.Cpu().MilliValue())` <br> `}` | Same |
| cpu_maximum | `container.Resources.Limits.Cpu().MilliValue()/sysinfo.NumCPU()/1000*10000` | `container.Resources.Limits.Cpu().MilliValue()/cpu_count/1000*10000` |
| memory_limit_in_bytes | `container.Resources.Limits.Memory().Value()` | Same |
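Putting the table together, here is a minimal, self-contained Go sketch of the conversion. It is illustrative only: `milliCPUToShares` is stubbed as a simple clamp, `hostCPUs` stands in for `sysinfo.NumCPU()`, and multiplications are ordered before divisions so integer arithmetic does not truncate to zero (the table's formulas read as real-number math):

```go
package main

import "fmt"

// WindowsContainerResources is a local stand-in for the generated CRI type.
type WindowsContainerResources struct {
    CpuShares          int64
    CpuCount           int64
    CpuMaximum         int64
    MemoryLimitInBytes int64
}

// milliCPUToShares maps milliCPU onto the 1-10000 Windows share range.
// Shown as a trivial clamp so the sketch is self-contained; the real
// helper may scale differently.
func milliCPUToShares(milliCPU int64) int64 {
    if milliCPU > 10000 {
        return 10000
    }
    return milliCPU
}

// toWindowsResources applies the mapping from the table above.
func toWindowsResources(limitMilliCPU, requestMilliCPU, limitMemBytes, hostCPUs int64, isHyperV bool) *WindowsContainerResources {
    r := &WindowsContainerResources{MemoryLimitInBytes: limitMemBytes}

    if limitMilliCPU > 0 {
        r.CpuCount = (limitMilliCPU + 1000) / 1000 // 0 if no CPU limit is set
    }

    r.CpuShares = milliCPUToShares(limitMilliCPU)
    if r.CpuShares == 0 {
        r.CpuShares = milliCPUToShares(requestMilliCPU)
    }

    if isHyperV && r.CpuCount > 0 {
        r.CpuMaximum = limitMilliCPU * 10000 / (r.CpuCount * 1000) // per virtual processor
    } else if hostCPUs > 0 {
        r.CpuMaximum = limitMilliCPU * 10000 / (hostCPUs * 1000) // share of all host cycles
    }
    return r
}

func main() {
    // A 2000m CPU limit and 1GiB memory limit on a 4-CPU host.
    fmt.Printf("%+v\n", *toWindowsResources(2000, 0, 1<<30, 4, false))
    // {CpuShares:2000 CpuCount:3 CpuMaximum:5000 MemoryLimitInBytes:1073741824}
    fmt.Printf("%+v\n", *toWindowsResources(2000, 0, 1<<30, 4, true))
    // {CpuShares:2000 CpuCount:3 CpuMaximum:6666 MemoryLimitInBytes:1073741824}
}
```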


## Implementation
The implementation will mainly be in two parts:
* In kuberuntime, where the configuration is generated from the pod spec.
* In the container runtime, where the configuration is passed on to the runtime's container configuration. For example, in dockershim it is passed to [HostConfig](https://github.com/moby/moby/blob/master/api/types/container/host_config.go).

In both parts, we need to implement:
* Fork the Windows code path from the Linux one.
* Convert from Resources.Requests and Resources.Limits to the Windows configuration in CRI, and from the Windows configuration in CRI to the runtime's container configuration (a sketch of the dockershim side follows below).
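For the dockershim side, here is a minimal sketch of that second conversion, reusing the `WindowsContainerResources` stand-in from the sketch above. The Docker field names are from moby's `api/types/container`; the `/100` conversion reflects an assumption that CRI's `cpu_maximum` (percentage × 100) must be scaled down to Docker's 0-100 `CPUPercent`, to be confirmed during implementation:

```go
import (
    dockercontainer "github.com/docker/docker/api/types/container"
)

// applyWindowsResources copies CRI Windows resource limits into Docker's
// HostConfig. Note this replaces hc.Resources wholesale; a real
// implementation would set fields individually to preserve other values.
func applyWindowsResources(hc *dockercontainer.HostConfig, r *WindowsContainerResources) {
    if r == nil {
        return
    }
    hc.Resources = dockercontainer.Resources{
        Memory:     r.MemoryLimitInBytes, // bytes
        CPUShares:  r.CpuShares,          // relative weight
        CPUCount:   r.CpuCount,           // Windows only
        CPUPercent: r.CpuMaximum / 100,   // Windows only; see unit caveat above
    }
}
```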

To implement resource controls for Windows containers, refer to [this MSDN documentation](https://docs.microsoft.com/en-us/virtualization/windowscontainers/manage-containers/resource-controls) and [Docker's conversion to OCI spec](https://github.com/moby/moby/blob/master/daemon/oci_windows.go).

## Future work

Windows [storage resource controls](https://github.com/opencontainers/runtime-spec/blob/master/config-windows.md#storage), security context (analogous to SELinux, AppArmor, readOnlyRootFilesystem, etc.), and pod resource controls (analogous to LinuxPodSandboxConfig.cgroup_parent, already in CRI) are under investigation and will be handled in separate proposals. They will supplement, not replace, the fields in `WindowsContainerResources` from this proposal.