Proposal: Intel RDT/CAT support for OCI/runc and Docker #433

Closed
xiaochenshen opened this issue Dec 11, 2015 · 25 comments

Comments

@xiaochenshen
Contributor

xiaochenshen commented Dec 11, 2015

The descriptions of the Intel RDT/CAT features, use cases, and Linux kernel interface below are
heavily based on the Intel RDT documentation in the Linux kernel:

https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt

Thanks to the authors of the kernel patches:
* Vikas Shivappa <vikas.shivappa@linux.intel.com>
* Fenghua Yu <fenghua.yu@intel.com>
* Tony Luck <tony.luck@intel.com>

Status: Intel RDT/CAT support for OCI and Docker software stack

Intel RDT/CAT support in OCI (merged PRs):

1. Intel RDT/CAT support in OCI/runtime-spec

opencontainers/runtime-spec#630
opencontainers/runtime-spec#787
opencontainers/runtime-spec#889
opencontainers/runtime-spec#988

2. Intel RDT/CAT support in OCI/runc

#1279
#1589
#1590
#1615
#1894
#1913
#1930
#1955
#2042

TODO list - Intel RDT/CAT support in Docker:

3. Intel RDT/CAT support in containerd

4. Intel RDT/CAT support in Docker Engine (moby/moby)

5. Intel RDT/CAT support in Docker CLI


What is Intel RDT and CAT:

Intel Cache Allocation Technology (CAT) is a sub-feature of Resource Director Technology (RDT). Currently L3 Cache is the only resource that is supported in RDT.

Cache Allocation Technology offers L3 cache Quality of Service (QoS) capability. It provides a way for software (OS/VMM/container) to restrict cache allocation to a defined 'subset' of the cache, which may overlap with other 'subsets'. This feature is used when allocating a line in the cache, i.e. when pulling new data into the cache. The hardware is programmed via the PQR MSRs.

The different cache subsets are identified by a CLOS (class of service) identifier, and each CLOS has a CBM (cache bit mask). The CBM is a contiguous set of bits which defines the amount of cache resource available to each 'subset'.

More information can be found in section 17.17 of the Intel Software Developer's Manual and on the Intel RDT homepage.

Supported Intel Xeon CPU SKUs:

  • Intel(R) Xeon(R) processor E5 v4 and newer generations
  • Intel(R) Xeon(R) processor D
  • Intel(R) Xeon(R) processor E5 v3 (limited support)

To check whether cache allocation is supported and enabled:
$ cat /proc/cpuinfo
Check whether the output contains the 'rdt_a' and 'cat_l3' flags.
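As a sketch, the flag check above can be automated. The helper below (hasCatL3 is a hypothetical name, not part of runc) scans a cpuinfo "flags" line for both flags:

```go
package main

import (
	"fmt"
	"strings"
)

// hasCatL3 reports whether a /proc/cpuinfo "flags" line advertises
// CAT support, i.e. both the "rdt_a" and "cat_l3" flags are present.
func hasCatL3(flagsLine string) bool {
	flags := make(map[string]bool)
	for _, f := range strings.Fields(flagsLine) {
		flags[f] = true
	}
	return flags["rdt_a"] && flags["cat_l3"]
}

func main() {
	// Sample flags line from a CAT-capable CPU (abbreviated).
	line := "flags : fpu vme rdt_a cat_l3 cqm"
	fmt.Println(hasCatL3(line)) // true
}
```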

Why is Cache Allocation needed:

Cache Allocation Technology is useful for managing large server systems with a large L3 cache in cloud and container contexts, for example large servers running many instances of web servers or database servers. In such complex systems, cache subsets allow more careful placement of the available cache resources through a centralized, root-accessible interface.

The architecture also allows dynamically changing these subsets during runtime to further optimize the performance of the higher priority application with minimal degradation to the low priority app. Additionally, resources can be rebalanced for system throughput benefit.

Use cases for containers:

Figure 1: cat_case (image)

Note: Figure 1 is taken from section 17.17 of the Intel Software Developer's Manual.

Currently the Last Level Cache (LLC) on Intel Xeon platforms is the L3 cache, so LLC == L3 cache here.

Noisy neighbor issue:

A typical use case is solving the noisy neighbor problem in a container environment. For example, a streaming application running in one container may constantly copy data and access a linear address space larger than the L3 cache, evicting a large amount of cache that could otherwise have been used by a higher-priority computing application running in another container.

Using the cache allocation feature, the 'noisy neighbor' container running the streaming application can be confined to a smaller portion of the cache, and the higher-priority application can be awarded a larger amount of L3 cache space.

L3 cache QoS:

Another key scenario is large-scale container clusters, where a central scheduler or orchestrator controls resource allocations for a set of containers. Docker and runc use libcontainer to manage resources, and they could benefit from the Intel RDT cache allocation feature as a new resource constraint. Different cache subset strategies can be defined by setting different CLOS/CBM values in each container's runtime configuration. As a result, fine-grained L3 cache QoS (quality of service) can be achieved among containers.

Linux kernel interface for Intel RDT/CAT:

In Linux kernel 4.10 and newer, Intel RDT/CAT is supported with the kernel config option CONFIG_INTEL_RDT_A; in Linux kernel 5.1 and newer, the option is renamed to CONFIG_X86_CPU_RESCTRL.

Originally, the kernel interface for Intel RDT/CAT was an intel_rdt cgroup, but that solution was rejected by the kernel cgroup maintainer for several reasons, such as incompatibility with the cgroup hierarchy and limitations in some corner cases.

Currently, the kernel interface is defined and exposed via the "resource control" filesystem, which is a "cgroup-like" interface. The new design aligns better with the capabilities the hardware provides and addresses the issues of the cgroup-based interface.

Compared with cgroups, the interface offers a similar process management lifecycle and similar per-container interfaces. But unlike the cgroup hierarchy, it has a single-level filesystem layout.

Intel RDT "resource control" filesystem hierarchy:

mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
|   |-- L3
|       |-- cbm_mask
|       |-- min_cbm_bits
|       |-- num_closids
|-- cpus
|-- schemata
|-- tasks
|-- <container_id>
    |-- cpus
    |-- schemata
    |-- tasks

For runc, we can make use of the tasks and schemata files for L3 cache resource constraints.

The file tasks contains a list of task IDs that belong to this group (e.g., the <container_id> group). Tasks can be added to a group by writing the task ID to its tasks file; the kernel automatically removes them from the group they previously belonged to. New tasks created by fork(2) and clone(2) are added to the same group as their parent. If a pid is not in any sub-group, it is in the root group.

The file schemata contains the allocation bitmasks/values for the L3 cache on each socket; each entry consists of an L3 cache id and a capacity bitmask (CBM).

    Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..."

For example, on a two-socket machine, L3's schema line could be L3:0=ff;1=c0 which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0.
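For illustration, the schemata format above can be parsed with a small Go helper (a sketch under the stated format, not runc's actual implementation; parseL3Schema is a hypothetical name):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseL3Schema parses a schemata line such as "L3:0=ff;1=c0" into a
// map from L3 cache id to its CBM value.
func parseL3Schema(line string) (map[int]uint64, error) {
	if !strings.HasPrefix(line, "L3:") {
		return nil, fmt.Errorf("not an L3 schema line: %q", line)
	}
	cbms := make(map[int]uint64)
	for _, entry := range strings.Split(strings.TrimPrefix(line, "L3:"), ";") {
		parts := strings.SplitN(entry, "=", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("malformed entry %q", entry)
		}
		cacheID, err := strconv.Atoi(parts[0])
		if err != nil {
			return nil, err
		}
		cbm, err := strconv.ParseUint(parts[1], 16, 64) // CBMs are hex
		if err != nil {
			return nil, err
		}
		cbms[cacheID] = cbm
	}
	return cbms, nil
}

func main() {
	cbms, err := parseL3Schema("L3:0=ff;1=c0")
	if err != nil {
		panic(err)
	}
	fmt.Printf("cache 0: %#x, cache 1: %#x\n", cbms[0], cbms[1])
	// cache 0: 0xff, cache 1: 0xc0
}
```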

A valid L3 cache CBM is a contiguous set of bits, and the number of bits that can be set is at most the maximum CBM length, which varies among supported Intel Xeon platforms. In the Intel RDT "resource control" filesystem layout, the CBM in a group must be a subset of the CBM in the root group; the kernel checks validity on write. For example, 0xfffff in the root indicates that the CBM is at most 20 bits, mapping to the entire L3 cache capacity. Some valid CBM values to set in a group: 0xf, 0xf0, 0x3ff, 0x1f00, etc.
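The validity rules just described (non-empty, contiguous bits, subset of the root CBM) can be expressed as a small Go check; validCBM is a hypothetical helper used only to illustrate what the kernel enforces on write:

```go
package main

import (
	"fmt"
	"math/bits"
)

// validCBM reports whether cbm is a non-empty, contiguous run of bits
// that is a subset of rootCBM, mirroring the kernel's write-time checks.
func validCBM(cbm, rootCBM uint64) bool {
	if cbm == 0 || cbm&^rootCBM != 0 {
		return false // empty, or sets bits outside the root CBM
	}
	// Shift out trailing zeros; a contiguous mask then looks like 0b111...1,
	// so ANDing it with itself+1 yields zero.
	shifted := cbm >> bits.TrailingZeros64(cbm)
	return shifted&(shifted+1) == 0
}

func main() {
	root := uint64(0xfffff) // 20-bit max CBM, as in the example above
	fmt.Println(validCBM(0xf0, root))     // true: contiguous subset
	fmt.Println(validCBM(0x5, root))      // false: bits not contiguous
	fmt.Println(validCBM(0x3fffff, root)) // false: exceeds root CBM
}
```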

For more information about Intel RDT/CAT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt

An example for runc:

Consider a two-socket machine with two L3 caches where the default CBM is
0xfffff and the max CBM length is 20 bits. With the following configuration,
tasks inside the container only have access to the "upper" 80% of L3 cache id 0
and the "lower" 50% of L3 cache id 1:

"linux": {
	"intelRdt": {
		"l3CacheSchema": "L3:0=ffff0;1=3ff"
	}
}

Proposal and design - components:

1. Intel RDT/CAT support in OCI/runtime-spec:

Status: PR opencontainers/runtime-spec#630 has been merged.
This is the prerequisite of this proposal.

  • Add Intel RDT/CAT L3 cache resources to the Linux-specific configuration in config.json.

2. Intel RDT/CAT support in OCI/runc

Status: PR #1279 has been merged.
This is the prerequisite of this proposal. It mainly focuses on Intel RDT/CAT infrastructure support in runc/libcontainer:

  • Add package intelrdt as a new infrastructure component in libcontainer. It implements the IntelRdtManager interface to handle the intelrdt framework:
    • Apply()
    • GetStats()
    • Destroy()
    • GetPath()
    • Set()
  • Add intelRdtManager in linuxContainer struct, and invoke Intel RDT/CAT operations in process management (initProcess, setnsProcess) functions:
    • Apply()
    • GetStats()
    • Destroy()
    • GetPath()
    • Set()
  • Add IntelRdtManager hook function to configure a LinuxFactory to return containers which could create and manage Intel RDT/CAT L3 cache resources:
    • loadFactory()
    • Create()
    • Load()
  • Add Intel RDT/CAT entries in libcontainer/configs.
  • Add runtime-spec configuration handler in CreateLibcontainerConfig().
  • Add Intel RDT/CAT stats metrics in libcontainer/intelrdt.
  • Add Intel RDT/CAT unit test cases in libcontainer/intelrdt.
  • Add Intel RDT/CAT test cases in libcontainer.
  • Add runc documentations for Intel RDT/CAT.

TODO list - Intel RDT/CAT support in Docker

3. Intel RDT/CAT support in containerd

4. Intel RDT/CAT support in Docker Engine (moby/moby)

5. Intel RDT/CAT support in Docker CLI

When Intel RDT/CAT is ready in libcontainer, Docker could naturally make use of libcontainer to support L3 cache allocation for container resource management. Some potential work to do in Docker:

  • Add docker run options to support Intel RDT/CAT.
  • Add Docker configuration file options to support Intel RDT/CAT.
  • Add docker client/daemon APIs to support Intel RDT/CAT.
  • Add Intel RDT/CAT functions in containerd.
  • Add Intel RDT/CAT functions in docker engine.
  • Add Intel RDT/CAT L3 cache metrics in docker stats.
  • Add Docker documentations for Intel RDT/CAT.

TODO list - Intel RDT/CDP support in runc

As a specialized extension of CAT, Code and Data Prioritization (CDP) enables separate control over code and data placement in the L3 cache. Certain specialized workloads may benefit from increased runtime determinism, enabling greater predictability in application performance.

The Linux kernel CDP patch is part of the CAT patch series. We can add this functionality to runc as well.

Obsolete design based on the cgroup interface (kept for reference only)

The following content is kept only for reference. The original design, based on the kernel cgroup interface, is obsolete because the kernel cgroup interface patch was rejected.

### L3 cache QoS through cgroup:
Another key scenario is large-scale container clusters, where a central scheduler or orchestrator controls resource allocations for a set of containers. In today's resource management, cgroups are widely used, and a significant amount of plumbing in user space is already in place to allocate and configure resources dynamically and statically.

Docker and runc use the cgroups interface via libcontainer to manage resources. They could benefit from the cache allocation feature because the cgroup interface is easily adaptable to L3 cache allocation. Different cache subset strategies can be defined by setting different CLOS/CBM values in each container's runtime configuration. As a result, fine-grained L3 cache QoS (quality of service) can be achieved among containers.

## Linux Kernel intel_rdt cgroup interface:

In Linux kernel 4.6 (or later), the new cgroup subsystem 'intel_rdt' *will be* added soon with kernel config CONFIG_INTEL_RDT.
The latest Intel Cache Allocation Technology (CAT) kernel patch:
https://lkml.org/lkml/2015/10/2/72

The different L3 cache subsets are identified by CLOS identifier (class of service) and each CLOS has a CBM (cache bit mask). The CBM is a contiguous set of bits which defines the amount of cache resource that is available for each 'subset'.

The max CBM, which maps to the entire L3 cache, is indicated by *intel_rdt.l3_cbm* in the root node. The value varies among supported Intel platforms (for example, intel_rdt.l3_cbm == 0xfffff means the max CBM is 20 bits). The *intel_rdt.l3_cbm* in a child cgroup is inherited from its parent cgroup by default and can be changed by the user later.

Suppose the L3 cache is 55 MB and the max CBM is 20 bits. The following assigns 11 MB (1/5, i.e. 4 of 20 bits) of L3 cache to each of group1 and group2, with no overlap between them:

$ mount -t cgroup -ointel_rdt intel_rdt /sys/fs/cgroup/intel_rdt
$ cd /sys/fs/cgroup/intel_rdt
$ mkdir group1
$ mkdir group2
$ cd group1
$ /bin/echo 0xf > intel_rdt.l3_cbm
$ cd ../group2
$ /bin/echo 0xf0 > intel_rdt.l3_cbm

Assign tasks to group1 and group2 by writing PIDs to each group's tasks file:

$ /bin/echo PID1 > tasks
$ /bin/echo PID2 > tasks
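The cache-size arithmetic in the example above (each CBM bit represents an equal share of the total cache) can be sketched in Go; cbmToMB is a hypothetical helper for illustration only:

```go
package main

import (
	"fmt"
	"math/bits"
)

// cbmToMB returns how many MB of L3 cache a CBM grants, given the total
// cache size and the max CBM width: each bit is an equal share of the cache.
func cbmToMB(cbm uint64, cacheMB float64, maxBits int) float64 {
	return cacheMB * float64(bits.OnesCount64(cbm)) / float64(maxBits)
}

func main() {
	// 55 MB cache, 20-bit max CBM: 0xf and 0xf0 are 4 bits each -> 11 MB each.
	fmt.Println(cbmToMB(0xf, 55, 20))  // 11
	fmt.Println(cbmToMB(0xf0, 55, 20)) // 11
}
```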

The Linux kernel cgroup infrastructure also supports mounting the cpuset and intel_rdt cgroup subsystems together, so L3 cache allocation can be configured to align with CPU affinity per-core or per-socket.

The intel_rdt cgroup has zero or minimal hot-path overhead in the following unsupported cases:
* The kernel patch does not exist on non-Intel platforms.
* On Intel platforms, the subsystem does not exist unless CONFIG_INTEL_RDT is enabled.
* It remains a no-op when CONFIG_INTEL_RDT is enabled but the Intel hardware does not support the feature.

## intel_rdt cgroup support in github.com/opencontainers/specs

This is the prerequisite of this proposal. I will open a new issue or pull request in *github.com/opencontainers/specs* soon.
* Add intel_rdt cgroup resources to the Linux-specific runtime configuration in ```runtime.json```.

## intel_rdt cgroup support in runc/libcontainer

This proposal is mainly focused on intel_rdt cgroup infrastructure in libcontainer:

* Add *package IntelRdtGroup* to implement new subsystem interface in libcontainer/cgroups/fs:
  * *Name()*
  * *GetStats()*
  * *Remove()*
  * *Apply()*
  * *Set()*
* Add *IntelRdtGroup* in cgroup subsystemSet in libcontainer/cgroups/fs.
* Add intel_rdt cgroup unit tests in libcontainer/cgroups/fs.
* Add intel_rdt cgroup stats metrics in libcontainer/cgroup.
* Add systemd cgroup specific functions in libcontainer/cgroups/systemd.
  * Add *IntelRdtGroup* in cgroup subsystemSet in libcontainer/cgroups/systemd.
  * Add *joinIntelRdt()* function in *Apply()*
* Add intel_rdt cgroup entries in libcontainer/configs.
* Add intel_rdt libcontainer integration tests in libcontainer/integration.
* Add runc documentations for intel_rdt cgroup.

## intel_rdt cgroup support in Docker (TODO)
When the intel_rdt cgroup is ready in libcontainer, Docker could naturally make use of libcontainer as its native execution driver to support L3 cache allocation for container resource management.

Some potential work to do in Docker in future:
* Add *docker run* options to support intel_rdt.
* Add *docker client/daemon* APIs to support intel_rdt.
* Add intel_rdt functions in *docker execution driver*.
* Add intel_rdt cgroup metrics in *docker stats*.
* Add Docker documentations for intel_rdt cgroup.
@mrunalp
Contributor

mrunalp commented Dec 17, 2015

Sure, we can add the support to runc once the spec PR is merged.

@vishh
Contributor

vishh commented Dec 22, 2015

+1

xiaochenshen added a commit to xiaochenshen/runc that referenced this issue Dec 24, 2015
This PR fixes issue opencontainers#433
opencontainers#433

About Intel RDT/CAT feature:
Intel platforms with new Xeon CPU support Resource Director Technology (RDT).
Intel Cache Allocation Technology (CAT) is a sub-feature of RDT. Currently L3
Cache is the only resource that is supported in RDT.

This feature provides a way for the software to restrict cache allocation to a
defined 'subset' of L3 cache which may be overlapping with other 'subsets'.
The different subsets are identified by class of service (CLOS) and each CLOS
has a capacity bitmask (CBM).

More information can be found in the section 17.16 of Intel Software Developer
Manual.

About intel_rdt cgroup:
Linux kernel 4.6 (or later) will introduce new cgroup subsystem 'intel_rdt'
with kernel config CONFIG_INTEL_RDT.

The 'intel_rdt' cgroup manages L3 cache allocation. It has a file 'l3_cbm'
which represents the L3 cache capacity bitmask (CBM). The CBM needs to have
only *contiguous bits set* and number of bits that can be set is less than the
max bits. The max bits in the CBM is varied among supported Intel platforms.
The tasks belonging to a cgroup get to fill in the L3 cache represented by
the CBM. For example, if the max bits in the CBM is 10 and the L3 cache size
is 10MB, each bit represents 1MB of the L3 cache capacity.

Root cgroup always has all the bits set in the l3_cbm. User can create more
cgroups with mkdir syscall. By default the child cgroups inherit the CBM from
parent. User can change the CBM specified in hex for each cgroup.

For more information about intel_rdt cgroup:
https://lkml.org/lkml/2015/10/2/74

An example:
    Root cgroup: intel_rdt.l3_cbm == 0xfffff, the max CBM is 20 bits
    L3 cache size: 55 MB
This assigns 11 MB (1/5) of L3 cache to the child group:
    $ /bin/echo 0xf > intel_rdt.l3_cbm

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
xiaochenshen added a commit to xiaochenshen/runc that referenced this issue Dec 24, 2015
xiaochenshen added a commit to xiaochenshen/runc that referenced this issue Dec 24, 2015
xiaochenshen added a commit to xiaochenshen/runc that referenced this issue Dec 24, 2015
xiaochenshen added a commit to xiaochenshen/runc that referenced this issue Dec 26, 2015
xiaochenshen added a commit to xiaochenshen/runc that referenced this issue Feb 14, 2016
xiaochenshen added a commit to xiaochenshen/runc that referenced this issue Feb 14, 2016
xiaochenshen added a commit to xiaochenshen/runc that referenced this issue Feb 14, 2016
xiaochenshen added a commit to xiaochenshen/runc that referenced this issue Feb 14, 2016
@xiaochenshen
Copy link
Contributor Author

@oci maintainers @mrunalp @vishh @hqhq @LK4D4 @crosbymichael @philips @vbatts
I will be traveling to DockerCon 2016 in Seattle on June 20 and 21.
May I have the opportunity to talk with some of you face to face (about this proposal, as well as PR #447 and opencontainers/runtime-spec#267)?

@cyphar
Copy link
Member

cyphar commented Jun 16, 2016

@xiaochenshen I won't be there unfortunately (university exams). I will be speaking at ContainerCon Japan (13-15 July 2016) about the rootless container stuff we're doing in runC, so if you're going to that we can meet face-to-face.

/cc @opencontainers/runc-maintainers

@hqhq
Copy link
Contributor

hqhq commented Jun 16, 2016

I'll be there. You're welcome to stop by Huawei's booth; you'll probably find me there and we can have a talk.

@xiaochenshen
Copy link
Contributor Author

@hqhq Thanks, see you at DockerCon.

@xiaochenshen
Copy link
Contributor Author

@cyphar Thanks. I am interested in rootless containers, but I am not sure whether I can attend ContainerCon Japan.

@cyphar
Copy link
Member

cyphar commented Jun 18, 2016

I'll post the talk slides and link to the talk recording on the dev@opencontainers.org mailing list once they're up.

@xiaochenshen
Copy link
Contributor Author

@crosbymichael @hqhq Nice meeting you at DockerCon! And thank you for your suggestions.

The Intel RDT/CAT kernel patch is likely to change to a non-cgroup interface. This proposal will be updated accordingly, but I will figure out whether we can still keep the "runtime resource constraints" structure aligned with the OCI runtime-spec.

@xiaochenshen xiaochenshen changed the title Proposal: Intel RDT/CAT cgroup support in runc/libcontainer Proposal: Intel RDT/CAT ~~cgroup~~ support in runc/libcontainer Aug 8, 2016
@xiaochenshen xiaochenshen changed the title Proposal: Intel RDT/CAT ~~cgroup~~ support in runc/libcontainer Proposal: Intel RDT/CAT cgroup support in runc/libcontainer Aug 8, 2016
xiaochenshen added a commit to xiaochenshen/runtime-spec that referenced this issue Aug 9, 2016
Add support for Intel Resource Director Technology (RDT) / Cache Allocation
Technology (CAT). Add L3 cache resource constraints in Linux-specific
configuration.

This is the prerequisite of this runc proposal:
opencontainers/runc#433

For more information about Intel RDT/CAT, please refer to:
opencontainers/runc#433

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
xiaochenshen added a commit to xiaochenshen/runc that referenced this issue Aug 9, 2016
This PR fixes issue opencontainers#433
opencontainers#433

About Intel RDT/CAT feature:
Intel platforms with newer Xeon CPUs support Resource Director Technology (RDT).
Intel Cache Allocation Technology (CAT) is a sub-feature of RDT. Currently L3
Cache is the only resource that is supported in RDT.

This feature provides a way for software to restrict cache allocation to a
defined 'subset' of the L3 cache, which may overlap with other 'subsets'.
The different subsets are identified by class of service (CLOS) and each CLOS
has a capacity bitmask (CBM).

More information about Intel RDT/CAT can be found in Section 17.17 of the
Intel Software Developer's Manual and in the kernel documentation:
https://lkml.org/lkml/2016/7/12/747

About Intel RDT/CAT kernel interface:
Intel Cache Allocation Technology (CAT) is a sub-feature of Resource Director
Technology (RDT), which currently supports L3 cache resource allocation.

In the Linux kernel, it is exposed via the "resource control" filesystem,
which is a "cgroup-like" interface.

Intel RDT "resource control" filesystem hierarchy:
/sys/fs/rscctrl
|-- cpus
|-- info
|   |-- info
|   |-- l3
|       |-- domain_to_cache_id
|       |-- max_cbm_len
|       |-- max_closid
|-- schemas
|-- tasks
|-- <container_id>
    |-- cpus
    |-- schemas
    |-- tasks

The file `tasks` lists all task ids belonging to the partition "container_id".
Task ids can be moved between partitions, but a task id stays in only one
directory at a time.

The file `schemas` holds the L3 cache allocation masks for each socket; each
entry contains an L3 cache id and a capacity bitmask (CBM).
	Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..."
For example, on a two-socket machine, the L3 schema line could be `L3:0=ff;1=c0`,
which means L3 cache id 0's CBM is 0xff and L3 cache id 1's CBM is 0xc0.

A valid L3 cache CBM is a *contiguous* run of set bits, and the number of bits
that can be set is bounded by a maximum that varies among supported Intel Xeon
platforms. In the Intel RDT "resource control" filesystem layout, the CBM in a
"partition" should be a subset of the CBM in the root; the kernel validates it
on write. For example, 0xfffff in the root indicates that the maximum CBM
length is 20 bits, mapping to the entire L3 cache capacity. Some valid CBM
values to set in a "partition" are 0xf, 0xf0, 0x3ff, and 0x1f00.
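
As a sketch of the two rules above — the schema-line format and the contiguous-CBM requirement — here is some illustrative Go; the helper names `parseL3Schema` and `isValidCBM` are invented for this example, not runc's actual code:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// isValidCBM reports whether cbm is a non-empty, contiguous run of set
// bits that fits inside the root CBM (e.g. 0xfffff for a 20-bit mask).
func isValidCBM(cbm, rootCBM uint64) bool {
	if cbm == 0 || cbm&^rootCBM != 0 {
		return false
	}
	// Strip trailing zeros; a contiguous mask is then of the form 2^n - 1.
	for cbm&1 == 0 {
		cbm >>= 1
	}
	return cbm&(cbm+1) == 0
}

// parseL3Schema parses a line like "L3:0=ff;1=c0" into cache id -> CBM.
func parseL3Schema(line string) (map[int]uint64, error) {
	if !strings.HasPrefix(line, "L3:") {
		return nil, fmt.Errorf("not an L3 schema line: %q", line)
	}
	cbms := make(map[int]uint64)
	for _, entry := range strings.Split(strings.TrimPrefix(line, "L3:"), ";") {
		kv := strings.SplitN(entry, "=", 2)
		if len(kv) != 2 {
			return nil, fmt.Errorf("malformed entry: %q", entry)
		}
		id, err := strconv.Atoi(kv[0])
		if err != nil {
			return nil, err
		}
		cbm, err := strconv.ParseUint(kv[1], 16, 64)
		if err != nil {
			return nil, err
		}
		cbms[id] = cbm
	}
	return cbms, nil
}

func main() {
	cbms, err := parseL3Schema("L3:0=ff;1=c0")
	fmt.Println(cbms, err)                  // map[0:255 1:192] <nil>
	fmt.Println(isValidCBM(0xf0, 0xfffff))  // true  (contiguous bits)
	fmt.Println(isValidCBM(0xf0f, 0xfffff)) // false (gap in the mask)
}
```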

The file `cpus` contains a CPU mask that specifies the CPUs bound to the
schemas. Any tasks scheduled on those CPUs will use the schemas.

Compared with cgroups, intelRdt has a similar process management lifecycle
and similar interfaces in a container. But unlike the cgroups hierarchy, it
has a single-level filesystem layout. Once intelRdt is joined, statistics can
be collected from a container.

For more information about Intel RDT/CAT kernel interface:
https://lkml.org/lkml/2016/7/12/764

An example:
On a two-socket machine there are two L3 caches, the default CBM is 0xfffff,
and the maximum CBM length is 20 bits. This configuration assigns 4/5 of L3
cache id 0 and the whole of L3 cache id 1 to the container:
"linux": {
	"resources": {
		"intelRdt": {
			"l3CacheSchema": "L3:0=ffff0;1=fffff",
			"L3CacheCpus": "00000000,00000000,00000000,00000000,00000000,00000000"
		}
	}
}

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
@xiaochenshen
Copy link
Contributor Author

@cyphar @crosbymichael @hqhq @mrunalp @vishh
/cc @opencontainers/runc-maintainers

Design proposal updates (2017-01-18)

To address @crosbymichael's and @cyphar's comments #1198 (comment) and #1198 (comment), the design has been updated:

It adds a new "ResourceManager" interface as the base for all resource managers, such as the cgroups manager and the incoming IntelRdt manager.

All registered resource managers are consolidated in the linuxContainer structure, so unified operations (e.g., Apply(), Set(), Destroy()) can be applied across all of them.

Currently, the cgroups manager is the only resource manager in libcontainer. Linux kernel 4.10 will introduce the Intel RDT/CAT feature, whose kernel interface is exposed via the "resource control" filesystem, a cgroup-like interface. To support Intel RDT/CAT in libcontainer, we need a new resource manager (the IntelRdt manager) outside cgroups.

The PRs to implement the design:

  1. runtime-spec: specs-go/config: add Intel RDT/CAT Linux support runtime-spec#630
  2. runc: libcontainer: add support for Intel RDT/CAT in runc #1279
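
A minimal Go sketch of this design, with hypothetical names (`ResourceManager`, `intelRdtManager`) that only approximate the eventual libcontainer types:

```go
package main

import "fmt"

// ResourceManager is the base interface shared by all resource managers
// (the cgroups manager and the new IntelRdt manager), as described above.
// Names and method sets here are illustrative, not libcontainer's API.
type ResourceManager interface {
	Apply(pid int) error // place the container process into the group
	Set() error          // write the configured resource constraints
	Destroy() error      // remove the group
}

// intelRdtManager would manage the container's directory under the
// "resource control" filesystem; here it only logs what it would do.
type intelRdtManager struct{ id string }

func (m *intelRdtManager) Apply(pid int) error {
	fmt.Printf("add pid %d to %s/tasks\n", pid, m.id)
	return nil
}

func (m *intelRdtManager) Set() error {
	fmt.Printf("write l3CacheSchema into %s/schemas\n", m.id)
	return nil
}

func (m *intelRdtManager) Destroy() error {
	fmt.Printf("rmdir %s\n", m.id)
	return nil
}

func main() {
	// The container consolidates all registered managers and applies
	// unified operations across them, as the design update describes.
	managers := []ResourceManager{&intelRdtManager{id: "container_1"}}
	for _, m := range managers {
		if err := m.Apply(1234); err != nil {
			fmt.Println("apply failed:", err)
		}
	}
}
```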

@rektide
Copy link

rektide commented Feb 20, 2017

This is really good & interesting work & I'm glad it's happening. This topic is perhaps more appropriate for somewhere like the LKML, but I do think it's very scary that RDT is implemented as a parallel resource controller to cgroups. From a total layman's perspective, it seems horrifically sad that it was not implemented within the existing cgroups framing. Quote:

The kernel interface of Intel RDT/CAT is defined and exposed via "resource control" filesystem, which is a "cgroup-like" interface. By nature, the filesystem interface is "resource control group" for L3 cache. Comparing with cgroup, the interface has similar process management lifecycle and interfaces in a container (e.g. group directory operations, file tasks and etc.).

Should be amazingly useful tech to push load sharing to far greater heights, but it really disturbs me a lot that it's an entirely parallel system to what Linux and containers have built themselves upon so far, cgroups.

@xiaochenshen
Copy link
Contributor Author

@rektide
FYI - there was a very long debate on the kernel interface for Intel RDT/CAT on the Linux Kernel Mailing List (LKML). In fact, the original kernel patches submitted to the LKML used a cgroup interface, and based on that, I submitted the first version of the runC patch in #447. But during code review, some Linux kernel maintainers (e.g., the cgroup and x86 maintainers) rejected the cgroup patch for several reasons, such as the hardware capability's incompatibility with the cgroup hierarchy and the limited granularity of resource control in some use cases. After a long discussion, the Linux kernel community reached a consensus: the resource control filesystem, rather than cgroup, is the acceptable interface. You can find how the new kernel interface improves on these issues here: https://lkml.org/lkml/2016/7/12/746

Frankly, most people in the container world (including me) would prefer cgroups over yet another kernel interface. But as a fait accompli, the CAT kernel patch with the resource control filesystem interface was merged into the upstream Linux kernel in 4.10. What we are doing in this issue is enabling the Intel RDT/CAT feature in runc based on the new Linux kernel interface.
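The resource control filesystem mentioned above expresses each group's cache allocation through its schemata file, whose L3 line maps cache IDs (sockets) to capacity bitmasks, e.g. `L3:0=7ff0;1=3c` (format per the kernel documentation linked in this thread). As a rough illustration, here is a Go sketch of parsing such a line; `parseL3Schemata` is a hypothetical helper for this example, not runc code:

```go
package main

import (
	"fmt"
	"strings"
)

// parseL3Schemata parses one "L3:..." line from a resctrl group's
// schemata file, e.g. "L3:0=7ff0;1=3c" -> {0: "7ff0", 1: "3c"}.
// Each entry maps a cache ID (socket) to a capacity bitmask (CBM).
func parseL3Schemata(line string) (map[int]string, error) {
	line = strings.TrimSpace(line)
	if !strings.HasPrefix(line, "L3:") {
		return nil, fmt.Errorf("not an L3 schemata line: %q", line)
	}
	masks := make(map[int]string)
	for _, entry := range strings.Split(strings.TrimPrefix(line, "L3:"), ";") {
		var id int
		var cbm string
		if _, err := fmt.Sscanf(entry, "%d=%s", &id, &cbm); err != nil {
			return nil, fmt.Errorf("bad schemata entry %q: %w", entry, err)
		}
		masks[id] = cbm
	}
	return masks, nil
}

func main() {
	masks, err := parseL3Schemata("L3:0=7ff0;1=3c")
	if err != nil {
		panic(err)
	}
	fmt.Println(masks) // map[0:7ff0 1:3c]
}
```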

xiaochenshen added a commit to xiaochenshen/runtime-spec that referenced this issue Mar 6, 2017
Add support for Intel Resource Director Technology (RDT) / Cache Allocation
Technology (CAT). Add L3 cache resource constraints in Linux-specific
configuration.

This is the prerequisite of this runc proposal:
opencontainers/runc#433

For more information about Intel RDT/CAT, please refer to:
opencontainers/runc#433

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
xiaochenshen added a commit to xiaochenshen/runtime-spec that referenced this issue Mar 8, 2017
xiaochenshen added a commit to xiaochenshen/runtime-spec that referenced this issue Mar 10, 2017
stefanberger pushed a commit to stefanberger/runc that referenced this issue Sep 8, 2017
This was raised during reviews with folks working on Windows Containers.  

This squashes commits from PR opencontainers#433

Signed-off-by: Rob Dolin <RobDolin@microsoft.com>
@xiaochenshen xiaochenshen changed the title Proposal: Intel RDT/CAT support in runc/libcontainer Proposal: Intel RDT/CAT support for OCI/runc and Docker Sep 11, 2017
@CodeJuan

@xiaochenshen
What is the limit on the number of subdirectories of /sys/fs/resctrl? Could I create more than 16 subdirectories of /sys/fs/resctrl?
Thank you very much.

@xiaochenshen (author)

@CodeJuan

What is the limit on the number of subdirectories of /sys/fs/resctrl? Could I create more than 16 subdirectories of /sys/fs/resctrl?
Thank you very much.

The cache allocation limit depends on the Intel CPU model. You can read this limit from /sys/fs/resctrl/info/L3/num_closids, and you can create up to (num_closids - 1) RDT CTRL_MON groups, because one CLOSID is reserved for the root group.

For more details, please refer to Linux kernel Intel RDT documentation:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
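To make the arithmetic above concrete, here is a small Go sketch that turns the raw contents of the num_closids file into the number of creatable groups; `usableGroups` is a hypothetical helper for illustration, not runc code:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// usableGroups returns how many CTRL_MON group directories can be
// created, given the raw contents of
// /sys/fs/resctrl/info/L3/num_closids. One CLOSID is reserved for
// the root (default) group, so the answer is num_closids - 1.
func usableGroups(numClosidsFile string) (int, error) {
	n, err := strconv.Atoi(strings.TrimSpace(numClosidsFile))
	if err != nil {
		return 0, fmt.Errorf("bad num_closids contents: %w", err)
	}
	return n - 1, nil
}

func main() {
	// "16\n" is what a CPU with 16 CLOSIDs would report.
	n, err := usableGroups("16\n")
	if err != nil {
		panic(err)
	}
	fmt.Println(n) // 15
}
```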

@CodeJuan commented Jan 15, 2018

@xiaochenshen

The cache allocation limit depends on the Intel CPU model. You can read this limit from /sys/fs/resctrl/info/L3/num_closids, and you can create up to (num_closids - 1) RDT CTRL_MON groups, because one CLOSID is reserved for the root group.
For more details, please refer to Linux kernel Intel RDT documentation:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt

Thanks for your kind reply.
I found that IntelRdtManager creates a new directory (group) for each new container, so how can I create more containers than num_closids allows?
I have tried creating a group A before starting the containers, then writing each container's PID 1 to /sys/fs/resctrl/A/tasks. I created num_closids containers this way, and it works. Am I doing this right?

@xiaochenshen (author)

@CodeJuan

I found that IntelRdtManager creates a new directory (group) for each new container, so how can I create more containers than num_closids allows?
I have tried creating a group A before starting the containers, then writing the PID 1 of multiple containers to /sys/fs/resctrl/A/tasks. It works.

Thanks for pointing this out. I have thought about it before; I know that sharing an RDT group between containers is a real use case.
Intel RDT has a hardware limit (num_closids) on the number of RDT control groups. At present, however, we do not support group sharing between containers, for several reasons. One of them is a security concern: we would have to expose more resctrl filesystem details of the host, or of other containers, than expected.

I have added it to my TODO list. I hope we can find a good trade-off in the future.

@crosbymichael (member)

I think this can be closed since it has been implemented

@xiaochenshen (author)

@crosbymichael

Thank you, Michael.
I will continue to update the description of this proposal to track the TODO list for Docker support:

TODO list - Intel RDT/CAT support in Docker:
3. Intel RDT/CAT support in containerd
4. Intel RDT/CAT support in Docker Engine (moby/moby)
5. Intel RDT/CAT support in Docker CLI
