Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: CPU Affinity and NUMA Topology Awareness #171

Closed
wants to merge 1 commit into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Proposal: CPU Affinity and NUMA Topology Awareness
  • Loading branch information
derekwaynecarr committed Dec 13, 2016
commit daa68fb322b5d4f765bb81382fc39f4c77d9b972
216 changes: 216 additions & 0 deletions contributors/design-proposals/cpu-affinity-numa-awareness.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,216 @@
**Author**: Derek Carr

**Last** **Updated**: 12/12/2016

**Status**: Pending Review

# CPU Affinity and NUMA Topology Awareness

This proposal describes enhancements to the Kubernetes API to improve the
utilization of compute resources for containers that have a need to avoid
cross NUMA node memory access by containers.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/by containers//


This proposal describes the set of changes recommended to support NUMA
awareness in the `Node` and `Pod` API to support this feature. It does not
yet prescribe specifically the actor that performs NUMA aware scheduling
decisions, instead it focuses purely on the underlying primitives required
to support building more advanced scheduling capabilities.

## What is NUMA?

Non-uniform memory architecture (NUMA) describes multi-socket machines that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-uniform memory architecture => Non-Uniform Memory Architecture (NUMA)
Very minor change, enhances readability.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A in NUMA is "Access" https://en.wikipedia.org/wiki/Non-uniform_memory_access

Should it be "Non-Uniform Memory Access (NUMA) architecture" ?

subdivide memory into nodes where each node is associated with a list of
CPU cores. This architecture is the norm for modern machines.

An interconnect bus provides connections between nodes so each CPU can
access all memory. The interconnect can be overwhelmed by concurrent
cross-node traffic, and as a result, processes that need to access memory
on a different node can experience increased latency.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NB, the increased latency of cross-node memory access exists regardless of whether the interconnect is overwhelmed or not - remote node memory access always has a penalty based on the distance between nodes. For example, if local node distance is 10 and distance to remote node is 15, then memory access will always be at least x1.5 slower.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@berrange - agreed. will clean up text.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe "a process running on a different node from a memory bank it needs to access can experience increased latency"


As a result, many applications see a performance benefit when the workload
is affined to a particular NUMA node and CPU core(s).

## NUMA topology

In order to support NUMA affined workloads, the `Node` must make its
NUMA topology available for introspection by other agents that schedule
pods.

This proposal recommends that the `NodeStatus` is augmented as follows:

```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use "```go"?

// NodeStatus is information about the current status of a node.
type NodeStatus struct {
...
// Topology represents the NUMA topology of a node to aid NUMA aware scheduling.
// +optional
Topology NUMATopology
}

// NUMATopology describes the NUMA topology of a node.
type NUMATopology struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the convention of "labels everywhere", would you embed k8s.io/kubernetes/pkg/apis/meta/v1.TypeMeta here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TypeMeta is more appropriate for top-level objects in the API, the topology is embedded in Node, maybe I am missing a use case you had for labeling of numa nodes within a Node? Isn't the id sufficient?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for not including the use case. I was thinking of representing shared (but not consumable) per-numa-node resources like device locality. Anyway, in that case it would be more appropriate as part of NUMANode.

// NUMANodes represents the list of NUMA nodes in the topology.
NUMANodes []NUMANode
}

// NUMANode describes a single NUMA node.
type NUMANode struct {
// Identifies a NUMA node on a single host.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW the way this is structured would complicate the scheduler which currently rips through capacities for most of it's matching. If it's possible to expand capacities that might make this easier.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@timothysc -- i am open to ideas on how to flatten, but we still need to know the topology to know what cores are in what node, the siblings, and their distances (possibly) in future.

NUMANodeID string
// Capacity represents the total resources associated to the NUMA node.
// cpu: 4 <number of cores>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarification: is this intended to represent the logical CPU ids as reported by Linux, or only the number of physical cores on the socket?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i had intended for this to align with what was returned in the following:

$ sudo lscpu | grep NUMA
NUMA node(s):          1
NUMA node0 CPU(s):     0-3

so in that case it was logical and not physical cores. this was consistent with what we do today when reporting CPU counts in cAdvisor via $ cat /proc/cpuinfo | grep 'processor\t*' | wc -l

i agree we will need to know what are siblings in the numa topology.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was consistent with what we do today when reporting CPU counts in cAdvisor

Is this (logical) also what is returned in "cpu" in Allocatable and Capacity in NodeStatus?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidopp -- yes. the node returns logical cores to in "cpu" in Allocatable and Capacity.

// memory: <amount of memory in normal page size>
// hugepages: <amount of memory in huge page size>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs much more work to correctly future proof it to cope well with huge pages. Historically x86 only had 1 huge page size (2MB or 4MB depending on kernel address mode), but these days you can also have 1 GB huge pages. On PPC architecture there are a huge number of page sizes available. Any single host can have multiple page sizes available concurrently.

So memory needs to be represented as a struct in its own right eg something more like

type PageInfo struct {
PageSize uint
Count uint64
Used uint64
}

Then the node would have an array of info Memory []PageInfo

The 'cpu: 4' count seems redundant given that you have a CPUSet string below from which you could easily determine the count if represented as an array instead of string.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally, we have discussed the need to have a mechanism to describe the attributes of a resource better than what we have in the current model. usage information fluctuates so we tend to not capture in the node status and instead would use a monitoring pipeline for that information. the pagesize will need to be reflected. its possible the convention will be hugepages.<size>: X. Mostly trying to capture the relationship between NUMA and hugepages that you have called out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I agree with Derek here, they could be represented as separate entries in the resource list. Given that pagesize could be represented by a conventional resource name, nr_hugepages and free_hugepages can be derived from the values in capacity and allocatable.

Also agree that the better way to address the need for more expressive resource types is to iterate the resource model versus working around it with many custom fields. By the same token, the CPUSet field could be represented as a Range type if it were available.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as there's a way to represent multiple different huge pages sizes concurrently I'm ok - if we want to use a flat structure and so have multiple "hugepages.: X", that'll work fine

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or, and I know this is contrarian, we don't. or maybe we don't YET. We can not represent every facet of every hardware system.

Copy link
Contributor

@jaypipes jaypipes Dec 28, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thockin Amen. I'm kind of chuckling as I read some of this (and the huge page proposal) thinking how much crazy code went into OpenStack Nova for all of this NFV (or NFV-centric) functionality. It made (and continues to make) the scheduler and resource tracking subsystems in Nova a total pain in the behind to read and work with.

Also worth noting: what is the plan to functionally test all of this in a continuous manner? Who will provide the hardware and systems? This is something we spent literally years trying to get companies to contribute and even now still struggle to get stable third-party CI systems integrated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are and will be some workloads that we don't handle perfectly. I can live with that.

Capacity ResourceList
// Allocatable represents the resources of a NUMA node that are available for scheduling.
// +optional
Allocatable ResourceList
// CPUSet represents the physical numbers of the CPU cores
// associated with this node.
// Example: 0-3 or 0,2,4,6
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarification: are these limited to the physical cpu ids, and if so is it planned to support scheduling onto their hyperthread siblings?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, we have heard from some users that they would like to consume a pair of hyperthreads from one physical core.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was logical core listing, we will need to let you know what the siblings are.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for simplicity imo we should bound affinity to the core. I question the cost/benefits to allocating to hyperthreads.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the 90% general usage, I think it is fine to ignore the sockets/cores/threads distinction and just treat all logical cpus as having equivalent "power". In the more niche use cases there's definitely cases where considering threads as specal will help. There are applications which are very latency sensitive and benefit from having their threads run on a pair of hyperthread siblings in order to maximise cache hits. Conversely there's applications which absolutely don't ever want to be on hyperthread siblings because they'll contend on the same shared resources.

That all said, I think it is possible to retrofit support for thread at a later date. It simply requires adding a new array attribute later listing the thread sibling IDs. So this simple list of logical CPU IDs is sufficient granularity for a first impl.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@berrange but why wait to address this use case? Most of the users we have talked to that need core pinning are the extreme low-latency people. Without this level of topology awareness the feature will not meet their needs and we'll still have users hacking around the problem.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is cpu in Capacity always guaranteed to be equal to the number of cores listed here? (If not, why not?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidopp -- the list of cpus should always match the equal number of cores in alloctable. if the kubelet itself is pegged to a particular core, its possible the capacity != allocatable.

// The values are expressed in the List Format syntax specified
// here: http://man7.org/linux/man-pages/man7/cpuset.7.html
CPUSet string

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the string syntax is user friendly, but not very machine friendly as all operations would have to convert to a bitmap to do any useful calculations. IOW, it'd be better to represent as something like '[]int' storing the CPU numbers in that node.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i assume we would have internal utilities to convert to []int

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you could provide utility functions to convert to a []int format, but IMHO it is preferable to store the data in a canonical format in the first place and avoid the need to convert it everywhere

}
```

## Node Configuration

### Isolating host processes

By default, load balancing is done across all CPUs, except those marked isolated
using the kernel boot time `isolcpus=` argument. When configuring a node to support
CPU and NUMA affinity, many operators may wish to isolate host processes to particular
cores.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, the isolcpus= kernel argument is not suitable for general purpose isolation of host OS processes from container/pod processes because it has two effects, only one of which is desirable. Aside from isolating the host processes, it also disables schedular load balancing from those isolated CPUs. If you have a container that is placed on isolated CPUs, the processes in that container will never move between available CPUs - they'll forever run on the first CPU on which they were placed, even if all other CPUs are idle.

The systemd CPUAffinity setting should pretty much always be used - isolcpus when running real-time processes on the dedicated CPUs where you genuiently don't want the kernel scheduler rebalancing CPUs/.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i raised the former because not all distros running kube were on systemd. agreed with above.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if a distro is not running systemd, they should never use isolcpus= unless the containers are intended to run real-time tasks where each individual process is manually pinned to a suitable host CPU. Non-systemd distros would have to configure cgroups cpuset controller in some custom manner to setup isolation.


It is recommended that operators set a CPU value for `--system-reserved`
in whole cores that aligns with the set of cpus that are made available to the default
kernel scheduling algorithm. If an operator is on a `systemd` managed platform, they
may choose instead to set the `CPUAffinity` value for the root slice to the set of CPU
cores that are reserved for the host processes.

**TODO**

1. how should `kubelet` discover the reserved `cpu-set` value?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could infer it based on the CPU mask it is running with itself. eg if you have CPUAffinity=0-3 on a 16 cpus system, then kubelet would see its affinity as 0-3, and so it can infer that 4-15 are the ones reserved for pod usage.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, i was hoping to avoid the need for additional flags, will think one this some.

1. in a numa system, `kubelet` reservation for memory needs to be removed from
a particular numa node capacity so numa node allocatable is as expected.

### Configuring Taints
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't understand the reason for the taints. I suspect there is some misunderstanding about the intention of taints/tolerations, but I'm not clear enough on why you are using them to be able to judge. Can you explain what problem(s) you are trying to solve?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problems:

  1. nodes that support dedicated cpu cores are special
  2. pods that require dedicated cpu cores are special
  3. needed a way to distinguish consumption of a dedicated core as distinct/special and did not want to change resource model

I basically chose to use taints/tolerations as a means of decorating the resource description and scheduling requirement.


The following `Taint` keys are defined to enable CPU pinning and NUMA awareness.

#### CPUAffinity

* Effect: `NoScheduleNoAdmitNoExecute`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might have to be represented as two taints here and elsewhere. See kubernetes/kubernetes#28082 for details.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@balajismaniam ack, was talking with @aveshagarwal about this in the background. in addition, just using NoScheduleNoAdmit would be sufficient if we drain nodes before applying taint.

* Potential values:
* `dedicated`

If `dedicated`, all pods that match this taint will require dedicated compute resources. Each
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (here and repeated below): s/match this taint/tolerate this taint

pod bound to this node must request CPU in whole cores. The CPU limit must equal the request.

#### NUMACPUAffinity

* Effect: `NoScheduleNoAdmitNoExecute`
* Potential values:
* `strict`

If `strict`, all pods that match this taint must request CPU (whole or fractional cores) that
fit a single NUMA node `cpu` allocatable.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels somewhat inflexible wrt potential future expansion. It would be better to express a desired number of node count. eg NUMANodes=N, where N is one or more, to indicate how many NUMA nodes it is willing / able to consume. eg an application inside a container may be capable of dividing its workload into 2 pieces, so could deal wtih being spread on 2 NUMA nodes, thus could request NUMANodes=1,2 to indicate its able to deal with 1 or 2 NUMA nodes, but not more than that.


#### NUMAMemoryPolicy

* Effect: `NoScheduleNoAdmitNoExecute`
* Potential values:
* `strict`
* `preferred`

If `strict`, all pods that match this taint must request `memory` that fits it's assigned
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/it's/its

NUMA node `memory` allocatable.

If `preferred`, all pods that match this taint are not required to have their `memory` request
fit it's assigned NUMA node `memory` allocatable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/it's/its


## Pod Specification

### API changes

The following API changes are proposed to the `PodSpec` to allow CPU and NUMA affinity to be defined.

```
// PodSpec is a description of a pod
type PodSpec struct {
...
// NodeName is a request to schedule this pod onto a specific node. If it is non-empty,
// the scheduler simply schedules this pod onto that node, assuming that it fits resource
// requirements.
// +optional
NodeName string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you list this here? Are you planning to somehow use the existing NodeName field for NUMA node scheduling? It wasn't clear to me from the comment (which appears to be identical to the existing NodeName comment).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i listed it here just to be clear that numa and cpu affined workloads are a triplet of node name + numa node + cpuset.

the only thing that is new is the numa node id and cpu affinity values.

// Identifies a NUMA node that affines the pod. If it is non-empty, the value must
// correspond to a particular NUMA node on the same node that the pod is scheduled against.
// This value is only set if either the `CPUAffinity` or `NUMACPUAffinity` tolerations
// are present on the pod.
// +optional
NUMANodeID string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've said before and I still assert - this is a terrible plan. Offering up this much control is going to kill off any chances we have of actually doing the right thing automatically for most users. This is an attractive nuisance (https://en.wikipedia.org/wiki/Attractive_nuisance_doctrine) - people are going to get hurt by it. If the borg experience counts for anything, the lesson is that saying "yes" to things like this turned out terribly, and is a drag on the system.

I proposed a less first-class alternative. Something like: annotate pods and run a highly-privileged agent on the node react to pods being scheduled by programming cpusets and calling numactl and whatever else it needs to do. Make this VERY explicit to use and DO NOT make it part of the API.

If we need to collect NUMA status and use something about NUMA as a hint to scheduling, I can MAYBE see this, maybe even a different scheduler enitrely.

If we need to sell "dedicated caches", then we should add that as resource. If I need NUMA because I want an LLC to myself, let me spec that.

If we need to sell "link to high-perf IO device", then we should add that as resource.

I'm trying to not overreact, but I have a visceral response to this.

// CPUAffinity controls the CPU affinity of the executed pod.
// If it is non-empty, the value must correspond to a particular set
// of CPU cores in the matching NUMA node on the machine that the pod is scheduled against.
// This value is only set if either the `CPUAffinity` or `NUMACPUAffinity` tolerations
// are present on the pod.
// The values are expressed in the List Format syntax specified here:
// here: http://man7.org/linux/man-pages/man7/cpuset.7.html
// +optional
CPUAffinity string

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is responsible for setting the NUMANodeID / CPUAffinity values ? Is that to be set by the end user directly, or to be set indirectly by the schedular when making placement decision. If the latter, then that's ok, but the former is not a good idea.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's intended to be set by a scheduling agent of some kind.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stupid question: are the identifiers used in this list global across the machine, or local per NUMA node? Is there any reason why you would specify CPUAffinity without specifying NUMANodeID? (I assume a scheduling agent would know both aspects of the topology.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the identifiers are global across the machine.

i actually debated not including NUMANodeID and just having CPUAffinity, and am open to discussion on that. If NUMANodeID is set CPUAffinity is either the 1-N cores on that node, or some subset.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^^^ this too.

```

### REST API changes

The `/pod/<pod-name>/bind` operation will allow updating the NUMA and CPU
affinity values. The same permissions required to schedule a pod to a
node in the cluster will be required to bind a pod to a particular NUMA node
and CPU set.

### Tolerations

Pods that require CPU and NUMA affinity prior to execution must set the
appropriate `Tolerations` for the associated taints.

### Multiple containers

If a pod has multiple containers, the set of containers must all fit
a specific NUMA node, and the set of affined CPUs are shared among containers.

Pod level cgroups are used to actually affine the container to the specified
CPU set.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a proposal or other reference on how pod-level cgroups will work and programs can understand their assignments?

We've had a long-standing discussion around how can a C++ program discover/understand it's resource boundaries so that it can operate properly within them. A classic third-party example is a Java JVM auto-discovering how many GC threads it can create and how much memory it's been allocated. It would be desirable to have a path for codebases that may have previously assumed they had to manage a whole machine's worth of resources and handle setting their own scheduler affinities today to be able to understand their partial-machine resource set so they can make their own proper suballocations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lcarstensen They could either feed in from ENV vars, or iirc there is a mount trick that is done to allow them see the world through a cgroup-i-fied lens.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lcarstensen -- recommend using the downward api in that context: kubernetes/website#591


## Resource Quota changes

Operators must be able to limit the consumption of dedicated CPU cores via quota.

## Kubelet changes

The `kubelet` will enforce the presence of the required pod tolerations assigned to the node.

The `kubelet` will pend the execution of any pod that is assigned to the node, but has
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for how the proposal solves this problem.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Kubelet reject the pod if the specified NUMANodeID and/or CPUAffinity are already in use by an existing pod?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it would reject if there was a collision in the case that you asked for dedicated cores, and were not getting them.

not populated the required fields for a particular toleration.

* If the toleration `CPUAffinity` is present on a `Pod`, the pod will not start
any associated container until the `Pod.Spec.CPUAffinity` is populated.
* If the toleration `NUMAAffinity` is present on a `Pod`, the pod will not start

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/NUMAAffinity/NUMACPUAffinity/

any associated container until the `Pod.Spec.NUMANodeID` is populated.

The delayed execution of the pod enables both a single and dual-phase scheduler to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by "single and dual-phase scheduler"? (And what do you mean by "delayed execution of the pod"?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems that there is some debate in the community on whether the cluster scheduler assigns pods to cpu cores and numa nodes, or if a node local agent schedules it. we had some of this debate at kubecon. when i mention single or dual phase scheduler, i mean either a scheduler that does machine+numa aware scheduling (single) or a pair of schedulers where one does machine assignment, and another does numa assignment. personally, i am biased to the single approach, but i got the sense there was some debate in this space.

place pods on a particular NUMA node and set of CPU cores.

## Required work:

1. pod level cgroup support roll-out
1. implement support for `NoScheduleNoAdmitNoExecute` taint effect
1. expose NUMA topology in cAdvisor
1. expose NUMA topology in node status
1. pod level cgroup support for enabling cpu set

## Future considerations

1. Author `NUMATopologyPredicate` in scheduler to enable NUMA aware scheduling.
1. Restrict vertical autoscaling of CPU and NUMA affined workloads.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this have to do with limiting dynamic logical core assignments, or specifically fixing the number of cores allocated for the lifetime of a pod? If the latter, what's the rationale for it? The default vertical auto-scaler (non-numa-aware) could emit impossible allocations (spanning sockets, fractional cores, etc) but you could also imagine a numa-aware vertical auto-scaler that can do the right thing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe i should phrase this differently. i mean to say that we should not just auto-scale these things ignorant of NUMA.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sgtm

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this proposal assumes that we run workloads on monolithic hardware.

Our use case is different.
We have servers with 1,2 or 4 NUMA nodes. They have different CPU models (different number of cores) and different NUMA topologies. For example:
Server 1 has topology:

NUMA node0 CPU(s):     0-9,20-29
NUMA node1 CPU(s):     10-19,30-39

Server 2 has topology:

NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39

Consider a scenario:

We want to run workloads that perform best if they use all CPUs on one NUMA node.
If the server has 4 NUMA nodes, we will run 4 workloads on them.
It will be difficult to run this scenario with the proposed implementation.

Can we try a different approach?

Add a capability to specify CPUset to Opaque Integer Resources. One Opaque Integer Resource could be predefined (lets call it "numa"). When a user requests OIR numa 1, the scheduler would specify CPUset 0-9,20-29 on server-1 from my example, and 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40 on server-2.

This approach would also solve more advanced use cases. Lets say we want to run a workload that is pinned to one CPU that is close to a specific PCI device. On one server this CPU may be located on numa0, on another on numa1. In this case a user defines a new OIR (lets call it "fast-pci") and specifies CPUset for this OIR. It would be different for different servers. The schedule would pin the workload to the CPU close to the PCI on each server.

I assume that the OIR definition would be performed by a Node-Feature-Discovery pod (as defined in kubernetes/kubernetes#10570 (comment)).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakub-d -- i don't think numa should be handled via OIR, but I see value in letting you have numa node capacity as a counted resource on node. I will think about how to handle that in this proposal. I think the major tension on the proposal is just the presence of NUMANodeID/CPUAffinity on the pod declaration which I think I would still want even with your proposed model. Thanks for the scenario feedback!