Add CPU manager proposal. #654
Merged
Commits (15)
ae38e21 Added initial draft of CPU manager proposal. (ConnorDoyle)
a21b6ad Added cache allocation implementation phase. (ConnorDoyle)
b961e4e Fixed typos. (ConnorDoyle)
a6afac8 Updated code snippets to match PoC branch. (ConnorDoyle)
79a2bb5 Minor formatting fix. (ConnorDoyle)
ca32930 Fixed typos, described sched_relax_domain_level. (ConnorDoyle)
8714cae Added CPU Manager block diagram. (ConnorDoyle)
6eece46 Updated topo discovery section with decision. (ConnorDoyle)
63d8db1 Tied up loose ends in proposal. (ConnorDoyle)
694d2f4 Updated CPU manager configuration details. (ConnorDoyle)
8c98636 Fixed review comments from @balajismaniam. (ConnorDoyle)
3dfe261 s/floor/ceiling, recommend integer allocatable.cpu (ConnorDoyle)
a04d7eb Merge pull request #1 from ConnorDoyle/cpu-manager-proposal-config (ConnorDoyle)
5780b48 Updated staticPolicy.Start sketch. (ConnorDoyle)
6bc03ac Fix review comments from @derekwaynecarr. (ConnorDoyle)
Changes from commit 6eece46e694e70eb455c8ceee37b0f623fed1946: Updated topo discovery section with decision.
What should happen if there are enough CPUs on a node, but they're far away from each other? (I'm requesting 2 CPUs, but the ones left are on different cores.) For some workloads, admitting such a Pod may be acceptable; for others, not. Would it be a good idea to make this behaviour configurable?
For now, our plan is to have the first policy (static) provide the "best" allocation it can when assigning dedicated cores. So, at higher Guaranteed (G) pod density, it would be possible to observe the topology fragmentation you described.
If it were configurable in the policy, do you have ideas about how you would like that to look (name for the knob, values, defaults, etc.)?
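To illustrate what a "best" dense allocation could mean here, the following is a minimal, hypothetical Go sketch; the types and function names are invented for this example and are not from the actual CPU manager code. It prefers fully free physical cores before falling back to stray free CPUs:

```go
package main

import (
	"fmt"
	"sort"
)

// topology maps physical core ID -> logical CPU IDs (hyperthread siblings).
type topology map[int][]int

// takeDense picks n logical CPUs, preferring to fill whole physical cores
// first so that a container's CPUs share as few cores as possible.
func takeDense(topo topology, free map[int]bool, n int) ([]int, bool) {
	// Visit cores in a stable order.
	cores := make([]int, 0, len(topo))
	for c := range topo {
		cores = append(cores, c)
	}
	sort.Ints(cores)

	var picked []int
	// First pass: take cores whose siblings are all still available.
	for _, c := range cores {
		allFree := true
		for _, cpu := range topo[c] {
			if !free[cpu] {
				allFree = false
				break
			}
		}
		if allFree && len(picked)+len(topo[c]) <= n {
			picked = append(picked, topo[c]...)
		}
	}
	// Second pass: fill the remainder with any free CPUs.
	for _, c := range cores {
		for _, cpu := range topo[c] {
			if len(picked) == n {
				return picked, true
			}
			if free[cpu] && !contains(picked, cpu) {
				picked = append(picked, cpu)
			}
		}
	}
	return picked, len(picked) == n
}

func contains(s []int, v int) bool {
	for _, x := range s {
		if x == v {
			return true
		}
	}
	return false
}

func main() {
	// Two cores with two hyperthreads each; CPU 1 is already taken.
	topo := topology{0: {0, 1}, 1: {2, 3}}
	free := map[int]bool{0: true, 1: false, 2: true, 3: true}
	cpus, ok := takeDense(topo, free, 2)
	fmt.Println(cpus, ok) // prefers core 1, which is fully free
}
```

A real implementation would also have to consider sockets and NUMA nodes; this sketch only models physical cores and their hyperthread siblings.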
The more I think about it, the more I realize how complicated things actually are =/
Generally my perception is that by default the CPUs of a single container inside a Pod should be as close to each other as possible, while the CPUs of different containers inside a Pod should be as far from each other as possible. The only case I can imagine where I would want to pack as many dedicated containers onto a single socket as possible is when I want to launch multiple small (1-CPU) containers and then launch a single giant one that asks for half the CPUs of a core.
For a container, if we want to allow control over how close the CPUs are, I imagine it might be a two-part parameter in the pod spec: a `corePolicy` with values `dense`/`separate` plus a `strict` flag, with `dense` and `false` as defaults. `separate` would mean attempting to spread CPUs as much as possible, and `dense` would mean packing them as close as possible. `strict: false` would mean the workload can tolerate the requirements not being met, while `strict: true` would mean the container has to be rescheduled.
And then there is a whole new level with HT enabled. I can imagine a `coreThreadPolicy` with values `dense`/`separate`/`separate_restrict`, again with `dense` and `false` as defaults. `dense` here would mean allowing a container multiple sibling CPUs; `separate` would mean attempting to spread CPUs across non-siblings; `separate_restrict` would mean spreading CPUs across non-siblings and reserving the siblings, so that they are not assigned to anyone.
I'm afraid that I might be over-thinking the whole problem =)
As a small disclaimer, I'd like to mention that I come from OpenStack, so my perception might be tainted by how OS does it =) Here is a link to a doc that describes how OS approaches similar problems.
(cc @ashish-billore)
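The `corePolicy` / `coreThreadPolicy` shape described above might look roughly like this in a pod spec. These fields are purely hypothetical, reconstructed from the discussion; none of them exist in the Kubernetes API:

```yaml
# Hypothetical fields only; illustrating the discussion, not a real API.
spec:
  containers:
  - name: worker
    resources:
      limits:
        cpu: "2"
    corePolicy:
      policy: dense        # dense | separate; dense as the suggested default
      strict: false        # false: tolerate unmet placement; true: reschedule
    coreThreadPolicy:
      policy: dense        # dense | separate | separate_restrict
      strict: false
```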
@teferi I like the idea of `corePolicy`. As for `coreThreadPolicy`: isn't it a little bit too much? I.e., not all environments are homogeneous; some nodes may not have HT enabled (probably not many). Besides, `separate_restrict` can be accomplished by requesting e.g. 2 cores. If the node has HT and … then it would result in 2 CPUs on 1 physical core.
@flyingcougar `separate_restrict` is more about not allowing others to use a core's sibling. It would probably be very difficult to implement, since k8s would still think that we have those cores available for scheduling. So anyway, it was probably a weird idea.
`corePolicy` is about packing the CPUs as tightly as possible on cores in the same socket, while `coreThreadPolicy` is about packing CPUs as tightly as possible within one core. So IMO both of those have their uses. I think I need to draw a couple of diagrams to illustrate these ideas...
cc @fabiand I guess you should participate in ^^ part of the discussion
Yes :)
I'm not an expert on this domain, but …
It's pretty clear that we (kubevirt) need the 'strict dense placement' option as a first step.
I actually wonder if additional differentiations are needed: either you care, and you require an optimal placement, or you don't, and you get best effort.
Some workloads, like kubevirt (again), might want to manage parts of NUMA nodes themselves, in a more fine-granular fashion. In those cases it might make sense to be able to tell the kubelet that some nodes, sockets, or cores should be reserved for, and exclusive to, a specific container. A process can then take those reserved parts and manage them as needed (i.e. in our case, libvirt would take care of managing those nodes).
Long story short, we might want a mechanism to lend nodes to a process.
Another thing is that in the future we will need (and thus should already think about) combining this placement with ResourceClasses #782, because in reality you want a process close to the device it uses, to avoid NUMA-node remote memory access.
Closely related to this is also the relationship to huge pages.
I do not think that we need to solve all of this now, but we should understand the picture, so as not to create something that will contradict what we see coming up.
@berrange @mpolednik thoughts?
@fabiand there's exactly that topic on the resource management workgroup agenda for this morning. Besides the device-binding logic and the hugetlb controller settings, CNI plugins could also take topology into account on a multi-socket system with multiple network interfaces. I hope we can reach agreement that a solution is needed to unify affinity somehow in the v1.9 or v1.10 releases.
On the topic of the CPU manager itself, the policies operators want can get really complicated. Hopefully we can get a good default scoped and implemented first, and then extend with more advanced policies. In the resource management face-to-face where this component was conceived, we had a pretty long discussion about implementing a more flexible policy driven by explicit pool configuration. Maybe we could think of ways to represent the behaviors needed for the advanced cases in the config domain. In my mind, a rich configurable policy may buy enough flexibility that we don't need another plugin system for external policies. At the same time, for v1.8 at least, changes to the pod spec API are probably out of scope.
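As a thought experiment on that explicit pool-configuration idea, a kubelet-side config might look something like the sketch below. The field names and structure here are entirely invented for illustration; this is not a real kubelet API:

```yaml
# Hypothetical kubelet CPU-pool configuration sketch; not a real kubelet API.
cpuManagerPolicy: pools
cpuPools:
- name: reserved          # system daemons and the kubelet itself
  cpus: "0,1"
- name: dedicated         # exclusive cores for Guaranteed pods
  cpus: "2-7"
  exclusive: true
- name: shared            # everything else; CFS shares apply
  cpus: "8-15"
```

The appeal of a scheme like this is that advanced placement behaviors become operator configuration rather than pod-spec API changes, which matches the constraint that pod spec changes were out of scope for v1.8.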