Add New Resource API proposal #782
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://github.com/kubernetes/kubernetes/wiki/CLA-FAQ to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
// +patchMergeKey=key
// +patchStrategy=merge
Key string
// Example 0.1, intel etc
What is this?
These are examples of `Values`. For example, if `Key` is 'version', `Values` can have '0.1'; if `Key` is 'vendor', `Values` can have 'intel'. I will make it clearer.
ResourceSelectorOpExists ResourceSelectorOperator = "Exists"
ResourceSelectorOpDoesNotExist ResourceSelectorOperator = "DoesNotExist"
ResourceSelectorOpGt ResourceSelectorOperator = "Gt"
ResourceSelectorOpLt ResourceSelectorOperator = "Lt"
You mentioned `Equal To` on line 43 but no `Eq` operator here.
Will correct line 43 to use "In" in place of "Equal to".
I think using `In` for `Eq` causes confusion.
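To make the operator discussion concrete, here is a minimal Go sketch of how a single match expression might be evaluated against a device's key-value properties. Only the operator names come from the quoted diff and the comments above; the `Requirement` type, the `Matches` method, and the integer-only handling of `Gt`/`Lt` are assumptions for illustration, not the proposal's implementation. Note how equality has to be spelled as `In` with a single value, which is the source of the confusion raised above.

```go
package main

import (
	"fmt"
	"strconv"
)

// ResourceSelectorOperator mirrors the operator type quoted in the diff.
type ResourceSelectorOperator string

const (
	OpIn           ResourceSelectorOperator = "In"
	OpExists       ResourceSelectorOperator = "Exists"
	OpDoesNotExist ResourceSelectorOperator = "DoesNotExist"
	OpGt           ResourceSelectorOperator = "Gt"
	OpLt           ResourceSelectorOperator = "Lt"
)

// Requirement is a hypothetical (key, operator, values) triple.
type Requirement struct {
	Key      string
	Operator ResourceSelectorOperator
	Values   []string
}

// Matches reports whether a device's properties satisfy the requirement.
// Assumption: Gt and Lt compare plain integers; quantities such as "1G"
// would need a real quantity parser.
func (r Requirement) Matches(props map[string]string) bool {
	got, present := props[r.Key]
	switch r.Operator {
	case OpExists:
		return present
	case OpDoesNotExist:
		return !present
	case OpIn:
		if !present {
			return false
		}
		for _, v := range r.Values {
			if v == got {
				return true
			}
		}
		return false
	case OpGt, OpLt:
		if !present || len(r.Values) != 1 {
			return false
		}
		have, err1 := strconv.ParseInt(got, 10, 64)
		want, err2 := strconv.ParseInt(r.Values[0], 10, 64)
		if err1 != nil || err2 != nil {
			return false
		}
		if r.Operator == OpGt {
			return have > want
		}
		return have < want
	}
	return false
}

func main() {
	device := map[string]string{"vendor": "intel", "version": "2"}
	// "Equal to" expressed as In with exactly one value.
	eq := Requirement{Key: "vendor", Operator: OpIn, Values: []string{"intel"}}
	gt := Requirement{Key: "version", Operator: OpGt, Values: []string{"1"}}
	fmt.Println(eq.Matches(device), gt.Matches(device))
}
```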
* User can view the current usage/availability details about the resource class using kubectl.

### User story
Admin knows what all devices are present on nodes and have deployed corresponding device plugins. Device plugins will make devices appear in node status. Next admin creates resource classes that have generic/portable names and metadata which can select available devices.
Slightly rephrased this paragraph.
The administrator has deployed device plugins to support hardware present in the cluster. The Kubelet will update its status indicating the presence of this hardware via those device plugins. To offer this hardware to applications deployed on Kubernetes in a portable way, the administrator creates a number of resource classes to represent that hardware. These resource classes will include metadata about the device and selection criteria.
Thanks, will update!
5. The user deletes the pod or the pod terminates
6. Kubelet reads pod object annotation for devices consumed and calls `Deallocate` on the matching Device Plugins

The scheduler is incharge of both, selecting a node and also selecting a device for requested resource classes.
In addition to node selection, the scheduler is also responsible for selecting a device that matches the resource class requested by the user.
* Select device at pod admission while applying predicates and change all api interfaces that are required to pass selected device to container runtime manager.
* Create resource consumption state again at container runtime manager and select device.

None of the above approach seems cleaner than doing device selection at scheduler.
Is it worth mentioning here that this decision helps to retain a clean abstraction between the container runtime and kubernetes? ISTR that was a portion of the discussion.

## Future Scope
* RBAC: It can further be explored that how to tie resource classes with RBAC like any other existing API resource objects.
* Nested Resource Classes: In future device plugins and resource classes can be extended to support the nested resource class functionality where one resource class could be comprised of a group of sub-resource classes. For example 'numa-node' resource class comprised of sub-resource classes, 'single-core'.
We need to communicate the plans/decisions on how ResourceClasses relate to Opaque Integer Resources.
Existing resource discovery tools that update node objects with OIR will adapt to update node status `devAllocatable` with devices instead. Will add more details.
a3d30c3 to c7f5820
@vikaschoudhary16 I don't see any mention of overlapping resources. Is that something you plan to address in another PR?
@RenaudWasTaken It is explained implicitly, by describing how a resource class can select a device with k-v metadata and how a resource class can select different devices. Overlapping is essentially the functionality that provides portability.
I'm also interested in how you solved selecting multiple overlapping resource classes, because it is not a trivial problem.
In this example, if your selection algorithm is a first fit, then it is very possible that you won't be able to satisfy the request, because you might give the GPU with 8G to the first container. It seems to me that your only solution is to generate all the possible permutations, but that doesn't scale well... What do you think?
Thanks, I missed adding these details though I had them in mind. This proposal implements a first-fit selection process. For better resource usage and scalability, the selection algorithm will be optimized in the future along the lines you mentioned.
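The first-fit concern raised above can be reproduced with a small Go sketch. Everything here is hypothetical (the `Device` and `Class` types, the `firstFit` function, and a resource class reduced to a single minimum-memory rule); it only illustrates why request order matters when resource classes overlap, and is not the selection code from the proposal.

```go
package main

import "fmt"

// Device is a hypothetical stand-in for a device advertised in node status.
type Device struct {
	Name  string
	MemGB int
}

// Class is a hypothetical resource class reduced to one "minimum memory" rule.
type Class struct {
	Name     string
	MinMemGB int
}

// firstFit gives each requested class the first free device that matches it,
// scanning devices in the order given. It returns the class-to-device
// assignment and whether every request was satisfied.
func firstFit(devices []Device, requests []Class) (map[string]string, bool) {
	used := make(map[string]bool)
	assigned := make(map[string]string)
	for _, c := range requests {
		found := false
		for _, d := range devices {
			if !used[d.Name] && d.MemGB >= c.MinMemGB {
				used[d.Name] = true
				assigned[c.Name] = d.Name
				found = true
				break
			}
		}
		if !found {
			return assigned, false
		}
	}
	return assigned, true
}

func main() {
	devices := []Device{{"gpu-8g", 8}, {"gpu-4g", 4}}
	anyGPU := Class{"any-gpu", 0}   // overlaps with high-mem on gpu-8g
	highMem := Class{"high-mem", 8} // only gpu-8g qualifies

	// any-gpu grabs gpu-8g first, so high-mem cannot be satisfied.
	_, ok := firstFit(devices, []Class{anyGPU, highMem})
	fmt.Println(ok)

	// Reversing the request order succeeds with the same devices.
	_, ok = firstFit(devices, []Class{highMem, anyGPU})
	fmt.Println(ok)
}
```

A best-fit or backtracking strategy would avoid the failing case, at the cost of exploring more combinations, which is the scalability trade-off discussed in the comments.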
name: nvidia.high.mem
spec:
resourceSelector:
-
Is this a superfluous line?
@fabiand this is YAML syntax for nested sequences: https://learn.getgrav.org/advanced/yaml#sequences
Oh, obviously …
Would pulling `matchExpressions` into line 89 look saner?
`- matchExpressions:`
Ah, sure, will update. Thanks!

## Motivation
Compute resources in Kubernetes are represented as a key-value map with the key
being a string and the value being a 'Quantity' which can (optionally) be
Are you referring to OIRs here?
@fabiand yes.
resourceSelector:
-
matchExpressions:
-
Maybe something missing here as well?
@fabiand same as above.
A couple of drive-by comments:
@saad-ali
This proposal's scope is bounded by the device plugin proposal's scope, which covers devices only and not resources such as CPU and memory.
In the scheduler crash case, scheduling of any unscheduled pods that request the deleted resource class will fail at predicate validation. Similarly, the predicate will also fail at the kubelet, because for any new pod the kubelet creates resource consumption state from the beginning.
- "1G"
```
Above resource class will select all the hugepages with size greater than
equal to 1 GB.
s/greater than equal to/greater than
values:
- "nic"
key: "speed"
operator: "In"
Did you mean either `Eq` or `Gt` instead of `In`?
- "40GBPS"
```
Above resource class will select all the NICs with speed greater than equal to
40 GBPS.
This should change based on the change above.
2. Iterate over all existing nodes in cache to figure out if there are devices
on these nodes which are selectable by resource class. If found, update the
resource class availability status in local cache.
3. Patch the status of resource class api object with availability state in locyy
s/locyy/local
In addition to node selection, the scheduler is also responsible for selecting a
device that matches the resource class requested by the user.

### Reason for not preferring device selection at kubelet
Maybe change this to `Preferring device selection in the scheduler (instead of the Kubelet)`.
device that matches the resource class requested by the user.

### Reason for not preferring device selection at kubelet
Kubelet does not maintain any cache. Therefore to know the availability of a device,
s/to know the availability of a device/to know the quantity of the device available for scheduling/
### Reason for not preferring device selection at kubelet
Kubelet does not maintain any cache. Therefore to know the availability of a device,
will have to calculate current total consumption by iterating over all the admitted
pods running on the node. This is already done today while running predicates for
Maybe rephrase as `the quantity of the device already consumed should be calculated by iterating`.
consumption state that is created at runtime for each pod, are exactly same,
current api interfaces does not allow to pass selected device to container manager
(where actually device plugin will be invoked from). This problem occurs because
devices are determined internally from resource classes while other resource
Is this not opposite (i.e., resource classes are determined from devices)?
From the events perspective, handling for the following events will be added/updated:

### Resource Class Creation
1. Init and add resource class info into local cache
Nit: s/Init/Initialize
## Opaque Integer Resources
This API will supercede the [Opaque Integer Resources](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#opaque-integer-resources-alpha-feature)
(OIR). External agents can continue to attach additional 'opaque' resources to
nodes, but the special naming scheme that is part of the current OIR approach
s/but the special naming scheme that is part of the current OIR approach will no longer be necessary/using device plugins
@balajismaniam
Not necessarily device plugins; OIRs could also be used the same way they are used today (without device plugins). Using OIRs with device plugins would be the case where a plugin has not adopted device advertisement as per the device plugin proposal and uses OIRs to advertise resources.
Got it. Thanks.
plugins.

## Opaque Integer Resources
This API will supercede the [Opaque Integer Resources](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#opaque-integer-resources-alpha-feature)
Nit: s/supercede/supersede
*Resource Class* is a new type, objects of which provides abstraction over
[Devices](https://github.com/RenaudWasTaken/community/blob/a7762d8fa80b9a805dbaa7deb510e95128905148/contributors/design-proposals/device-plugin.md#resourcetype).
A *Resource Class* object selects devices using `matchExpressions`, a list of
(operator, key, value). A *Resource Class* object selects a device if atleast
s/atleast/at least
In addition to node selection, the scheduler is also responsible for selecting a
device that matches the resource class requested by the user.

### Reason for not preferring device selection at kubelet
This section was a little confusing to read (what exactly does device selection mean?) -- maybe a concrete example would help?
Was the option considered where the scheduler is responsible for mapping resource classes from the container spec to resource names from the node capacityV2, but assigning specific devices is left to the Kubelet? IIRC a reason to delay device binding to Kubelet was to avoid publishing hardware topology to the API server.
@ConnorDoyle
I have updated several initial sections of the doc to make it clearer. Device selection means selecting a device using the resource class details, which includes applying different operators as explained in the 'Resource Class' section of this document.
> scheduler is responsible for mapping resource classes from the container spec to resource names from the node capacityV2

Yes, and this section notes down the challenges in that approach. I have updated this section to make it clearer. Hope it is more understandable now.
> Kubelet was to avoid publishing hardware topology to the API server.

This proposal assumes that device details are updated in node status by vendor device plugins, as proposed in the device plugin proposal.
1. Get the requested resource class name and quantity from pod spec.
2. Select nodes by applying predicates according to requested quantity and Resource
class's state present in the cache.
3. On the selected node, select a Device from the stored devices info in cache
Would it be better to say something like "concrete resource" here instead of "Device"? There's no limitation that resource classes can only represent devices right?
Yes, there is a limitation. A resource class can represent only the devices that are advertised by device plugins in the Device structure format in the node status.
@vikaschoudhary16 But devices may have the same name (device-id) across nodes:
- Node1 has dev-1, which satisfies ResourceClassA
- Node2 has dev-1, which also satisfies ResourceClassA

How do we cache the devices info in ResourceClassA? Use this pattern: `Node1-dev-1` and `Node1-dev-1`?
A deviceinfo structure will be maintained in schedulercache, so each device will have a list of all the resource classes it satisfies.
Take a look: https://github.com/vikaschoudhary16/kubernetes/pull/3/files#diff-558cb8bde14dca10a3151bfc222a3aae
be addressed by [device plugin proposal](https://github.com/kubernetes/community/pull/695/files)

## Resource Class
*Resource Class* is a new type, objects of which provides abstraction over
s/provides/provide

1. A user submits a pod spec requesting 'X' resource classes.
2. The scheduler filters the nodes which do not match the resource requests.
3. scheduler selects a device for each resource class requested and annotates
> annotates the pod object

Do you have an example of what the annotation might look like?
@ConnorDoyle
sure, will update. Thanks!
f6d6c95 to 53c2a80
Thanks @ConnorDoyle @saad-ali @balajismaniam @RenaudWasTaken @fabiand for the review comments.
extended to support the nested resource class functionality where one resource
class could be comprised of a group of sub-resource classes. For example 'numa-node'
resource class comprised of sub-resource classes, 'single-core'.
* Multiple device selection algorithms, each with a different selection strategy,
I think this should be thoroughly discussed and not just a "side note" on the bottom of the design doc.
Maybe a sig-scheduling discussion ?
@RenaudWasTaken totally agree. The scheduler section is also updated to describe the "first fit" approach.
The example use case you quoted can be handled by a "best fit" approach, which I planned to cover in a follow-up proposal. Meanwhile, users who don't want "first fit" can request devices using OIRs and bypass resource classes. That's how I am thinking about it. If the community thinks differently, I am happy to discuss and adapt the proposal accordingly.
/sig scheduling
I will prioritize reviewing this further when device plugins are agreed upon.
`scheduler.alpha.kubernetes.io/resClass_test-res-class_nvidia-tesla-gpu=4`
where `scheduler.alpha.kubernetes.io/resClass` is the common prefix for all the
device annotations, `tes-res-class` is resource class name,
`nvidia-tesla-gpu` is the selected device name and `4` is the quantity requested.
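The annotation layout described in the diff (common prefix, resource class name, device name, with the quantity as the value) could be built and parsed roughly as follows. This is a sketch under assumptions: the helper names `deviceAnnotationKey` and `parseDeviceAnnotationKey` are hypothetical, and it assumes resource class names never contain an underscore, since `_` is used as the separator.

```go
package main

import (
	"fmt"
	"strings"
)

// annotationPrefix is the common prefix quoted in the proposal text.
const annotationPrefix = "scheduler.alpha.kubernetes.io/resClass"

// deviceAnnotationKey builds "<prefix>_<class>_<device>".
func deviceAnnotationKey(class, device string) string {
	return annotationPrefix + "_" + class + "_" + device
}

// parseDeviceAnnotationKey splits a key back into class and device.
// Assumption: the class name contains no '_' characters.
func parseDeviceAnnotationKey(key string) (class, device string, ok bool) {
	if !strings.HasPrefix(key, annotationPrefix+"_") {
		return "", "", false
	}
	rest := strings.TrimPrefix(key, annotationPrefix+"_")
	parts := strings.SplitN(rest, "_", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", false
	}
	return parts[0], parts[1], true
}

func main() {
	key := deviceAnnotationKey("test-res-class", "nvidia-tesla-gpu")
	annotations := map[string]string{key: "4"} // value is the requested quantity
	for k, qty := range annotations {
		class, device, ok := parseDeviceAnnotationKey(k)
		fmt.Println(class, device, qty, ok)
	}
}
```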
@vikaschoudhary16 OK, I see here is how the user would request an amount of one of these resources. I still think that combining the device with the resource class in the selector is going to be problematic, though.
Let me ask something... do you envision the user knowing about the devices on nodes that are providing a particular resource class? Put another way... would (should?) the proposed end user that is constructing the pod spec and specifying the resource requirement selector `scheduler.alpha.kubernetes.io/resClass_test-res-class_nvidia-tesla-gpu=4` actually know that the "test-res-class" resources were being provided by the "nvidia-tesla-gpu" device(s)?
My guess is that, often, the end user won't know about the underlying device that is providing some amount of abstracted resources. The deployer of the cloud infrastructure knows that information, of course. But the end user that is consuming some of those resources won't necessarily know the exact vendor/device information and deployers may not actually want the end user to know the device/vendor/model specifics :-)
To respond to your comment from above that we should think of resource classes in the same vein as EC2 instance types, I'd just comment that Amazon has complete and total control over the information it gives users with regards to the amount of resources and the types of capabilities that its instance types comprise. Amazon can change (and has changed in the past) its mind about the quantities of resources and quality/vendor models associated with a particular instance type. If Kubernetes continues to be agnostic to cloud infrastructure providers, I think two things are necessary:
- The way that Kubernetes advertises resources and capabilities to end users should hide underlying device implementation details as much as possible. This includes hiding vendor information as much as possible.
- All concepts that provide a grouping/coupling mechanism between qualitative and quantitative things should be possible to specify as their individual quantitative and qualitative components.
The second point probably deserves a little more explanation. What I mean is that if you are going to expose a concept like a ResourceClass object -- something that allows the deployer to describe a collection of consumed resource amounts as well as capabilities that describe one or more providers of those consumable resources -- then the end user should be able to request a pod consumes those coupled resources and lands on a node with those capabilities without using the coupled object.
In other words, if you have a ResourceClass that looks like this:
    kind: ResourceClass
    metadata:
      name: gpu.high.mem
    spec:
      resourceSelector:
      - matchExpressions:
        - key: "gpu.vendor"
          operator: "In"
          values:
          - "nvidia"
          - "intel"
        - key: "gpu.memory"
          operator: "GtEq"
          values:
          - "4G"
the end user should be able to request a pod where a container in the pod needs 1 `gpu.high.mem` OR the end user should be able to request a pod where containers in the pod need 4G of resource type `gpu.memory` and the "gpu.vendor" annotation/selector is "intel" instead of either "intel" OR "nvidia". That's what I mean about breaking the coupling down into its finest-grained representation and allowing the end user to specify that fine-grained request.
Hope that makes sense! I recognize that sometimes, the terminology I use is overlapping and confusing, so I apologize in advance about that. I'm trying to bridge the terminology differences between the OpenStack infrastructure representation of these things with the proposed Kubernetes representation of similar ideas.
Best,
-jay
> The way that Kubernetes advertises resources and capabilities to end users should hide underlying device implementation details as much as possible. This includes hiding vendor information as much as possible.

That's up to the deployer. The deployer is free not to use any keys in the resource class that the user should not know about. The proposal does not make it mandatory to create resource classes with any vendor-specific details.
> All concepts that provide a grouping/coupling mechanism between qualitative and quantitative things should be possible to specify as their individual quantitative and qualitative components.

I think I understand your point. The problem with this approach is that it becomes an identity mapping. Resource classes aim to create broader abstractions in which arbitrary ranges for resource properties can also be supported. There are two main problems:
- Portability is lost: With the current approach, a resource class is an allocatable unit. Although it provides a broader abstraction, its consumed, capacity and remaining units are still countable. This way the admin knows what quota is being offered. If the user is left free to choose the range of a resource property, like gpu.memory gt 30, one cluster may support it but another may not. If resource classes are treated as a single allocated unit with non-mutated properties, it is easier for admins to control their availability as a resource across clusters.
- We can't expect the end user to know that much about device properties. There would be many more chances of misconfiguration.
Thoughts?
Hi, I'm interested in the mapping between resource classes and devices; I think it is the key point of this proposal.
cluster. Device plugins, running on nodes, will update node status indicating
the presence of this hardware. To offer this hardware to applications deployed
on kubernetes in a portable way, the administrator creates a number of resource
classes to represent that hardware. These resource classes will include metadata
Thanks for your explanation.
the presence of this hardware. To offer this hardware to applications deployed
on kubernetes in a portable way, the administrator creates a number of resource
classes to represent that hardware. These resource classes will include metadata
about the devices as selection criteria.
Can you give an example of how resource classes know which devices are selectable? AFAIK, as mentioned above, the administrator creates a resource class without any device or node info, so who is responsible for doing the mapping (ResourceClass --> Devices)?
The scheduler will watch the apiserver for resource class object creation, and the scheduler already has nodeinfo in its cache. For each kind of device, a deviceinfo structure will be instantiated, and the list of deviceinfo will reside in nodeinfo. At each resource class creation, the scheduler will iterate over the deviceinfo list and, if the resource class matches a device, that deviceinfo's list of resource class references is updated.
For more details take a look at the PoC: https://github.com/vikaschoudhary16/kubernetes/pull/3/files#diff-558cb8bde14dca10a3151bfc222a3aae
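The cache update described in this reply might look roughly like the Go sketch below. It is not taken from the linked PoC: `Cache`, `NodeInfo`, `DeviceInfo`, `ResourceClass`, and `AddResourceClass` are hypothetical names, and matching is simplified to exact key-value equality. Because devices hang off their node's entry, two nodes can both expose a device named `dev-1` without a naming collision, which also bears on the earlier question about identical device IDs across nodes.

```go
package main

import "fmt"

// DeviceInfo is a hypothetical per-device cache entry.
type DeviceInfo struct {
	Name       string
	Properties map[string]string
	Classes    []string // names of resource classes that select this device
}

// NodeInfo holds the devices advertised by one node.
type NodeInfo struct {
	Devices []*DeviceInfo
}

// ResourceClass is simplified to exact-match key-value requirements.
type ResourceClass struct {
	Name     string
	Required map[string]string
}

// selects reports whether every required key-value pair is present on the device.
func (rc ResourceClass) selects(d *DeviceInfo) bool {
	for k, want := range rc.Required {
		if got, ok := d.Properties[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// Cache maps node name to node info, loosely modelled on a scheduler cache.
type Cache struct {
	Nodes map[string]*NodeInfo
}

// AddResourceClass walks every cached device and records the class on each
// device it selects. It returns the number of devices matched.
func (c *Cache) AddResourceClass(rc ResourceClass) int {
	matched := 0
	for _, node := range c.Nodes {
		for _, d := range node.Devices {
			if rc.selects(d) {
				d.Classes = append(d.Classes, rc.Name)
				matched++
			}
		}
	}
	return matched
}

func main() {
	cache := &Cache{Nodes: map[string]*NodeInfo{
		"node1": {Devices: []*DeviceInfo{{Name: "dev-1", Properties: map[string]string{"vendor": "nvidia"}}}},
		"node2": {Devices: []*DeviceInfo{{Name: "dev-1", Properties: map[string]string{"vendor": "intel"}}}},
	}}
	n := cache.AddResourceClass(ResourceClass{Name: "nvidia.gpu", Required: map[string]string{"vendor": "nvidia"}})
	fmt.Println(n, cache.Nodes["node1"].Devices[0].Classes, cache.Nodes["node2"].Devices[0].Classes)
}
```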
1. Initialize and add resource class info into local cache
2. Iterate over all existing nodes in cache to figure out if there are devices
on these nodes which are selectable by resource class. If found, update the
resource class availability status in local cache.
@vikaschoudhary16 I'm very interested in the details about this. As commented above, how does a Resource Class know which devices are selectable?
Does the above reply answer this question?
Yes, thanks!
on the matching Device Plugins

In addition to node selection, the scheduler is also responsible for selecting a
device that matches the resource class requested by the user.
I'm wondering how the scheduler knows which device matches the resource class?
Sorry, now that the device plugin proposal has been merged, I need to update this proposal for the remaining device plugin enhancements needed for resource classes.
In the ListWatch() response, the device plugin will also send device properties in an arbitrary key-value map. Using this, a Device API object will be created by the device manager. The scheduler will keep watching the API server for any new Device object creation and keep syncing the device info in its cache.
Thanks, it sounds like this depends on the device properties exposed by the device plugin. So we can't support any `matchExpressions` that cannot be understood in terms of those device properties, am I right?
Right. `matchExpressions` must use a subset of the properties exposed by the device plugin.
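To make that constraint concrete, here is a rough sketch of evaluating match expressions against the key-value properties a device plugin reports. The `Requirement` type, the `Matches` function, and the integer-only handling of `Gt`/`Lt` are assumptions for illustration, not the proposal's API; the operator names follow the ones quoted in the diff and the `In` operator used in the YAML examples.

```go
package main

import (
	"fmt"
	"strconv"
)

// Operator mirrors the selector operator names quoted in the proposal.
type Operator string

const (
	OpIn           Operator = "In"
	OpExists       Operator = "Exists"
	OpDoesNotExist Operator = "DoesNotExist"
	OpGt           Operator = "Gt"
	OpLt           Operator = "Lt"
)

// Requirement is a hypothetical, simplified form of one matchExpressions entry.
type Requirement struct {
	Key      string
	Operator Operator
	Values   []string
}

// Matches reports whether the device properties satisfy every requirement.
// A key the device plugin never exposed can only satisfy DoesNotExist.
func Matches(props map[string]string, reqs []Requirement) bool {
	for _, r := range reqs {
		if !matchOne(props, r) {
			return false
		}
	}
	return true
}

func matchOne(props map[string]string, r Requirement) bool {
	val, ok := props[r.Key]
	switch r.Operator {
	case OpExists:
		return ok
	case OpDoesNotExist:
		return !ok
	case OpIn:
		if !ok {
			return false
		}
		for _, v := range r.Values {
			if v == val {
				return true
			}
		}
		return false
	case OpGt, OpLt:
		// Assumption: Gt/Lt compare plain base-10 integers and take one value.
		if !ok || len(r.Values) != 1 {
			return false
		}
		have, err1 := strconv.ParseInt(val, 10, 64)
		want, err2 := strconv.ParseInt(r.Values[0], 10, 64)
		if err1 != nil || err2 != nil {
			return false
		}
		if r.Operator == OpGt {
			return have > want
		}
		return have < want
	default:
		return false // unknown operators never match
	}
}

func main() {
	gpu := map[string]string{"Kind": "nvidia-gpu", "memory": "32"}
	highMem := []Requirement{
		{Key: "Kind", Operator: OpIn, Values: []string{"nvidia-gpu"}},
		{Key: "memory", Operator: OpGt, Values: []string{"30"}},
	}
	fmt.Println(Matches(gpu, highMem))

	// "nvlink" is not among the reported properties, so Exists cannot match.
	needsNvlink := []Requirement{{Key: "nvlink", Operator: OpExists}}
	fmt.Println(Matches(gpu, needsNvlink))
}
```

An expression on a key the plugin does not report simply fails to match, which is one way to read "must use a subset of the exposed properties".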
|
||
## Motivation | ||
Kubernetes system knows only two resource types, 'CPU' and 'Memory'. Any other | ||
resource can be requested by pod using opaque-integer-resource(OIR) mechanism. |
Maybe we should say ExtendedResource now?
right. As we discussed on slack, will iterate.
```yaml | ||
kind: ResourceClass | ||
metadata: | ||
name: fast.nic |
Just a side note that the device manager today does not support network interfaces. Thus - as resource classes build upon device manager - this would not work.
However, I'd love to see that the device manager is getting extended to support this one-off case of network resources as well.
This is a key cross-sig deliverable. We have discussed it at length in the RMWG and it was deferred because we wanted it to be led by sig-network (or more likely the new Network Plumbing WG).
Is it possible to start with a simple model that supports high-performance NICs with a combination of a CNI plugin and a device plugin? The CNI plugin takes care of network interface setup and management, and can be portable across different container orchestration systems. The device plugin can run as a sidecar container and take care of device initialization, health monitoring, and resource advertising. I know that having the device plugin act as a middleman between CNI and Kubelet may pose certain limitations. We can discuss them and see whether we can solve them by enriching the information passed between the device plugin and Kubelet. However, at least for now, I hope the information passed between the device plugin and Kubelet can stay at the resource level (such as resource name and properties) and the device level (such as device runtime configuration). This way, it is clear at the API level that Kubelet is the central place in charge of resource allocation and container runtime setup, which I feel will be easier to extend to support future features such as cross-resource affinity allocation.
I don't think the SIG has enough resources to implement this case. I would like to see this design kept as simple as possible and hopefully merged in the 1.10 cycle.
Ack, thanks for the info.
@jiayingz technically it can also be done closer to CNI. To me it just looks like it will be a much cleaner interface to provide the network connectivity via the device manager. This will effectively formalize how components can add additional NICs to a pod. Today this is unspecified and everybody is pretty much using CNI in some way to achieve this.
The problem is that there are so many different assumptions by different projects. Eventually it even implies changing the CNI config of the host, which is not necessarily desirable.
OTOH, if we could use the device manager, we would have a known way to distribute such a plugin and a known model for how it operates, which could eventually lead to more cooperation and less wild west.
I'd love to see this POC'd as a completely pluggable concept addition.
* If I request a resource `nvidia.gpu.high.mem` for my pod, any 'nvidia-gpu' | ||
type device which has memory greater than or equal to 'X' GB, should be able | ||
to satisfy this request, independent of other device capabilities such as | ||
'version' or 'nvlink locality' etc. |
Getting too selective on non-fungible resources can be a dangerous game vs. channeling a lowest common denominator.
For a decade on grid systems we allowed arbitrary matching on any attribute via an expressive configuration language, and it was eventually heavily abused by users to hoard the prized resources.
In addition to node selection, the scheduler is also responsible for selecting a | ||
device that matches the resource class requested by the user. | ||
|
||
### Reason for preferring device selection at the scheduler and not at the kubelet |
What if a user requests ER in pods that are not handled by the scheduler (DaemonSet pods or static pods)?
@vikaschoudhary16 knock knock :) As @timothysc suggested, maybe we can arrange a PoC? Any free bandwidth? |
classes which could select this device in the cache. | ||
5. Patch the resource class objects with new 'Requested' in the `ResourceClassStatus`. | ||
6. Add the pod reference in local DeviceToPod mapping structure in the cache. | ||
7. Patch the pod object with selected device annotation with prefix 'scheduler.alpha.kubernetes.io/resClass' |
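The cache bookkeeping in steps 5 and 6 can be sketched roughly as follows. `SchedulerCache`, `Allocate`, and the choice to bump `Requested` for every class that selects the device are assumptions made for illustration; the proposal text visible here does not pin these details down.

```go
package main

import "fmt"

// SchedulerCache is a hypothetical, simplified view of the bookkeeping
// described in the steps above.
type SchedulerCache struct {
	// Requested mirrors the 'Requested' count in ResourceClassStatus,
	// keyed by resource class name.
	Requested map[string]int
	// DeviceToPod maps a device ID to the pod that holds it.
	DeviceToPod map[string]string
}

// NewSchedulerCache returns an empty cache with initialized maps.
func NewSchedulerCache() *SchedulerCache {
	return &SchedulerCache{
		Requested:   map[string]int{},
		DeviceToPod: map[string]string{},
	}
}

// Allocate records that pod now holds deviceID and bumps the Requested
// count of every resource class that could select that device.
func (c *SchedulerCache) Allocate(deviceID, pod string, selectingClasses []string) error {
	if owner, taken := c.DeviceToPod[deviceID]; taken {
		return fmt.Errorf("device %s already allocated to pod %s", deviceID, owner)
	}
	for _, rc := range selectingClasses {
		c.Requested[rc]++
	}
	c.DeviceToPod[deviceID] = pod
	return nil
}

func main() {
	cache := NewSchedulerCache()
	classes := []string{"nvidia.gpu", "nvidia.gpu.high.mem"}

	if err := cache.Allocate("gpu-0", "cuda-vector-add", classes); err != nil {
		fmt.Println("unexpected:", err)
	}
	// A second pod asking for the same device is rejected.
	if err := cache.Allocate("gpu-0", "other-pod", classes); err != nil {
		fmt.Println(err)
	}
	fmt.Println(cache.Requested["nvidia.gpu.high.mem"], cache.DeviceToPod["gpu-0"])
}
```

Whether these updates should be written back to the API server during host selection is exactly what the review comments below question.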
Would like to know the opinion of some sig-scheduling folks on the best way to achieve that.
Sending a patch request during host selection for a pod seems a bit different from the current scheduling implementation.
Maybe @bsalamat or @timothysc ?
Sending a patch request directly here is not right.
But we can probably update the annotation of `assumedPod`, and update the Pod API object during `bind()`, which is async.
@bsalamat Make sense?
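A rough sketch of that alternative, under stated assumptions: `Pod`, `assume`, `bindAsync`, and the `update` callback are hypothetical stand-ins rather than scheduler APIs, and the exact annotation key (prefix plus `-` plus class name) is a guess, since the proposal only specifies the prefix.

```go
package main

import (
	"fmt"
	"sync"
)

// Pod is a hypothetical, minimal stand-in for the pod API object.
type Pod struct {
	Name        string
	Annotations map[string]string
}

// annotationPrefix is the prefix named in the proposal.
const annotationPrefix = "scheduler.alpha.kubernetes.io/resClass"

// assume returns a copy of pod with the selected device recorded as an
// annotation. Only the scheduler's in-memory copy changes; nothing is
// sent to the API server at this point.
func assume(pod Pod, resClass, deviceID string) Pod {
	assumed := Pod{Name: pod.Name, Annotations: map[string]string{}}
	for k, v := range pod.Annotations {
		assumed.Annotations[k] = v
	}
	// Assumption: the key is the prefix joined to the class name with "-".
	assumed.Annotations[annotationPrefix+"-"+resClass] = deviceID
	return assumed
}

// bindAsync models the asynchronous bind step: the annotated pod is
// written out in a goroutine, off the host selection path. update is a
// hypothetical stand-in for the API server write.
func bindAsync(wg *sync.WaitGroup, assumed Pod, node string, update func(Pod, string) error) {
	wg.Add(1)
	go func() {
		defer wg.Done()
		if err := update(assumed, node); err != nil {
			fmt.Println("bind failed:", err)
		}
	}()
}

func main() {
	var (
		mu     sync.Mutex
		stored = map[string]Pod{} // stands in for the API server
		wg     sync.WaitGroup
	)
	update := func(p Pod, node string) error {
		mu.Lock()
		defer mu.Unlock()
		stored[p.Name] = p
		return nil
	}

	assumed := assume(Pod{Name: "cuda-vector-add"}, "nvidia.high.mem", "gpu-0")
	bindAsync(&wg, assumed, "node-a", update)
	wg.Wait()
	fmt.Println(stored["cuda-vector-add"].Annotations)
}
```

The point of the split is that host selection touches only the cached copy, while the single write to the pod object rides along with the bind that the scheduler already performs asynchronously.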
I generally do not like the idea of scheduler knowing about mapping of devices to resource classes. This does not fit well in scheduler's logic and does not seem to be something that scheduler should be aware of.
It is also not clear why scheduler needs to update the pod with the selected device annotation.
@bsalamat The problem I can see here is this: by design, the device plugin is not expected to have any scheduling logic (e.g. selecting devices), and if the scheduler does not want to have it either, the only place left to select devices is the kubelet. But what this proposal describes requires awareness of the whole picture of all device info in the cluster to make a decision, which a single kubelet is not capable of (it only has the device info of its own node).
I am willing to push this design, as it helps a lot to enable GPU topology in Kubernetes and also fixes blockers for other devices like FPGAs. So it would be great to know your ideas for scheduling.
Do you mind adding a pod spec example to your design document? I'm fairly certain resource classes are supposed to be requested in the resources field, but it's probably better to have explicit confirmation :)

After discussing it a bit internally, and since this is going to be discussed at the Face 2 Face, we think it might be a good idea to discuss device sharing ideas in this proposal, as sharing might have a significant impact on the technical implementation. There are two kinds of device sharing:

Simple sharing seems to be something that might be expressed as a construct on top of the ResourceClass API and built into the podSpec, exactly like other sharing APIs such as volumes.

    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-vector-add
    spec:
      restartPolicy: OnFailure
      initContainers:
        - name: myInitContainer1
          image: "nvidia/cuda"
          resources:
            limits:
              devices: ["nvidia-gpu"]
      containers:
        - name: myInferenceContainer1
          image: "nvidia/cuda"
          resources:
            limits:
              devices: ["nvidia-gpu"]
        - name: myInferenceContainer2
          image: "nvidia/cuda"
          resources:
            limits:
              devices: ["nvidia-gpu"]
      devices:
        - name: "nvidia-gpu"
          resources:
            nvidia.high.mem: 1
    ---
    kind: ResourceClass
    metadata:
      name: nvidia.high.mem
    spec:
      resourceSelector:
        - matchExpressions:
            - key: "Kind"
              operator: "In"
              values:
                - "nvidia-gpu"
            - key: "memory"
              operator: "GtEq"
              values:
                - "30G"

Complex sharing could be expressed by adding labels to the devices:

    - name: "nvidia-gpu"
      labels: ["nvidia.com/MPS"]
      resources:
        nvidia.high.mem: 1
1. Discovery, advertisement, allocation/deallocation of devices is expected to | ||
be addressed by [device plugin proposal](https://github.com/kubernetes/community/pull/695/files) | ||
|
||
## Resource Class |
I skimmed over the doc. I think any resource class proposal should cover both node resources and cluster resources. This proposal does not cover the latter.
I have been lurking for a while to see how this shapes up. This proposal seems aimed only at selecting devices based on metadata. There is certainly a need to model resources in K8s, but the current proposal needs to be enhanced in many ways to reflect the variety of devices, usage models, and use cases. Let us start by stating what we would want from a resource model:
To crystallize the ideas above, consider this FPGA card currently in the market [*]. It can be abstracted as below:
How would such a device be represented and handled in this proposal? I think the scope needs to be broadened considerably. [*] This is only an example. I am not affiliated with, or own stock in, the company selling this product. |
@RenaudWasTaken In your example of complex sharing, do users have to be aware of MPS? Is it possible that device plugin talks to MPS daemon to setup per-container GPU resource limit, without users knowing about MPS? |
Signed-off-by: vikaschoudhary16 <vichoudh@redhat.com>
ComputeResource objects are similar to PV objects in storage. Each is tied to a physical resource that can have a wide range of vendor-specific properties. Kubelet will create or update ComputeResource objects upon any resource availability or property change for node-level resources. Once a node is configured to support the ComputeResource API and the underlying resource is exported as a ComputeResource, its quantity should NOT be included in the conventional NodeStatus Capacity/Allocatable fields, to avoid counting the resource more than once. The ComputeResource API can be included in NodeStatus to facilitate resource introspection. | ||
For cluster level resources, a special controller or a scheduler extender can create a ComputeResource and dynamically bind that to a node during or after scheduling. | ||
|
||
### ResourceClass API |
Will users use ResourceClass directly in their pods? How about letting them create ComputeResourceClaims (CRCs) and use CRCs in their pods, just as with PV/PVC? Users can express their requirements in CRCs, and the scheduler (or some controller) can match CRCs to CRs.
And use ResourceClass for auto-provisioning if necessary?
Created a KEP, #2265, for this, so closing this one. |
Notes for reviewers
First proposal submitted to the community repo, please advise if something's not right with the format or procedure, etc.
cc @aveshagarwal @jeremyeder @derekwaynecarr @vishh @jiayingz
Signed-off-by: vikaschoudhary16 vichoudh@redhat.com