Minimal changes for partitionable devices in DRA evolution prototype #27
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: johnbelamaric. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
// Must not have more than 32 entries.
//
// +optional
SharedAllocatableConsumed map[string]resource.Quantity `json:"sharedAllocatableConsumed,omitempty"`
Do we need a level of indirection here instead of just a map of Quantity? Otherwise we can never use anything other than a quantity in ResourceCapacity.
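For concreteness, a sketch of the two shapes under discussion -- the current flat map versus a struct value with room to grow. Type and field names echo the PR, but the exact field sets are assumptions, not the proposed API:

package resourceapi

import "k8s.io/apimachinery/pkg/api/resource"

// Current shape: a flat map pins the value type to Quantity forever.
type DeviceFlatMap struct {
    SharedAllocatableConsumed map[string]resource.Quantity `json:"sharedAllocatableConsumed,omitempty"`
}

// With indirection: a struct value can later grow fields (block sizes,
// ranges, ...) without changing the overall shape.
type ResourceCapacity struct {
    Name     string            `json:"name"`
    Capacity resource.Quantity `json:"capacity"`
}

type DeviceWithIndirection struct {
    SharedAllocatableConsumed []ResourceCapacity `json:"sharedAllocatableConsumed,omitempty"`
}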
Probably; see the PR comment notes: #27 (comment)
Would love your opinion specifically on the last question: pool-per-gpu or something else.
I personally like the idea of grouping them with names. It means I'm not forced to have one GPU per ResourceSlice if I don't want to. This will likely be important for "simpler" devices that only partition on, say, one dimension.
For example with this, I can have the following in a single ResourceSlice (using a newly proposed name for this field):
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceSlice
metadata:
  name: <uuid>
spec:
  driver: gpu.nvidia.com
  poolName: node0
  nodeName: node0
  sharedCapacity:
  - name: gpu-0-shared-capacity
    resources:
    - name: memory
      capacity: 40Gi
    - name: compute
      capacity: 180
  - name: gpu-1-shared-capacity
    resources:
    - name: memory
      capacity: 40Gi
    - name: compute
      capacity: 180
  devices:
  - name: gpu-0
    sharedCapacityConsumed:
    - name: gpu-0-shared-capacity
      resources:
      - name: memory
        capacity: 40Gi
      - name: compute
        capacity: 180
  - name: gpu-0-first-half
    sharedCapacityConsumed:
    - name: gpu-0-shared-capacity
      resources:
      - name: memory
        capacity: 20Gi
      - name: compute
        capacity: 90
  - name: gpu-0-second-half
    sharedCapacityConsumed:
    - name: gpu-0-shared-capacity
      resources:
      - name: memory
        capacity: 20Gi
      - name: compute
        capacity: 90
  - name: gpu-1
    sharedCapacityConsumed:
    - name: gpu-1-shared-capacity
      resources:
      - name: memory
        capacity: 40Gi
      - name: compute
        capacity: 180
  - name: gpu-1-first-half
    sharedCapacityConsumed:
    - name: gpu-1-shared-capacity
      resources:
      - name: memory
        capacity: 20Gi
      - name: compute
        capacity: 90
  - name: gpu-1-second-half
    sharedCapacityConsumed:
    - name: gpu-1-shared-capacity
      resources:
      - name: memory
        capacity: 20Gi
      - name: compute
        capacity: 90
It also makes it more flexible such that one device could pull from both the shared resources of a parent GPU and some (possibly) off-chip shared resources that all devices have access to (regardless of their parent).
Agreed, I like this too.
@klueska I pushed an update with this, PTAL.
Patrick convinced me to go back to a flat list, which can accomplish the same with:
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceSlice
metadata:
  name: <uuid>
spec:
  driver: gpu.nvidia.com
  poolName: node0
  nodeName: node0
  sharedCapacity:
  - name: gpu-0-shared-memory
    capacity: 40Gi
  - name: gpu-0-shared-compute
    capacity: 180
  - name: gpu-1-shared-memory
    capacity: 40Gi
  - name: gpu-1-shared-compute
    capacity: 180
  devices:
  - name: gpu-0
    sharedCapacityConsumed:
    - name: gpu-0-shared-memory
      capacity: 40Gi
    - name: gpu-0-shared-compute
      capacity: 180
  - name: gpu-0-first-half
    sharedCapacityConsumed:
    - name: gpu-0-shared-memory
      capacity: 20Gi
    - name: gpu-0-shared-compute
      capacity: 90
  - name: gpu-0-second-half
    sharedCapacityConsumed:
    - name: gpu-0-shared-memory
      capacity: 20Gi
    - name: gpu-0-shared-compute
      capacity: 90
  - name: gpu-1
    sharedCapacityConsumed:
    - name: gpu-1-shared-memory
      capacity: 40Gi
    - name: gpu-1-shared-compute
      capacity: 180
  - name: gpu-1-first-half
    sharedCapacityConsumed:
    - name: gpu-1-shared-memory
      capacity: 20Gi
    - name: gpu-1-shared-compute
      capacity: 90
  - name: gpu-1-second-half
    sharedCapacityConsumed:
    - name: gpu-1-shared-memory
      capacity: 20Gi
    - name: gpu-1-shared-compute
      capacity: 90
I will add it in the KEP as such.
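For reference, the flat model in this YAML corresponds to Go types along these lines (a sketch only; names follow the YAML above, not necessarily the final API):

package resourceapi

import "k8s.io/apimachinery/pkg/api/resource"

// SharedCapacity is one named, pooled quantity. The same shape serves both
// the slice-level sharedCapacity list and a device's sharedCapacityConsumed
// list, which are identical today (see the review note further down).
type SharedCapacity struct {
    // Name identifies the pooled resource within this ResourceSlice,
    // e.g. "gpu-0-shared-memory".
    Name string `json:"name"`

    // Capacity is the total amount provided (or, in a consumed entry,
    // the amount drawn down).
    Capacity resource.Quantity `json:"capacity"`
}

type Device struct {
    Name string `json:"name"`

    // SharedCapacityConsumed lists how much this device draws from each
    // named SharedCapacity when it is allocated.
    SharedCapacityConsumed []SharedCapacity `json:"sharedCapacityConsumed,omitempty"`
}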
Partitionable devices, along with driver-side magic, can also support the idea of "compound devices". Here's how it would work. Suppose we have two drivers, one for GPUs and one for NICs. We have nodes with 8 GPUs and 4 NICs. We want to allow certain "valid" combinations of these to be consumed as a unit. The "rules" here are: each GPU may be consumed on its own, while each NIC is associated with two specific GPUs and may only be consumed together with one or both of them.
This implies the following valid "molecules" for the triplet (gpu0, gpu1, nic0): gpu0; gpu1; gpu0+nic0; gpu1+nic0; and gpu0+gpu1+nic0.
Similarly, there are 5 valid combinations for each of the other triplets. So, what we do is create a "compound device" driver that runs on the node but acts as an intermediary between the K8s control plane and the drivers for the underlying devices. It contains an in-process mini API server that serves the ResourcePool API, and we point the GPU and NIC drivers at that local instance. The compound device driver uses those to construct a new compound pool on top of those drivers that follows the rules above, using this partitionable model:
apiVersion: resource.k8s.io/v1alpha3
kind: ResourcePool
metadata:
  name: node0-compound0
spec:
  driver: compound.example.com
  nodeName: node0
  devices:
  - name: gpu0
    sharedConsumed:
      gpu0: 1
  - name: gpu0-nic0
    sharedConsumed:
      gpu0: 1
      nic0: 1
  - name: gpu1
    sharedConsumed:
      gpu1: 1
  - name: gpu1-nic0
    sharedConsumed:
      gpu1: 1
      nic0: 1
  - name: gpu0-gpu1-nic0
    sharedConsumed:
      gpu0: 1
      gpu1: 1
      nic0: 1
  ...
  sharedConsumable:
  - name: gpu0
    capacity: 1
  - name: gpu1
    capacity: 1
  - name: gpu2
    capacity: 1
  - name: gpu3
    capacity: 1
  - name: gpu4
    capacity: 1
  - name: gpu5
    capacity: 1
  - name: gpu6
    capacity: 1
  - name: gpu7
    capacity: 1
  - name: nic0
    capacity: 1
  - name: nic1
    capacity: 1
  - name: nic2
    capacity: 1
  - name: nic3
    capacity: 1
The compound device driver is the only one that actually publishes anything to the K8s control plane. It is also what kubelet makes calls to, and it in turn calls down to the other drivers. There are lots of details to work out for this, of course. For example, ideally users don't need to know they are using this intermediary, except maybe based on the class they choose. This would mean that the CEL-based attributes they use should still be the ones used by the underlying devices, rather than some that are particular to the compound device driver (which also may have some). For that, we may need to make sure that attributes are always qualified, rather than allowing the short-hand of "unqualified means from the driver". Otherwise I can see a lot of confusion, especially during copy-and-paste situations. There are also a few limitations:
I'm not sold on the complex-device scenario you proposed here, but I think we could iterate on that later.
// SharedCapacity is a set of resources which can be drawn down upon.
type SharedCapacity struct {
    // Name is the name of this set of shared capacity, which is unique
    // within the resource pool.
    // +required
    Name string `json:"name"`

    // Resources are the list of resources provided in this set.
    Resources []ResourceCapacity `json:"resources,omitempty"`
}

// SharedCapacityRequest is a request to draw down resources from a particular
// SharedCapacity.
type SharedCapacityRequest struct {
    // Name is the name of the SharedCapacity from which to draw the
    // resources.
    //
    // +required
    Name string `json:"name"`

    // Resources are the list of resources and the amount of resources
    // to draw down.
    Resources []ResourceCapacity `json:"resources,omitempty"`
}
Do you expect these to diverge at some point? Right now they are identical.
Not really. I considered making them the same, but since their usage is different I kept them separate. I don't care which way we do it, whatever API reviewers prefer.
// Name is the resource name/type.
// +required
Name string `json:"name"`
I assume all of the Name fields should be a DNS label?
Is there a mismatch between the type names? Above it's SharedCapacity, here it's ResourceCapacity.
I am flexible on the various names. SharedCapacity is a named list of ResourceCapacity objects.
Presenting structs in the order in which they are needed would help. The first change is adding SharedCapacity []SharedCapacity, so here I would expect the definition of that SharedCapacity struct, not the ResourceCapacity that is currently here.
If I am following the structs correctly, SharedCapacity is a map[name1]map[name2]quantity, where both maps are lists due to how we do such maps in Kubernetes. Can't this be flattened to map[name]quantity?
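For illustration, Kubernetes expresses such maps as lists with a designated key field, so the flattened map[name]quantity would look roughly like this (a sketch; the +listType/+listMapKey markers are the standard convention, and the surrounding field placement is an assumption):

package resourceapi

import "k8s.io/apimachinery/pkg/api/resource"

type ResourceSliceSpec struct {
    // Effectively map[name]quantity, expressed as a list keyed by "name",
    // which is the usual Kubernetes convention for maps in APIs.
    // +listType=map
    // +listMapKey=name
    SharedCapacity []SharedCapacity `json:"sharedCapacity,omitempty"`
}

type SharedCapacity struct {
    Name     string            `json:"name"`
    Capacity resource.Quantity `json:"capacity"`
}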
I assume all of the Name fields should be a DNS label?
Can we make the validation identical to the identifier part of an attribute name? That's merely for consistency, not because the two are related. As explained elsewhere, these shared capacity names are not exposed to users.
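As a sketch of what that validation could look like, assuming apimachinery's DNS-label helper (whether the rule ends up being a DNS label or the identifier part of an attribute name is exactly the open question here; the function itself is hypothetical):

package validation

import (
    "fmt"

    "k8s.io/apimachinery/pkg/util/validation"
)

// validateSharedCapacityName applies a DNS-label rule to a shared capacity
// name. Shown only to make the validation question concrete.
func validateSharedCapacityName(name string) error {
    if msgs := validation.IsDNS1123Label(name); len(msgs) > 0 {
        return fmt.Errorf("invalid shared capacity name %q: %v", name, msgs)
    }
    return nil
}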
@@ -64,12 +64,33 @@ type ResourcePoolSpec struct {
    // Must not have more than 128 entries.
    Devices []Device `json:"devices,omitempty"`

    // SharedCapacity are pooled resources that are shared by all
    // devices in the pool. This is typically used when representing a
Is this the pool as defined in kubernetes/enhancements#4709 or the object?
This field would be directly in the ResourceSlice. The description is out of date. It's a shared set of resources, with a name, that devices in the Slice reference / consume when allocated. They are local to this specific slice and only consumable by devices listed in this slice.
This becomes relevant when considering a future "reservation" protocol. Reserving a device and the corresponding shared capacity will be a lot easier if in the same object.
You don't have to reserve the shared capacity -- just the device (as before). By reserving a device, you implicitly decrease the amount of resources available in the shared capacity.
I was thinking towards a future protocol for allocation when there are multiple schedulers in a cluster which need to allocate the same devices.
"in the pool" should read "in the ResourceSlice", so they are the same object, yes!
You get a whole device or not, that is what determines the subtraction from the shared set. It doesn't matter if there are multiple schedulers or not; they just need to coordinate on the device allocation.
Hmm, yes, you're right. In that case, they would need to know how much of the overall shared capacity has already been depleted.
Actually, so long as we can see the list of all devices that have already been reserved, then each scheduler can do the math to track the depleted capacity independently.
Yes, that's exactly how we do it; this is the same with and without shared resources.
// resources.
//
// +required
Name string `json:"name"`
Can you show some YAML examples of how this is going to be used? How is potential name ambiguity resolved once someone asks for some kind of device which could be provided by different vendors?
My gut feeling is that this may depend on the same naming convention as for device attributes and also shouldn't overlap with those, i.e. "memory" in the gpu.dra.example.com/ scope can be either an attribute or a consumable resource, but not both.
It will be used like this:
#27 (comment)
It is not referenceable outside of the devices in this particular slice, meaning it's not visible to CEL expressions. If you want it to be visible in a CEL expression, you need to duplicate it as an attribute.
I think I was confused by the "request" part of the name. This isn't something that users put into a claim to "request" this amount.
Do users ever need to interact with this in a claim? How do they ask for "a GPU with at least 10Gi memory"?
Overlapping comments... I think you answered my question. Still, including YAMLs in the PR would help to clarify this.
To maybe be more clear -- sharedCapacities exists for the scheduler, not the end user. The scheduler will track each SharedCapacity and see what's still available in it when considering a device for allocation that references it in its SharedCapacityConsumed. If there are not enough resources available in it (because they have already been consumed by previously allocated devices), then the device is not selectable.
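A sketch of that bookkeeping, assuming the flat SharedCapacity/Device shapes sketched earlier (the function names are hypothetical, not a real scheduler API):

package scheduler

import "k8s.io/apimachinery/pkg/api/resource"

// Types as sketched earlier for the flat model.
type SharedCapacity struct {
    Name     string
    Capacity resource.Quantity
}

type Device struct {
    Name                   string
    SharedCapacityConsumed []SharedCapacity
}

// remainingSharedCapacity subtracts the SharedCapacityConsumed of every
// already-allocated device from the totals published in the slice.
func remainingSharedCapacity(totals []SharedCapacity, allocated []Device) map[string]resource.Quantity {
    remaining := make(map[string]resource.Quantity, len(totals))
    for _, sc := range totals {
        remaining[sc.Name] = sc.Capacity.DeepCopy()
    }
    for _, dev := range allocated {
        for _, c := range dev.SharedCapacityConsumed {
            q := remaining[c.Name]
            q.Sub(c.Capacity)
            remaining[c.Name] = q
        }
    }
    return remaining
}

// selectable reports whether a candidate still fits: every shared capacity
// it consumes must have at least that much remaining.
func selectable(candidate Device, remaining map[string]resource.Quantity) bool {
    for _, c := range candidate.SharedCapacityConsumed {
        rem, ok := remaining[c.Name]
        if !ok || rem.Cmp(c.Capacity) < 0 {
            return false
        }
    }
    return true
}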
I will be sure to include YAMLs in the PR when adding text about this to kubernetes/enhancements#4709
No, users never interact with this. It is internal to the device/capacity model. We can use a different word than "Request"...
Nothing changes on the claim side for this. It just helps us keep track of which partitions (devices) are allowed and which are already used up, since they can be indirectly used up by another device using a shared resource.
Yeah, that is 100% on top of this without affecting what this looks like. It's something I want to prototype before too long - but it would be out-of-tree anyway :)
SGTM
The rationale for that:
// Capacity is the total capacity of the named resource.
// +required
Capacity resource.Quantity `json:"capacity"`
I prefer not using a resource.Quantity. As pointed out elsewhere, this is not a user-facing API. It's for the scheduler machinery. A simpler Count int64 would be easier to encode and work with.
If there ever is a need for a value < 1, then the driver can scale both capacity and consumed values up by a fixed factor (1000, 1000000).
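A sketch of that simpler encoding, with the scaling trick from the comment (field and type names are illustrative):

package resourceapi

// With a plain int64 there is no Quantity parsing or canonicalization.
type SharedCapacity struct {
    Name string `json:"name"`

    // Count is in driver-chosen base units (e.g. bytes for memory).
    // Fractions are handled by scaling: with a factor of 1000, a total
    // of 180 compute units is published as 180000, and a device that
    // consumes half a unit lists 500.
    Count int64 `json:"count"`
}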
- capacity: 40Gi
  name: gpu-0-memory
I don't think this needs to be a capacity, but it does need to be an attribute so that users can select on it.
It does then need to have capacities for all of the memory blocks, though (which is probably more consistent anyway).
- attributes:
  - name: uuid
    string: GPU-c2fe2ade-955a-421f-976e-3f1dd7254234
  - name: product-name
    string: Mock NVIDIA A100-SXM4-40GB
  - name: brand
    string: Nvidia
  - name: architecture
    string: Ampere
  - name: cuda-compute-capability
    version: 8.0.0
  - name: driver-version
    version: 550.54.15
  - name: cuda-driver-version
    version: 12.4.0
  - name: mig-capable
    bool: true
  - name: mig-enabled
    bool: false
I imagine there will be quite a few more attributes here eventually, specifically around NVLinks, PCIe root complexes, etc. These will also need to be duplicated in each MIG device (unless we build a section for common attributes, grouped by name, that can be referenced by different devices).
Since this is "what's in the KEP", I think we should merge it and rebase all the options on it, so they appear as diffs. But I screwed up and LGTM'ed option 2 (#29). I don't have super on this repo, so I cannot manually fix.
Force-pushed from cde999c to 41d297f.
Ok, this matches the KEP. Merging.
This PR adds the minimal fields needed to support partitionable devices. A few notes for consideration:
- SharedAllocatable for the shared pooled resources, and SharedAllocatableConsumed for the device values that consume items from the pool.
- SharedAllocatable is a []ResourceCapacity, which is a struct with just name and quantity. This leaves out the BlockSize and IntRange stuff to keep it as simple as possible.
- SharedAllocatableConsumed is a map[string]resource.Quantity to mirror PodSpec requests. Given that this is the capacity model, not the claim model, consistency may not be needed here. In that case, we can probably change this to a struct instead, which would give us more room for expansion in the future.
- Since SharedAllocatable is directly in ResourcePool, each partitionable device needs to be its own pool. We could consider two other options: instead of Devices []Device, the pool would instead contain DeviceGroups []DeviceGroup; or the AllocatableGroup option as well, since it is incremental. I don't at this point like the last option much.