Minimal changes for partitionable devices in DRA evolution prototype #27

Merged

Conversation

johnbelamaric
Contributor

This PR adds the minimal fields needed to support partitionable devices. A few notes for consideration:

  • Terminology chosen is SharedAllocatable for the shared pooled resources, and SharedAllocatableConsumed for the device values that consume items from the pool.
  • SharedAllocatable is a []ResourceCapacity which is a struct with just name and quantity. This leaves out BlockSize and IntRange stuff to keep it as simple as possible.
  • SharedAllocatableConsumed is a map[string]resource.Quantity to mirror PodSpec requests. Given that this is the capacity model, not the claim model, that consistency may not be needed here; in that case, we could change this to a struct instead, which would give us more room for expansion in the future. (A sketch of these minimal fields appears after this list.)
  • Max in each list is arbitrarily set to 32.
  • Since SharedAllocatable is directly in ResourcePool, that means each partitionable device needs to be its own pool. We could consider two other options:
    • Create another type for groups of SharedAllocatables. Kevin had this in an earlier version. Then each physical GPU would be in one of those, and the devices would name the group in addition to the pool in their SharedAllocatableConsumed:
    type ResourcePool struct {
      ...
      Devices []Device
    
      SharedAllocatable []AllocatableGroup
    }
    type AllocatableGroup struct {
      Name string
      Allocatable []ResourceCapacity
    }
    type Device struct {
      ...
      SharedAllocatableConsumed []ResourceRequest
    }
    type ResourceRequest struct {
      AllocatableGroupName string
      ResourceName string
      Quantity resource.Quantity
    }
    • Instead of the pool containing Devices []Device it would instead contain DeviceGroups []DeviceGroup where:
    type DeviceGroup struct {
      SharedAllocatable []ResourceCapacity
      Devices []Device
    }
    • I am sort of OK with the pool-per-GPU option, as it is the most minimal change. I could also see going with the AllocatableGroup option, since it is incremental. At this point I don't like the last option much.
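For reference, here is a minimal sketch of the fields described in the first bullets above (the pool-per-GPU shape, without grouping). The field names and the 32-entry limits come from this PR; the comments and exact layout are illustrative only:

type ResourcePool struct {
  ...
  Devices []Device

  // Shared resources that devices in this pool draw down on.
  // Must not have more than 32 entries.
  SharedAllocatable []ResourceCapacity
}
type ResourceCapacity struct {
  // Name of the shared resource, unique within the pool.
  Name string
  // Total amount of this resource available to be consumed.
  Quantity resource.Quantity
}
type Device struct {
  ...
  // Amount of each shared resource consumed by this device, keyed by
  // resource name (mirrors PodSpec requests).
  // Must not have more than 32 entries.
  SharedAllocatableConsumed map[string]resource.Quantity
}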

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnbelamaric

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 6, 2024
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 6, 2024
// Must not have more than 32 entries.
//
// +optional
SharedAllocatableConsumed map[string]resource.Quantity `json:"sharedAllocatableConsumed,omitempty"`
Contributor

Do we need a level of indirection here instead of just a map of Quantity? Otherwise we can never use anything other than a quantity in ResourceCapacity.
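(For illustration only, a hedged sketch of the two shapes being contrasted here; the type names DeviceFlat, DeviceWithIndirection, and SharedCapacityConsumption are hypothetical, not from this PR:)

// Flat map: compact, but the value can only ever be a quantity.
type DeviceFlat struct {
  SharedAllocatableConsumed map[string]resource.Quantity
}

// With a level of indirection: more verbose, but leaves room to add
// fields other than a plain quantity later.
type DeviceWithIndirection struct {
  SharedAllocatableConsumed []SharedCapacityConsumption
}

type SharedCapacityConsumption struct {
  Name     string
  Quantity resource.Quantity
  // Future fields (block size, ranges, ...) could be added here.
}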

Contributor Author

probably, see the PR comment notes: #27 (comment)

Contributor Author

Would love your opinion specifically on the last question - pool-per-gpu or something else

Contributor

@klueska Jun 11, 2024

I personally like the idea of grouping them with names. It means I'm not forced to have one GPU per ResourceSlice if I don't want to. This will likely be important for "simpler" devices that only partition on, say, one dimension.

For example with this, I can have the following in a single ResourceSlice (using a newly proposed name for this field):

apiVersion: resource.k8s.io/v1alpha3
kind: ResourceSlice
metadata:
  name: <uuid>
spec:
  driver: gpu.nvidia.com
  poolName: node0
  nodeName: node0
  sharedCapacity:
  - name: gpu-0-shared-capacity
    resources: 
    - name: memory
      capacity: 40Gi
    - name: compute
      capacity: 180
  - name: gpu-1-shared-capacity
    resources: 
    - name: memory
      capacity: 40Gi
    - name: compute
      capacity: 180
  devices:
  - name: gpu-0
    sharedCapacityConsumed:
    - name: gpu-0-shared-capacity
      resources: 
      - name: memory
        capacity: 40Gi
      - name: compute
        capacity: 180
  - name: gpu-0-first-half
    sharedCapacityConsumed:
    - name: gpu-0-shared-capacity
      resources: 
      - name: memory
        capacity: 20Gi
      - name: compute
        capacity: 90
  - name: gpu-0-second-half
    sharedCapacityConsumed:
    - name: gpu-0-shared-capacity
      resources: 
      - name: memory
        capacity: 20Gi
      - name: compute
        capacity: 90
  - name: gpu-1
    sharedCapacityConsumed:
    - name: gpu-1-shared-capacity
      resources: 
      - name: memory
        capacity: 40Gi
      - name: compute
        capacity: 180
  - name: gpu-1-first-half
    sharedCapacityConsumed:
    - name: gpu-1-shared-capacity
      resources: 
      - name: memory
        capacity: 20Gi
      - name: compute
        capacity: 90
  - name: gpu-1-second-half
    sharedCapacityConsumed:
    - name: gpu-1-shared-capacity
      resources: 
      - name: memory
        capacity: 20Gi
      - name: compute
        capacity: 90

Contributor

@klueska Jun 11, 2024

It also makes it more flexible such that one device could pull from both the shared resources of a parent GPU and some (possibly) off-chip shared resources that all devices have access to (regardless of their parent).

Contributor Author

Agreed, I like this too.

Contributor Author

@klueska I pushed an update with this, PTAL.

Contributor

Patrick convinced me to go back to a flat list, which can accomplish the same with:

apiVersion: resource.k8s.io/v1alpha3
kind: ResourceSlice
metadata:
  name: <uuid>
spec:
  driver: gpu.nvidia.com
  poolName: node0
  nodeName: node0
  sharedCapacity:
  - name: gpu-0-shared-memory
    capacity: 40Gi
  - name: gpu-0-shared-compute
    capacity: 180
  - name: gpu-1-shared-memory
    capacity: 40Gi
  - name: gpu-1-shared-compute
    capacity: 180
  devices:
  - name: gpu-0
    sharedCapacityConsumed:
    - name: gpu-0-shared-memory
      capacity: 40Gi
    - name: gpu-0-shared-compute
      capacity: 180
  - name: gpu-0-first-half
    sharedCapacityConsumed:
    - name: gpu-0-shared-memory
      capacity: 20Gi
    - name: gpu-0-shared-compute
      capacity: 90
  - name: gpu-0-second-half
    sharedCapacityConsumed:
    - name: gpu-0-shared-memory
      capacity: 20Gi
    - name: gpu-0-shared-compute
      capacity: 90
  - name: gpu-1
    sharedCapacityConsumed:
    - name: gpu-1-shared-memory
      capacity: 40Gi
    - name: gpu-1-shared-compute
      capacity: 180
  - name: gpu-1-first-half
    sharedCapacityConsumed:
    - name: gpu-1-shared-memory
      capacity: 20Gi
    - name: gpu-1-shared-compute
      capacity: 90
  - name: gpu-1-second-half
    sharedCapacityConsumed:
    - name: gpu-1-shared-memory
      capacity: 20Gi
    - name: gpu-1-shared-compute
      capacity: 90

I will add it in the KEP as such.

@johnbelamaric
Contributor Author

johnbelamaric commented Jun 7, 2024

Partitionable devices, along with driver-side magic, can also support the idea of "compound devices". Here's how it would work.

Suppose we have two drivers, one for GPUs and one for NICs. We have nodes with 8 GPUs and 4 NICs. We want to allow certain "valid" combinations of these to be consumed as a unit. The "rules" here are:

  • NICs are not available without a GPU
  • GPUs are available with or without a NIC, but specific GPUs go along with specific NICs, like:
    • gpu0,1 go with nic0
    • gpu2,3 go with nic1
    • gpu4,5 go with nic2
    • gpu6,7 go with nic3

This implies the following valid "molecules" for the triplet (gpu0, gpu1, nic0):

  • gpu0
  • gpu0 + nic0 (leaving gpu1 only available by itself, without a NIC)
  • gpu1
  • gpu1 + nic0 (leaving gpu0 only available by itself, without a NIC)
  • gpu0 + gpu1 + nic0

Similarly, there are 5 valid combinations for each of the other triplets.

So, what we do is create a "compound device" driver that runs on the node but acts as an intermediary between the K8s control plane and the drivers for the underlying devices. It contains an in-process mini API server that serves the ResourcePool API, and we point the GPU and NIC drivers at that local instance. The compound device driver uses those to construct a new compound pool on top of those drivers that follows the rules above, using this partitionable model:

apiVersion: resource.k8s.io/v1alpha3
kind: ResourcePool
metadata:
  name: node0-compound0
spec:
  driver: compound.example.com
  nodeName: node0
  devices:
  - name: gpu0
    sharedConsumed:
      gpu0: 1
  - name: gpu0-nic0
    sharedConsumed:
      gpu0: 1
      nic0: 1
  - name: gpu1
    sharedConsumed:
      gpu1: 1
  - name: gpu1-nic0
    sharedConsumed:
      gpu1: 1
      nic0: 1
  - name: gpu0-gpu1-nic0
    sharedConsumed:
      gpu0: 1
      gpu1: 1
      nic0: 1
...
  sharedConsumable:
  - name: gpu0
    capacity: 1
  - name: gpu1
    capacity: 1
  - name: gpu2
    capacity: 1
  - name: gpu3
    capacity: 1
  - name: gpu4
    capacity: 1
  - name: gpu5
    capacity: 1
  - name: gpu6
    capacity: 1
  - name: gpu7
    capacity: 1
  - name: nic0
    capacity: 1
  - name: nic1
    capacity: 1
  - name: nic2
    capacity: 1
  - name: nic3
    capacity: 1

The compound device driver is the only one that actually publishes anything to the K8s control plane. It is also what kubelet makes calls to, and it in turn calls down to the other drivers.

There are lots of details to work out for this, of course. For example, ideally users don't need to know they are using this intermediary, except maybe based on the class they choose. This would mean that the CEL-based attributes they use should still be the ones used by the underlying devices, rather than some that are particular to the compound device driver (which also may have some). For that, we may need to make sure that attributes are qualified, always, rather than allowing the short-hand of "unqualified means from the driver". Otherwise I can see a lot of confusion, especially during copy-and-paste situations.

There are also a few limitations:

  • Underlying devices will only be available as part of a compound device. You cannot make them available both as independent devices managed by their own drivers, and as part of a compound device.
  • Theoretically it would be possible to make a compound device that includes partitions from underlying partitionable devices, but this would likely be pretty difficult and is not something we should make a priority (in fact the compound driver probably should not allow it, at least for the foreseeable future).
  • Similarly, when we add allocatable device resources as a way of creating fractional devices, we probably will want to exclude that from the compound device driver functionality. It may theoretically be possible to support fractional devices in the compound device by utilizing fractional allocation of underlying devices, but that would be...complex.

@klueska
Contributor

klueska commented Jun 11, 2024

I'm not sold on the complex-device scenario you proposed here, but I think we could iterate on that later.
The more important thing is to agree on the API for partitionable devices, and I'm fairly happy with the naming / structure I proposed in my comment here: https://github.com/kubernetes-sigs/wg-device-management/pull/27/files#r1634768662

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 11, 2024
Comment on lines 169 to 193
// SharedCapacity is a set of resources which can be drawn down upon.
type SharedCapacity struct {
// Name is the name of this set of shared capacity, which is unique
// within the resource pool.
// +required
Name string `json:"name"`

// Resources are the list of resources provided in this set.
Resources []ResourceCapacity `json:"resources,omitempty"`
}

// SharedCapacityRequest is a request to draw down resources from a particular
// SharedCapacity.
type SharedCapacityRequest struct {

// Name is the name of the SharedCapacity from which to draw the
// resources.
//
// +required
Name string `json:"name"`

// Resources are the list of resources and the amount of resources
// to draw down.
Resources []ResourceCapacity `json:"resources,omitempty"`
}
Contributor

Do you expect these to diverge at some point? Right now they are identical.

Contributor Author

Not really. I considered making them the same, but since their usage is different I kept them separate. I don't care which way we do it, whatever API reviewers prefer.

Comment on lines 83 to 85
// Name is the resource name/type.
// +required
Name string `json:"name"`
Contributor

I assume all of the Name fields should be a DNS label?

Contributor

Is there a mismatch between the type names? Above it's SharedCapacity, here it's ResourceCapacity.

Contributor Author

I am flexible on the various names. SharedCapacity is a named list of ResourceCapacity objects.

Contributor

Presenting structs in the order in which they are needed would help.

The first change is adding SharedCapacity []SharedCapacity, so here I would expect the definition of that SharedCapacity struct, not the ResourceCapacity that is currently here.

If I am following the structs correctly, SharedCapacity is a map[name1]map[name2]quantity, where both maps are lists due to how we do such maps in Kubernetes.

Can't this be flattened to map[name]quantity?

Contributor

> I assume all of the Name fields should be a DNS label?

Can we make the validation identical to the identifier part of an attribute name? That's merely for consistency, not because the two are related. As explained elsewhere, these shared capacity names are not exposed to users.

@@ -64,12 +64,33 @@ type ResourcePoolSpec struct {
// Must not have more than 128 entries.
Devices []Device `json:"devices,omitempty"`

// SharedCapacity are pooled resources that are shared by all
// devices in the pool. This is typically used when representing a
Contributor

Is this the pool as defined in kubernetes/enhancements#4709 or the object?

Contributor

@klueska Jun 11, 2024

This field would be directly in the ResourceSlice. The description is out of date. It's a shared set of resources, with a name, that devices in the Slice reference / consume when allocated. They are local to this specific slice and only consumable by devices listed in this slice.

Contributor

This becomes relevant when considering a future "reservation" protocol. Reserving a device and the corresponding shared capacity will be a lot easier if in the same object.

Contributor

You don't have to reserve the shared capacity -- just the device (as before). By reserving a device, you implicitly decrease the amount of resources available in the shared capacity.

Contributor

I was thinking towards a future protocol for allocation when there are multiple schedulers in a cluster which need to allocate the same devices.

Contributor Author

"in the pool" should read "in the ResourceSlice", so they are the same object, yes!

You get a whole device or not, that is what determines the subtraction from the shared set. It doesn't matter if there are multiple schedulers or not; they just need to coordinate on the device allocation.

Contributor

@klueska Jun 11, 2024

Hmm, yes, you're right. In that case, they would need to know how much of the overall shared capacity has already been depleted.

Contributor

Actually, so long as we can see the list of all devices that have already been reserved, then each scheduler can do the math to track the depleted capacity independently.

Contributor Author

Yes, that's exactly how we do it; this is the same with and without shared resources.

// resources.
//
// +required
Name string `json:"name"`
Contributor

Can you show some YAML examples of how this is going to be used? How is potential name ambiguity resolved once someone asks for some kind of device which could be provided by different vendors?

Contributor

My gut feeling is that this may depend on the same naming convention as for device attributes and also shouldn't overlap with those, i.e. "memory" in the gpu.dra.example.com/ scope can be either an attribute or a consumable resource, but not both.

Contributor

It will be used like this:
#27 (comment)

It is not referenceable outside of the devices in this particular slice, meaning it's not visible to CEL expressions. If you want it to be visible in a CEL expression, you need to duplicate it as an attribute.

Contributor

I think I was confused by the "request" part of the name. This isn't something that users put into a claim to "request" this amount.

Do users ever need to interact with this in a claim? How do they ask for "a GPU with at least 10Gi memory"?

Contributor

Overlapping comments... I think you answered my question. Still, including YAMLs in the PR would help to clarify this.

Contributor

To be a bit clearer -- sharedCapacity exists for the scheduler, not the end user. The scheduler tracks each SharedCapacity and sees what's still available in it when considering a device for allocation that references it in its SharedCapacityConsumed. If there are not enough resources available in it (because they have already been consumed by previously allocated devices), then the device is not selectable.
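A hedged sketch of that accounting, assuming a Device type whose SharedCapacityConsumed is a []SharedCapacityRequest as in the types above; the helper names (remainingOf, deviceFits) are illustrative, not actual scheduler code:

// remainingOf returns how much of one named resource is left in a
// SharedCapacity set after subtracting what already-allocated devices
// consume from that set.
func remainingOf(set SharedCapacity, resourceName string, allocated []Device) resource.Quantity {
  var remaining resource.Quantity
  for _, rc := range set.Resources {
    if rc.Name == resourceName {
      remaining = rc.Capacity.DeepCopy()
    }
  }
  for _, dev := range allocated {
    for _, req := range dev.SharedCapacityConsumed {
      if req.Name != set.Name {
        continue
      }
      for _, rc := range req.Resources {
        if rc.Name == resourceName {
          remaining.Sub(rc.Capacity)
        }
      }
    }
  }
  return remaining
}

// deviceFits reports whether a candidate device is still selectable:
// every resource it consumes from a SharedCapacity must have enough
// remaining after accounting for previously allocated devices.
func deviceFits(candidate Device, sets []SharedCapacity, allocated []Device) bool {
  for _, req := range candidate.SharedCapacityConsumed {
    for _, set := range sets {
      if set.Name != req.Name {
        continue
      }
      for _, rc := range req.Resources {
        remaining := remainingOf(set, rc.Name, allocated)
        if remaining.Cmp(rc.Capacity) < 0 {
          return false
        }
      }
    }
  }
  return true
}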

Contributor

I will be sure to include YAMLs in the PR when adding text about this to kubernetes/enhancements#4709

Contributor Author

No, users never interact with this. It is internal to the device/capacity model. We can use a different word than "Request"...

Contributor Author

Nothing changes on the claim side for this. It just helps us keep track of which partitions (devices) are allowed and which are already used up, since they can be indirectly used up by another device using a shared resource.

@johnbelamaric
Contributor Author

> I'm not sold on the complex-device scenario you proposed here, but I think we could iterate on that later. The more important thing is to agree on the API for partitionable devices, and I'm fairly happy with the naming / structure I proposed in my comment here: https://github.com/kubernetes-sigs/wg-device-management/pull/27/files#r1634768662

Yeah, that is 100% on top of this without affecting what this looks like. It's something I want to prototype before too long - but it would be out-of-tree anyway :)

@johnbelamaric
Contributor Author

johnbelamaric commented Jun 12, 2024 via email

@pohly
Contributor

pohly commented Jun 12, 2024

The rationale for that:

  • The quantities are for machines, not humans, so efficiency trumps readability.
  • No nesting means that the maximum slice size can be higher, with the driver deciding how they want to use that.


// Capacity is the total capacity of the named resource.
// +required
Capacity resource.Quantity `json:"capacity"`
Contributor

I prefer not using a resource.Quantity. As pointed out elsewhere, this is not a user-facing API. It's for the scheduler machinery. A simpler Count int64 would be easier to encode and work with.

If there ever is a need for a value < 1, then the driver can scale both capacity and consumed values up by a fixed factor (1000, 1000000).
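A hedged sketch of that simpler shape (illustrative only, not what was merged):

type ResourceCapacity struct {
  // Name is the resource name/type.
  Name string
  // Count is the total amount of the named resource, in driver-chosen
  // units. A driver needing sub-unit granularity can publish values
  // pre-scaled by a fixed factor (e.g. x1000 for milli-units).
  Count int64
}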

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 14, 2024
Comment on lines 29 to 30
- capacity: 40Gi
name: gpu-0-memory
Contributor

I don't think this needs to be a capacity, but it does need to be an attribute so that users can select on it.

Contributor

This does need to have capacities for all of the memory blocks then though (which is probably more consistent anyway).

Comment on lines 8 to 26
- attributes:
- name: uuid
string: GPU-c2fe2ade-955a-421f-976e-3f1dd7254234
- name: product-name
string: Mock NVIDIA A100-SXM4-40GB
- name: brand
string: Nvidia
- name: architecture
string: Ampere
- name: cuda-compute-capability
version: 8.0.0
- name: driver-version
version: 550.54.15
- name: cuda-driver-version
version: 12.4.0
- bool: true
name: mig-capable
- bool: false
name: mig-enabled
Contributor

I imagine there to be quite a few more attributes here eventually, specifically around NVLinks, PCIe root complexes, etc. These will also need to be duplicated in each MIG device (unless we build a section for common attributes, grouped by name, that can be referenced by different devices).

@thockin

thockin commented Jun 15, 2024

Since this is "what's in the KEP" I think we should merge it and rebase all the options on it, so they appear as diffs. But I screwed up and LGTM'ed option 2 (#29). I don't have super on this repo, so I cannot manually fix it.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 15, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 16, 2024
@johnbelamaric
Contributor Author

Ok, this matches the KEP. Merging.

@johnbelamaric johnbelamaric merged commit fdc5827 into kubernetes-sigs:main Jun 16, 2024
1 of 2 checks passed