79 changes: 66 additions & 13 deletions dra-evolution/pkg/api/capacity_types.go
@@ -59,14 +59,10 @@ type ResourcePoolSpec struct {
// vendor of the driver.
DriverName string `json:"driverName" protobuf:"bytes,3,name=driverName"`

// SharedCapacity defines the set of shared capacity consumable by
// devices in this ResourceSlice.
//
// Must not have more than 128 entries.
// DeviceShape defines the common shape of all devices in this pool.
//
// +listType=atomic
// +optional
SharedCapacity []SharedCapacity `json:"sharedCapacity,omitempty"`
// +required
DeviceShape DeviceShape `json:"deviceShape"`

// Devices lists all available devices in this pool.
//
@@ -81,15 +77,53 @@ type ResourcePoolSpec struct {
const ResourcePoolMaxSharedCapacity = 128
const ResourcePoolMaxDevices = 128

type DeviceShape struct {
@klueska (Contributor), Jun 15, 2024:

I'm still struggling to see how I share attributes across top level GPUs. Is that not supported in this model? Or do I need to put all of my top-level GPUs into one DeviceShape to take advantage of this?

@johnbelamaric (Contributor, Author), Jun 15, 2024:

There is exactly one shape for any slice. Check out the example YAML.

Cardinality:

  • One slice has one shape. The slice effectively encapsulates a model of device. If you have a node with several models of GPU, you would have a slice for each model.
  • One shape has many partitions. This is the "map" of how this shape device can be partitioned.
  • One slice has many devices, all of the same, single shape. Thus, the only thing you need in device entries are the device names and any device-specific attributes.

What you cannot do, as it exists here, is have per-device, per-partition attributes. But otherwise it is as expressive as the fully denormalized model.

If we want to pack things in tighter, we could allow multiple shapes per slice, but I don't see any value in that.
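The cardinality described above can be sketched as a small projection step: one shape per slice, unrolled across every device in the slice. This is a minimal illustration, not the PR's code; the types and the `"-"` name join are simplified stand-ins based on the `DevicePartition` naming note later in this diff.

```go
package main

import "fmt"

// shape is a stand-in for DeviceShape: one partition map shared by
// every device in the slice.
type shape struct{ partitions []string }

// unroll projects the slice's single shape across its devices, producing
// the flat device list the scheduler would see.
func unroll(s shape, deviceNames []string) []string {
	var out []string
	for _, d := range deviceNames {
		if len(s.partitions) == 0 {
			// Unpartitioned shapes are consumed in their entirety.
			out = append(out, d)
			continue
		}
		for _, p := range s.partitions {
			out = append(out, d+"-"+p)
		}
	}
	return out
}

func main() {
	a100 := shape{partitions: []string{"whole", "mig-1g.5gb-0"}}
	fmt.Println(unroll(a100, []string{"gpu-0", "gpu-1"}))
	// prints [gpu-0-whole gpu-0-mig-1g.5gb-0 gpu-1-whole gpu-1-mig-1g.5gb-0]
}
```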

Contributor:

I need each MIG device to have a reference to its parent GPU's UUID (so that they can be aligned on that). How does one do that with this model? Likewise for PCIe root, etc.

Contributor Author:

The device as seen by the scheduler will have all the attributes of the shape, the partition, and the device. So the device UUID will be there for every MIG.

Now, if we need to have a separate UUID for the MIG device and for the parent device, it is a bit more tricky, because we may have name conflicts.

So, we may need a little more attribute manipulation magic here. And we may need a standard attribute to mean "partition". I prefer to see if we can get by with published conventions rather than manipulations. Another alternative is qualifying attributes by shape, partition, and device, but that seems very hard on the user.

Contributor Author:

The device as seen by the scheduler will have all the attributes of the shape, the partition, and the device. So the device UUID will be there for every MIG.

The idea is that the scheduler sees an unrolled / flattened / projected version of this into a flat list of devices. So from the claim standpoint, it is equivalent to the current model.
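The "unrolled / flattened" view rests on overlay semantics: partition attributes overwrite shape attributes of the same name, and device attributes overwrite both, exactly as the field comments in this diff describe. A minimal sketch, with plain string maps standing in for the PR's `DeviceAttribute` values and illustrative attribute names:

```go
package main

import "fmt"

// overlay returns base with layer's entries written on top: entries in
// layer overwrite same-named entries in base.
func overlay(base, layer map[string]string) map[string]string {
	out := make(map[string]string, len(base)+len(layer))
	for k, v := range base {
		out[k] = v
	}
	for k, v := range layer {
		out[k] = v
	}
	return out
}

func main() {
	shape := map[string]string{"product-name": "A100", "memory": "40Gi"}
	partition := map[string]string{"mig-profile": "1g.5gb", "memory": "5Gi"}
	device := map[string]string{"uuid": "GPU-abc123"}

	// The scheduler sees shape + partition + device, later layers winning.
	flat := overlay(overlay(shape, partition), device)
	fmt.Println(flat["memory"], flat["mig-profile"], flat["uuid"])
	// prints 5Gi 1g.5gb GPU-abc123
}
```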

// Attributes defines the attributes of this device shape.
// The name of each attribute must be unique.
//
// Must not have more than 32 entries.
//
// +listType=atomic
// +optional
Attributes []DeviceAttribute `json:"attributes,omitempty" protobuf:"bytes,3,opt,name=attributes"`
Contributor:

So these become shared attributes that all partitions share? No ability to define e.g. the common set of attributes just for e.g. 1g.5gb devices that exist at different memory slices.

Contributor Author:

yeah, those still get layered in on a per-partition basis

Contributor:

Sure, but I want to be able to define e.g.

deviceAttributeGroups:
  - name: common-attributes
    items: ...
  - name: common-mig-1g.5gb-attributes
    items: ...
  - name: common-mig-2g.10gb-attributes
    items: ...

And then layer them in as appropriate.

Contributor Author:

In other words, partitions individually have attributes. So, in the "device" that is presented to the claim side of the model, you have shape.Attributes + partition.Attributes + device.Attributes available to CEL.

I do NOT go so far as to define a "PartitionShape" which common-izes attributes from a particular type of Partition. That is possible but there are diminishing returns. Option 5 does effectively do that (and more).

Contributor:

This is a byproduct of nesting. If everything is top-level and referenceable by a name then it "just works" and is fully flexible.

Contributor Author:

The goal of the "shape" is to capture the shared capacity / shared capacity consumed model, as well as the common attributes, once, and re-use them across all devices. That's where the big size win comes in, not just with attributes but with the partition model. I am not clear on how you plan to capture both of those; it would be useful to see.

Contributor Author:

I tried top-level shared capacity groups in Option 2, it actually made things LARGER. Maybe I did it wrong? I don't think I did. I think because they are groups per-GPU, we list them over and over. Whereas with the nesting, we don't need to list them, because they are implicit in the structure.

Maybe we can do that implicit duplication with shared groups at the top level? I am not sure how we would know which groups to do that with and which to treat as independent. Maybe a flag? We also need to capture the repeated "consumption" models, rather than repeat them on a per-device basis, to get the full effect.

I think that's much harder to understand, frankly, and the only thing I see us gaining is a theoretical ability to have an "independent" shared group that spans the slice rather than being per-device. That is something we could always add later at the slice level, if we needed it. We have no need for it at this point, that I know of.

@klueska (Contributor), Jun 15, 2024:

This is somewhat off topic, but something you said here made me think of this....

I haven't discussed it much, but MIG actually supports a second level of nesting. I think it could still be modeled with this Option 4 (the second level of nesting still pulls from the same pool of resources).

However, we want to limit selection of devices from this second level of nesting to a single claim (because they all share access to the same physical memory). Meaning if someone requests e.g. a 1c.3g.20gb device in two different claims, they MUST come from different parent MIGs.

i.e. the following, but guarantee that these devices are pulled from a MIG device that has not had other 2nd-level devices pulled from it yet.

apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: mig-devices-and-nics
spec:
  requests:
  - name: mig-0
    deviceClassName: gpu.nvidia.com
    expression: "device['type'] == 'MIG' && device['profile'] == '1c.3g.20gb'"
    
  - name: mig-1
    deviceClassName: gpu.nvidia.com
    expression: "device['type'] == 'MIG' && device['profile'] == '2c.3g.20gb'"
    
  requirements:
  - matchAttribute: parentMIG

We currently support this in our "classic" DRA driver (because we control the API through our CRD), but I don't see an easy way to support this restriction with structured parameters.

Contributor:

Any thoughts on this?

@johnbelamaric (Contributor, Author), Jun 15, 2024:

So, do these nested partitions really need to be visible in K8s? Are they offered to different containers? Could this be part of the config?

Since they are not consumable across claims, I don't see how K8s even knowing about them matters; the scheduler wouldn't have any flexibility but to give a whole MIG anyway. The only flexibility would be association with different containers. But maybe the driver can handle that for these (don't associate the MIG with any specific container).


// Partitions defines the set of partitions into which this device shape
// may be allocated. If not populated, then the device shape is always
// consumed in its entirety.
//
// +listType=atomic
// +optional
Partitions []DevicePartition `json:"partitions,omitempty"`
Contributor:

In the "simple" case, when I just have a bunch of devices that are not partitionable, what does my resourceSlice look like? Do I need to still fill this in with just 1 device (which is actually not partitionable)?

Contributor:

Edit: Nevermind, I see now -- this is not a replacement for Device, it is something that defines how every device in this slice "looks", modulo its name and any custom attributes.

Contributor Author:

Exactly


// SharedCapacity defines the set of shared capacity consumable by
// partitions in this DeviceShape. Not meaningful for non-partitioned
// devices.
//
// Must not have more than 128 entries.
//
// +listType=atomic
// +optional
SharedCapacity []SharedCapacity `json:"sharedCapacity,omitempty"`
}

// Device represents one individual hardware instance that can be selected based
// on its attributes.
type Device struct {
// Name is unique identifier among all devices managed by
// the driver on the node. It must be a DNS label.
type DevicePartition struct {
// Name is unique identifier among all partitions for this device. The
// device name as recorded in the allocation will be the concatenation
// of the device name and the partition name with a '-' separator.
//
// NOTE: may need a better naming scheme
//
// It must be a DNS label.
Name string `json:"name" protobuf:"bytes,1,name=name"`

// Attributes defines the attributes of this device.
// The name of each attribute must be unique.
// Attributes defines the attributes of this partition.
// The name of each attribute must be unique. The values
// in here are overlayed on top of the values in the device
// shape (overwriting them if the names are the same).
//
// NOTE: probably can get away with fewer
//
// Must not have more than 32 entries.
//
@@ -98,7 +132,7 @@ type Device struct {
Attributes []DeviceAttribute `json:"attributes,omitempty" protobuf:"bytes,3,opt,name=attributes"`

// SharedCapacityConsumed defines the set of shared capacity consumed by
// this device.
// this partition.
//
// Must not have more than 32 entries.
//
@@ -107,6 +141,25 @@ type Device struct {
SharedCapacityConsumed []SharedCapacity `json:"sharedCapacityConsumed,omitempty"`
Contributor:

Am I limited to the SharedCapacity from the DeviceShape where this partition is embedded? So, for example, if there were some "uncore" resources shared by all GPUs we couldn't consume that now?

Contributor Author:

In the canonical example, the DeviceShape is a physical GPU, with a map of how it is partitionable. The Devices then just list the physical GPUs. I am not clear on what shared resources might live outside the GPU. Are you saying you need to capture shared resources that are not part of the GPU and somehow span GPUs?

Contributor:

Yes, that is what I'm saying. I don't necessarily see it for GPUs, but I'm sure there are other devices that will want the ability to do this.

Contributor Author:

Then we can add something later for that...
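The shared-capacity model discussed in this thread can be sketched as an admission check: a set of partitions may be co-allocated on one device only if their summed `SharedCapacityConsumed` stays within the shape's `SharedCapacity`, per capacity name. A sketch with plain ints standing in for `resource.Quantity`, and illustrative capacity names, not the PR's actual allocation code:

```go
package main

import "fmt"

// fits reports whether the candidate partitions can be allocated together
// on one device: for each capacity name, the sum of consumed capacity must
// not exceed what the shape declares as shared.
func fits(shared map[string]int, partitions []map[string]int) bool {
	used := map[string]int{}
	for _, p := range partitions {
		for name, c := range p {
			used[name] += c
			if used[name] > shared[name] {
				return false
			}
		}
	}
	return true
}

func main() {
	shared := map[string]int{"memory-slices": 8}
	mig3g := map[string]int{"memory-slices": 3}
	fmt.Println(fits(shared, []map[string]int{mig3g, mig3g}))        // prints true
	fmt.Println(fits(shared, []map[string]int{mig3g, mig3g, mig3g})) // prints false
}
```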

}

type Device struct {
// Name is unique identifier among all devices managed by
// the driver on the node. It must be a DNS label.
Name string `json:"name" protobuf:"bytes,1,name=name"`

// Attributes defines the attributes of this device.
// The name of each attribute must be unique. The values
// in here are overlayed on top of the values in the device
// shape (overwriting them if the names are the same).
//
// Must not have more than 32 entries.
//
// NOTE: probably can get away with fewer
//
// +listType=atomic
// +optional
Attributes []DeviceAttribute `json:"attributes,omitempty" protobuf:"bytes,3,opt,name=attributes"`
}

const ResourcePoolMaxAttributesPerDevice = 32
const ResourcePoolMaxSharedCapacityConsumedPerDevice = 32

81 changes: 62 additions & 19 deletions dra-evolution/pkg/gen/nvidia.go
@@ -2,6 +2,7 @@ package gen

import (
"fmt"
"strings"

"github.com/NVIDIA/go-nvml/pkg/nvml/mock/dgxa100"
"github.com/kubernetes-sigs/wg-device-management/dra-evolution/pkg/api"
@@ -15,13 +16,30 @@
)

func dgxa100Pool(nodeName, poolName string, gpus int) (*api.ResourcePool, error) {
shapeAttrs := map[string]bool{
"product-name": true,
"brand": true,
"architecture": true,
"cuda-compute-capability": true,
"driver-version": true,
"cuda-driver-version": true,
}
partitionAttrs := map[string]bool{
"mig-profile": true,
}
deviceAttrs := map[string]bool{
"uuid": true,
"mig-capable": true,
"mig-enabled": true,
}

// Instantiate an instance of a mock dgxa100 server and build a nvDeviceLib
// from it. The nvDeviceLib is then used to populate the list of allocatable
// devices from this mock server using standard NVML calls.
l := nvdevicelib.New(dgxa100.New())

var shape api.DeviceShape
var devices []api.Device
var shared []api.SharedCapacity
for gpu := 0; gpu < gpus; gpu++ {
// Get the full list of allocatable devices from this GPU on the server
allocatable, err := l.GetPerGpuAllocatableDevices(gpu)
@@ -39,12 +57,30 @@ func dgxa100Pool(nodeName, poolName string, gpus int) (*api.ResourcePool, error)
return nil, fmt.Errorf("found %d shared limit groups in the resources", len(model.NamedResources.SharedLimits))
}

shared = append(shared, sharedGroupToResources(model.NamedResources.SharedLimits[0], gpu)...)
if gpu == 0 {
// first time through, create the shape
shape.SharedCapacity = sharedGroupToResources(model.NamedResources.SharedLimits[0])
}

var devAttrVals []api.DeviceAttribute
var partitions []api.DevicePartition
for _, instance := range model.NamedResources.Instances {
devices = append(devices, instanceToDevice(instance, gpu))
if gpu == 0 {
shape.Attributes = attributesToDeviceAttributes(instance.Attributes, shapeAttrs)
partitions = append(partitions, instanceToPartition(instance, partitionAttrs))
}
devAttrVals = append(devAttrVals, attributesToDeviceAttributes(instance.Attributes, deviceAttrs)...)
}
if shape.Partitions == nil {
shape.Partitions = partitions
}
}

devices = append(devices, api.Device{
Name: fmt.Sprintf("gpu-%d", gpu),
Attributes: devAttrVals,
})

}
return &api.ResourcePool{
TypeMeta: metav1.TypeMeta{
APIVersion: DevMgmtAPIVersion,
@@ -54,31 +90,39 @@
Name: nodeName + "-" + poolName,
},
Spec: api.ResourcePoolSpec{
NodeName: nodeName,
DriverName: "gpu.nvidia.com/dra",
SharedCapacity: shared,
Devices: devices,
NodeName: nodeName,
DriverName: "gpu.nvidia.com/dra",
DeviceShape: shape,
Devices: devices,
},
}, nil
}

func instanceToDevice(instance newresourceapi.NamedResourcesInstance, gpu int) api.Device {
device := api.Device{
Name: instance.Name,
Attributes: attributesToDeviceAttributes(instance.Attributes),
func instanceToPartition(instance newresourceapi.NamedResourcesInstance, partitionAttrs map[string]bool) api.DevicePartition {
name := strings.TrimPrefix(instance.Name, "gpu-0-")
if name == "gpu-0" {
name = "whole"
}

partition := api.DevicePartition{
Name: name,
Attributes: attributesToDeviceAttributes(instance.Attributes, partitionAttrs),
}

if len(instance.Resources) > 0 {
device.SharedCapacityConsumed = sharedGroupToResources(instance.Resources[0], gpu)
partition.SharedCapacityConsumed = sharedGroupToResources(instance.Resources[0])
}

return device
return partition
}

func attributesToDeviceAttributes(attrs []resourceapi.NamedResourcesAttribute) []api.DeviceAttribute {
func attributesToDeviceAttributes(attrs []resourceapi.NamedResourcesAttribute, keep map[string]bool) []api.DeviceAttribute {
var attributes []api.DeviceAttribute

for _, a := range attrs {
if _, ok := keep[a.Name]; !ok {
continue
}
if a.QuantityValue != nil {
attributes = append(attributes, api.DeviceAttribute{
Name: a.Name,
@@ -106,14 +150,13 @@ func attributesToDeviceAttributes(attrs []resourceapi.NamedResourcesAttribute) [
return attributes
}

func sharedGroupToResources(group newresourceapi.NamedResourcesSharedResourceGroup, gpu int) []api.SharedCapacity {
func sharedGroupToResources(group newresourceapi.NamedResourcesSharedResourceGroup) []api.SharedCapacity {
var resources []api.SharedCapacity

for _, item := range group.Items {
name := fmt.Sprintf("gpu-%d-%s", gpu, item.Name)
if item.QuantityValue != nil && !item.QuantityValue.IsZero() {
resources = append(resources, api.SharedCapacity{
Name: name,
Name: item.Name,
Capacity: *item.QuantityValue,
})
} else if item.IntRangeValue != nil {
@@ -123,7 +166,7 @@ func sharedGroupToResources(group newresourceapi.NamedResourcesSharedResourceGro
single := intrange.NewIntRange(int64(i), 1)
if item.IntRangeValue.Contains(single) {
resources = append(resources, api.SharedCapacity{
Name: fmt.Sprintf("%s-%d", name, i),
Name: fmt.Sprintf("%s-%d", item.Name, i),
Capacity: resource.MustParse("1"),
})
}