
KEP-3063: dynamic resource allocation #3064

Merged

Conversation

@pohly (Contributor) commented Nov 30, 2021

/sig node

@k8s-ci-robot added the sig/node, cncf-cla: yes, and size/XXL labels on Nov 30, 2021
@pohly changed the title from "3063: dynamic resource allocation: initial KEP draft" to "KEP-3063: dynamic resource allocation: initial KEP draft" on Nov 30, 2021
@pohly force-pushed the dynamic-resource-allocation-upstream branch from 9e2e0c3 to d0d0e21 on December 1, 2021
@swatisehgal (Contributor):

/cc

@zvonkok commented Dec 13, 2021

@sseetharaman6 FYI

@zvonkok commented Dec 13, 2021

If we follow the CSI methodology, were there any thoughts about keeping the PV concept and also utilizing the StorageClass `provisioner` spec entry? Say one could create a DeviceClass that has a specific provisioner, `nvidia.com/gpu-operator`, which would provision the devices (aka PVs), which could then be allocated as DeviceClaims.

If a Pod requests a DeviceClaim that cannot be bound to a Device (aka PV), the provisioner in the DeviceClass (GPU) would try to do that. The analogy in the CSI world is a StorageClass with the aws-ebs provisioner, which takes care of everything needed (create, allocate, format) to make the PV available for consumption.
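A rough sketch of what that suggestion might look like as an API type; every name here is hypothetical, mirroring StorageClass rather than anything in the KEP:

```go
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// DeviceClass is a hypothetical analog of StorageClass: the named
// provisioner would create Device objects (the PV analog), which
// DeviceClaims (the PVC analog) could then be bound to.
type DeviceClass struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	// Provisioner identifies the component that creates Device objects,
	// e.g. "nvidia.com/gpu-operator".
	Provisioner string
}
```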

@pohly (Contributor, Author) commented Dec 13, 2021

> were there any thoughts about keeping the PV concept

The split between PV and PVC has been a big source of pain for storage. It leads to a lot of race conditions, some of which still haven't been addressed (kubernetes-csi/external-provisioner#486). I'd prefer to keep it simpler than that unless a use case gets identified that depends on that split.

For storage, one advantage is that a PV can be created beforehand and then can be bound by Kubernetes to a new PVC that gets created later. We were discussing whether something like that would be useful, but then decided against it because the matching still would have to be done by a third-party add-on. If such pre-allocation is useful, it could be handled by the add-on internally.

> and also utilizing the StorageClass `provisioner` spec entry? Say one could create a DeviceClass that has a specific provisioner, `nvidia.com/gpu-operator`, which would provision the devices (aka PVs), which could then be allocated as DeviceClaims.

There is a ResourceClass (= StorageClass) with a PluginName (= Provisioner) in the current proposal to provide that additional level of indirection. This also adds complexity: how can a ResourceClaim be deallocated when the original ResourceClass is gone or modified? (The latter cannot be prevented reliably because delete/recreate can be used as a workaround for immutable fields.) The proposal covers that by requiring that the plugin name gets copied and that the parameters in the ResourceClaim are sufficient for deallocation.
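A minimal sketch of that requirement, with field names assumed for illustration rather than taken from the KEP:

```go
package sketch

// AllocationResult stands in for the driver-specific payload recorded
// during allocation.
type AllocationResult struct {
	ResourceHandle string
}

// ResourceClaimStatus sketches why the plugin name gets copied: once
// allocation succeeds, the claim carries everything needed for
// deallocation, even if the original ResourceClass was deleted or
// recreated with different content.
type ResourceClaimStatus struct {
	// DriverName is copied from the ResourceClass at allocation time,
	// so deallocation never needs to read the class again.
	DriverName string

	// Allocation must be sufficient on its own for the driver to
	// deallocate the resource.
	Allocation *AllocationResult
}
```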

@swatisehgal (Contributor) left a comment

Thanks for working on this and for the detailed write-up. This is a complex topic with a lot of moving parts, so I appreciate the effort put into it. I left a few comments.

I noticed that the roles and responsibilities are implicitly covered in the proposal. It would really help to have user stories from the point of view of a cluster admin and a device vendor, like we have here, and to state explicitly what one has to do as a cluster admin or as a device vendor to enable this feature.


### Communication between kubelet and resource node plugin
Contributor:

A suggestion: it would be nice to have a diagram that clearly shows the endpoint and gives a full view of the gRPC communication between the kubelet and the plugin.

Contributor (Author):

I agree, this low-level part needs more work. It's not fully specced out.

Contributor:

Okay, looking forward to seeing more details on this!

Contributor (Author):

I just pushed a major update with a more complete implementation section.

@bart0sh is going to add more details to the kubelet section.


#### `NodePrepareResource`
@swatisehgal (Contributor), Dec 15, 2021:

Why not call it NodeAllocateResource or AllocateResources? I think that better represents the functionality of this RPC. NodePrepareResource gives the impression that we are preparing the node and that the actual allocation happens at a later stage (in another RPC), which is not the case here.

Contributor (Author):

No, the resource is already allocated before the kubelet (and thus this call) gets involved.

@swatisehgal (Contributor), Dec 16, 2021:

Are you referring to the case where devices are shared between various pods, i.e. a device is already allocated to one pod but still has bandwidth to accommodate the resource request of another pod, and we need to "prepare" the device for the other pod?
If that is the case, it boils down to when we consider a resource to have been allocated. IMO it still makes sense to call the API AllocateResources, as for that particular container/pod the kubelet needs to perform some action to make sure that the device can be "allocated" to the new pod.

@pohly (Contributor, Author), Dec 16, 2021:

Our goal was to encourage plugins to do as much of the "allocation" as possible before a pod gets scheduled. NodePrepareResource should then be so light-weight that it cannot fail, or at least that failure is extremely unlikely. The motivation for that approach is that there is no way a pod can be moved to a different node if NodePrepareResource fails because, by definition, it only gets called after the pod has been permanently scheduled onto a node.

Plugins can deviate from that. They can do nothing when setting "phase: allocated" and do all of the actual allocation during NodePrepareResource. But that's not how it should be done.

| Condition | gRPC Code | Description | Recovery Behavior |
| --- | --- | --- | --- |
| Resource does not exist | 5 NOT_FOUND | Indicates that a resource corresponding to the specified `resource_id` does not exist. | Caller MUST verify that the `resource_id` is correct and that the resource is accessible and has not been deleted before retrying with exponential back off. |


#### `NodeUnprepareResource`
Contributor:

Similar to my comment above, I think we should rename NodeUnprepareResource to NodeDeallocateResource or simply DeallocateResource.

Contributor (Author):

See #3064 (comment): because the "real" allocation happens earlier, these node-local calls are really just about "preparing" the resource for usage by a pod.
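To make that division of labor concrete, here is a minimal Go sketch of the two node-local calls as described in this thread; the message fields are assumptions, not the KEP's actual gRPC definition:

```go
package sketch

import "context"

// NodeServer sketches the kubelet-facing gRPC service. The real
// allocation happens earlier in the control plane; these calls only
// prepare an already-allocated resource for use by pods on the node,
// which is why they should be lightweight and very unlikely to fail.
type NodeServer interface {
	// NodePrepareResource makes the allocated resource usable on this
	// node, e.g. by setting up devices or writing a CDI spec.
	NodePrepareResource(ctx context.Context, req *NodePrepareResourceRequest) (*NodePrepareResourceResponse, error)

	// NodeUnprepareResource undoes the preparation once no pod on the
	// node uses the resource anymore.
	NodeUnprepareResource(ctx context.Context, req *NodeUnprepareResourceRequest) (*NodeUnprepareResourceResponse, error)
}

type NodePrepareResourceRequest struct {
	ResourceID string // identifies the already-allocated resource
}

type NodePrepareResourceResponse struct {
	CDIDevices []string // CDI device names to inject into the containers
}

type NodeUnprepareResourceRequest struct {
	ResourceID string
}

type NodeUnprepareResourceResponse struct{}
```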


*Limitation*: GPU driver requirements are currently stored as labels
on the container image, and device plugins do not have access to the
container image.

Comment:

How does "device plugins do not have access to the container image" relate to "gracefully fail to start"?

Contributor (Author):

@klueska, @kad: can you clarify?

These limitations are often written so that people familiar with the current device interface understand them, but others can be puzzled by them. We might be better off removing those items from the KEP to avoid questions about them.

runtime-specific operations that may need to occur as a result of
injecting a device into a container (device plugins are runtime
agnostic). A good example is supporting GPU passthrough
virtualization on kata vs. runc.

Comment:

The limitation seems to complain about having no way to perform runtime-specific operations, but the requirement states "irrespective of the underlying container runtime", which is runtime agnostic. Could you clarify the use case and the problem this KEP can help solve?

Contributor (Author):

@klueska, @kad: can you clarify?

Member:

I didn't get this one either.

Contributor (Author):

I've removed all limitations after "problems that are handled by CDI". The motivation section now just ends with:

> Several other limitations are addressed by CDI.

The other limitations directly affect design decisions in the KEP and are therefore relevant; the things that can be done with CDI are less so, beyond providing some additional justification for doing this work - hopefully that isn't questionable 😅

- *Access to the plugin container*: When deploying a device plugin, I
would like to ensure that all of the operations that need to occur
as part of “injecting” a device into a container, also occur for the
“plugin container” itself.

Comment:

By "plugin container", do you mean the "third-party device plugin" that runs as a container?

Contributor (Author):

I'm not sure either what the "plugin container" here is. @kad?

Member:

An example would go a long way.

- ResourceClass, not namespaced, with privileged parameters for
multiple resource instances of a certain kind. All these instances
are provided by the same resource plugin, which is identified by a
field in the class.

Comment:

Do you mean that the resource class holds parameters of the resource exposed/provided by a specific resource plugin?
Does this resource plugin refer to the third-party plugins on lines 191-192?

Contributor (Author):

The ResourceClass holds parameters for a resource driver. They are set by the cluster admin and thus can be different in different clusters although the resource driver is the same. These parameters will be used for all ResourceClaim instances that use this ResourceClass.

Member:

Can you give an example for a specific device and describe how many classes an admin would typically create? I'm just looking to get a sense of the type of parameters that can be set for groups of pods.

Comment:

Could you add this clarification to the KEP?

Contributor (Author):

I've rewritten the ResourceClass line item, including an example.
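The KEP's example isn't reproduced in this thread; as a rough illustration of the shape discussed above (all names assumed):

```go
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ResourceClass sketches the cluster-scoped object described above:
// admin-chosen, driver-specific parameters that apply to every
// ResourceClaim referencing the class. For one GPU driver, an admin
// might create e.g. a "gpu-shared" and a "gpu-exclusive" class.
type ResourceClass struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	// DriverName selects the resource driver that handles allocation
	// for claims of this class.
	DriverName string

	// Parameters are opaque to Kubernetes and interpreted only by the
	// driver; simplified here to a string map for illustration.
	Parameters map[string]string
}
```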

containers in a pod. Depending on the capabilities defined in the
ResourceClaim by the plugin, a ResourceClaim can be used exclusively
by one pod at a time, by a certain maximum number of pods, or an
unlimited number of pods.

Comment:

Does a ResourceClaim have capacity? If they are shared, then the design needs to include an accounting of the available capacity, right?

Contributor (Author):

The proposal only covers counting users of a ResourceClaim. We can't do more than that because a concept of "capacity" would be resource specific and thus out of scope of this KEP.

In practice, the common cases will probably be "a single user" (= exclusive access) and "unlimited". The API also supports a fixed number because that is trivial to support and might be useful.
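A small sketch of that user counting; the names and the representation are assumptions, not the proposed API:

```go
package sketch

// ClaimSharing sketches the three modes described above: exclusive
// access (cap of 1), a fixed cap, or unlimited sharing.
type ClaimSharing struct {
	ReservedFor []string // UIDs of the pods currently using the claim
	MaxUsers    int      // 1 = exclusive, N = fixed cap, 0 = unlimited
}

// CanReserve reports whether one more pod may start using the claim.
func CanReserve(c ClaimSharing) bool {
	return c.MaxUsers == 0 || len(c.ReservedFor) < c.MaxUsers
}
```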

Comment:

Could you add this clarification to the KEP?

Contributor (Author):

Done.

@pohly (Contributor, Author) commented Jan 17, 2022

Cathy pointed out that in several places, "plugin" is ambiguous. Sometimes the context determines whether it is a plugin for the scheduler, the kubelet, or the entire cluster. We cannot do much about scheduler and kubelet plugins; that's simply what they are called there.

But we can avoid "plugin" as name for the third-party cluster add-on. Let's do the same as for storage and call it "driver". Then we have:

  • ResourcePlugin -> ResourceDriver (as in the existing CSIDriver type)
  • ResourceClass.PluginName -> ResourceClass.DriverName

"controller" is also overloaded. We have a "resource claim controller" (the in-tree component that we add to kube-controller-manager" and "resource controller" (the control plane part of the third-party cluster add-on). Can someone think of a better name for the latter?

I decided against "Allocator" earlier because that seemed too specific, but right now allocation and deallocation is exactly what that component handles. Because we don't need to bake that name into any API (i.e. it's only used in the KEP and presentations), we can use that name now and if the purpose ever changes, pick something more generic when that happens.

Finally, do we need a more specific term instead of just "resource"? There are already "native resources" (the ones handled by kube-scheduler itself), "extended resources" (current device interface, handled by webhooks), and now "custom or generic resources"? If we pick such an adjective, should we also use it in the API (ResourceClaim -> CustomResourceClaim, PodSpec.Resources -> PodSpec.CustomResources, ContainerSpec.PodResources -> ContainerSpec.CustomResources)?

@k8s-ci-robot removed the lgtm label on Jun 23, 2022
@pohly (Contributor, Author) commented Jun 23, 2022

As this now seems to be heading towards "implementable" (🤞), let me include the additional commits that I had pending in the other PR. All of those changes are in proper additional commits (thus easy to review) and just tweak the text. The main change is the status change to "implementable".

devices must be carefully tuned to ensure that GPU resources do not go unused
because some of the pre-partitioned devices are in low demand. It also puts
the burden on the user to pick a particular MIG device type, rather than
declaring the resource constraints more abstractly.
Comment:

Nit: picking a MIG device type (a name for a pre-configured set of resources) would be the more abstract option; I think what we want (and get) from this KEP is being able to declare our resource constraints more "explicitly".

@klueska (Contributor), Jun 23, 2022:

I guess it depends on how you look at it. By “more abstractly” I meant you might only care about making sure you get at least 5GB of memory on your GPU and then the plugin can either create a MIG device that satisfies that for you, or give you a full GPU if that’s all that’s available and it’s OK with that. With pre-defined MIG devices a user has to explicitly request that device type rather than providing a more abstract set of resource constraints that the plugin can work with to find the best GPU for the job.

@TBBle, Jun 23, 2022:

That's fair. I read the case we're trying to enable not as "choosing the best GPU" but as subdividing a GPU on demand to match your explicit resource constraints, e.g. "at least 5GB", versus having to specify a pre-defined GPU division (I thought that's what "MIG device type" meant here) which probably matches the container's requirements but is abstracted behind a name.

Contributor:

True. Looking at just the "resource sharing" / subdivision aspect, what you say makes sense. I ended up conflating two different advantages of the new proposal into one, i.e. better support for "resource sharing" and better support for matching workloads to the "right-sized" resources based on an abstract set of constraints.
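To make the contrast concrete, a hypothetical sketch of the two request styles being compared here (neither is the KEP's API):

```go
package sketch

// GPURequest contrasts the two styles from this thread. With static
// MIG partitioning, a user names a pre-defined profile; with this KEP,
// a driver can instead satisfy explicit constraints on demand, e.g. by
// creating a matching MIG device or handing out a full GPU.
type GPURequest struct {
	// Pre-partitioned style: pick a named profile such as "1g.5gb".
	MIGProfile string

	// Constraint style: ask for at least this much GPU memory and let
	// the driver choose or create a suitable device.
	MinMemoryGiB int
}
```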

@gjtempleton (Member) left a comment

Two minor points, both more informational than blocking.

Thanks for all the hard work on this.

For SIG Autoscaling:
/lgtm

drivers can be added to the configuration file defined by the `--cloud-config`
parameter, with a common field that gets added in all cloud provider
configs, or a new `--config` parameter can be added. Later, dynamically
discovering deployed webhooks can be added through an autoscaler CRD.
Member:

Minor heads-up that we're beginning discussions about improving the config mechanism for the CA to stop the continuing explosion of command-line flags, with the current discussion focusing on the use of ConfigMaps.

`NodeReady` is needed to solve one particular problem: when a new node first
starts up, it may be ready to run pods, but the pod from a resource driver's
DaemonSet may still be starting up. If the resource driver controller needs
information from such a pod, then it will not be able to filter
correctly. Similar to how extended resources are handled, the autoscaler then
first needs to wait until the extender also considers the node to be ready.
Member:

This will likely add some complexity to advice around the unready flags in the CA for slow-to-start resource driver controllers, especially `--ok-total-unready-count`, though it is easily addressed.

@k8s-ci-robot added the lgtm label on Jun 23, 2022
@pohly (Contributor, Author) commented Jun 23, 2022

/hold

Let's make sure that @alculquicondor and @thockin confirm merging of this PR as "implementable".

@k8s-ci-robot added the do-not-merge/hold label on Jun 23, 2022
// The resource driver checks this field and resets it to false
// together with clearing the Allocation field. It also sets it
// to false when the resource is not allocated.
Deallocate bool
Member:

This field is going to be something to watch out for during implementation. But good enough as a high-level idea.

@alculquicondor (Member) left a comment

/lgtm
from the sig-scheduling perspective.

Just for the record, I still have my reservations about the cluster-autoscaler story.

@thockin (Member) commented Jun 23, 2022

/lgtm

This whole proposal is VERY daunting, but I don't see why we should block it further - the bar was set and cleared.

CDI is a new animal to me, so we should watch it carefully.

Important: we should not merge any part of this until we resolve the open issues, ESPECIALLY around scheduling. We don't want to get to 1.25 code freeze and have a partially built mess to deal with :)

@thockin removed the do-not-merge/hold label on Jun 23, 2022
@dchen1107 (Member):

/lgtm from the SIG Node perspective.

Added some comments, and some concerns remain about the complexities and security impacts for K8s vendors, but I'm ok to carry on those discussions during the implementation phase or via pohly#13.

@klueska (Contributor) commented Jun 23, 2022

@dchen1107 since this is a sig-node KEP it needs an /approve from you as well in order to move forward.

@dchen1107 (Member):

/approve

@pohly (Contributor, Author) commented Jun 23, 2022

/assign @johnbelamaric

Can you do the final approval?

@johnbelamaric (Member):

/approve

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, johnbelamaric, klueska, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
