KEP-3063: dynamic resource allocation #3064
Conversation
9e2e0c3 to d0d0e21
/cc
@sseetharaman6 FYI
If we follow the CSI methodology, were there any thoughts about keeping the PV and also utilizing the StorageClass? If a Pod requests a DeviceClaim which cannot be bound to a Device (aka a PV), the provisioner in the DeviceClass (GPU) would try to do that. The analogy in the CSI world is a StorageClass that has the provisioner.
The split between PV and PVC has been a big source of pain for storage. It leads to a lot of race conditions, of which some still haven't been addressed (kubernetes-csi/external-provisioner#486). I'd prefer to keep it simpler than that unless a use case gets identified that depends on that split. For storage, one advantage is that a PV can be created beforehand and then can be bound by Kubernetes to a new PVC that gets created later. We were discussing whether something like that would be useful, but then decided against it because the matching still would have to be done by a third-party add-on. If such pre-allocation is useful, it could be handled by the add-on internally.
There is a ResourceClass (= StorageClass) with a PluginName (= Provisioner) in the current proposal to provide that additional level of indirection. This also leads to additional complexity (how can a ResourceClaim be deallocated when the original ResourceClass is gone or modified? The latter cannot be prevented reliably because delete/recreate can be used as a workaround for immutable fields.), but the proposal tries to cover that by requiring that the plugin name gets copied and that the parameters in the ResourceClaim must be sufficient for deallocation.
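To make the indirection and the deallocation concern concrete, here is a rough Go sketch of the shapes being discussed; the type and field names follow the wording of the proposal at this point (PluginName etc.) and are not the final API:

```go
// Sketch only: simplified stand-ins for the API objects under discussion,
// not the final KEP types.
package sketch

// ResourceClass is cluster-scoped (not namespaced) and provides the extra
// level of indirection: it names the plugin and carries admin-controlled
// parameters, analogous to StorageClass and its Provisioner.
type ResourceClass struct {
	Name       string
	PluginName string            // which third-party plugin handles claims of this class
	Parameters map[string]string // privileged, admin-controlled settings
}

// ResourceClaimStatus illustrates why the plugin name (and whatever else is
// needed for deallocation) must be copied into the claim at allocation time:
// deallocation then still works even if the original ResourceClass was
// deleted or recreated with different content.
type ResourceClaimStatus struct {
	PluginName string            // copied from the class when the claim is allocated
	Allocation map[string]string // must be sufficient for the plugin to deallocate later
}
```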
Thanks for working on this and for the detailed write-up. This is a complex topic with a lot of moving parts, so I appreciate the effort put into this. Left a few comments.
The roles and responsibilities are only covered implicitly in the proposal. It would really help to have user stories from the point of view of a cluster admin and a device vendor, like we have here, and to state explicitly what one has to do as a cluster admin or as a device vendor to enable this feature.
}
```

### Communication between kubelet and resource node plugin
A suggestion: it would be nice to have a diagram that clearly shows the endpoint and gives a full view of the gRPC communication between kubelet and the plugin.
I agree, this low-level part needs more work. It's not fully spec-ed out.
Okay, looking forward to seeing more details on this!
I just pushed a major update with a more complete implementation section.
@bart0sh is going to add more details to the kubelet section.
<<[/UNRESOLVED]>>
```

#### `NodePrepareResource`
Why not call it NodeAllocateResource or AllocateResources, as I think that better represents the functionality of this RPC? NodePrepareResource gives the impression that we are only preparing the node and that the actual allocation happens at a later stage (in another RPC), which is not the case here.
No, the resource is already allocated before kubelet (and thus this call here) gets involved.
Are you referring to the case where devices are shared between various pods, i.e. a device is already allocated to one pod but still has bandwidth to accommodate the resource request of another pod, and then we need to "prepare" the device for the other pod?
If that is the case, it boils down to when we consider a resource to have been allocated. IMO, it still makes sense to refer to the API as AllocateResources, as for that particular container/pod the kubelet needs to perform some action to make sure that the device can be "allocated" to the new pod.
Our goal was to encourage plugins to do as much of the "allocation" as possible before a pod gets scheduled. Then NodePrepareResource should be so light-weight that it cannot fail, or at least that failure is extremely unlikely. The motivation for that approach is that there is no way a pod can be moved to a different node if NodePrepareResource fails because, by definition, it only gets called after a pod has been permanently scheduled onto a node.
Plugins can deviate from that. They can do nothing when setting "phase: allocated" and do all of the actual allocation during NodePrepareResource. But that's not how it should be done.
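To summarize the intended split, here is a hedged sketch of the node-local interface as I read this thread; the method names match the KEP, but the request/response types below are placeholders rather than the actual gRPC spec:

```go
// Sketch of the node-local calls as discussed here: allocation happens in
// the control plane before scheduling; these calls only prepare (and later
// clean up) an already allocated resource on the chosen node.
package sketch

import "context"

// Placeholder messages; the real gRPC definitions are part of the KEP.
type (
	NodePrepareResourceRequest    struct{ ResourceID string }
	NodePrepareResourceResponse   struct{ CDIDevices []string }
	NodeUnprepareResourceRequest  struct{ ResourceID string }
	NodeUnprepareResourceResponse struct{}
)

type NodeServer interface {
	// NodePrepareResource is called by kubelet only after the pod has been
	// permanently scheduled onto the node, so it should be lightweight and
	// very unlikely to fail: at this point the pod cannot be moved elsewhere.
	NodePrepareResource(ctx context.Context, req *NodePrepareResourceRequest) (*NodePrepareResourceResponse, error)

	// NodeUnprepareResource undoes NodePrepareResource once no pod on the
	// node needs the resource anymore.
	NodeUnprepareResource(ctx context.Context, req *NodeUnprepareResourceRequest) (*NodeUnprepareResourceResponse, error)
}
```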
| Resource does not exist | 5 NOT_FOUND | Indicates that a resource corresponding to the specified `resource_id` does not exist. | Caller MUST verify that the `resource_id` is correct and that the resource is accessible and has not been deleted before retrying with exponential back off. |

#### `NodeUnprepareResource`
Similar to my comment above, I think we should rename NodeUnprepareResource to NodeDeallocateResource or simply DeallocateResource.
See #3064 (comment): because the "real" allocation happens earlier, these node-local calls are really just about "preparing" the resource for usage by a pod.
*Limitation*: GPU driver requirements are currently stored as labels on the container image, and device plugins do not have access to the container image.
How does "device plugins do not have access to the container image" relate to "gracefully fail to start"?
runtime-specific operations that may need to occur as a result of injecting a device into a container (device plugins are runtime agnostic). A good example is supporting GPU passthrough virtualization on kata vs. runc.
The limitation seems to complain about having no way to perform runtime-specific operations, but the requirement states "irrespective of the underlying container runtime", which is runtime agnostic. Could you clarify the use case and the problem this KEP can help solve?
I didn't get this one either.
I've removed all limitations after "problems that are handled by CDI". The motivation section now just ends with:
Several other limitations are addressed by CDI.
The other limitations directly affect design decisions in the KEP and therefore are relevant, but not so much the things that can be done with CDI besides providing some additional justification for doing this work - hopefully that isn't questionable 😅
- *Access to the plugin container*: When deploying a device plugin, I would like to ensure that all of the operations that need to occur as part of “injecting” a device into a container, also occur for the “plugin container” itself.
By "plugin container", do you mean the "third-party device plugin" that runs as a container?
I'm not sure either what the "plugin container" here is. @kad?
An example would go a long way.
- ResourceClass, not namespaced, with privileged parameters for multiple resource instances of a certain kind. All these instances are provided by the same resource plugin, which is identified by a field in the class.
Do you mean that the resource class holds parameters of the resource exposed/provided by a specific resource plugin?
Does this resource plugin refer to the third-party plugins on lines 191-192?
The ResourceClass holds parameters for a resource driver. They are set by the cluster admin and thus can be different in different clusters although the resource driver is the same. These parameters will be used for all ResourceClaim instances that use this ResourceClass.
Can you give an example of a specific device and describe how many classes an admin would typically create? I'm just looking to get a sense of the type of parameters that can be set for groups of pods.
Could you add this clarification to the KEP?
I've rewritten the ResourceClass line item, including an example.
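For readers of this thread, a rough illustration of the kind of classes meant here; the class names, driver name and parameter keys are invented for this example, not taken from the KEP:

```go
// Illustration only: hypothetical classes a cluster admin might create for
// one and the same GPU resource driver; names and parameters are made up.
package sketch

type ResourceClass struct {
	Name       string
	DriverName string
	Parameters map[string]string
}

var exampleClasses = []ResourceClass{
	{
		Name:       "gpu-shared",      // hypothetical class name
		DriverName: "gpu.example.com", // both classes use the same driver
		Parameters: map[string]string{"sharing": "enabled", "maxClientsPerGPU": "4"},
	},
	{
		Name:       "gpu-exclusive",
		DriverName: "gpu.example.com",
		Parameters: map[string]string{"sharing": "disabled"},
	},
}
```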
containers in a pod. Depending on the capabilities defined in the ResourceClaim by the plugin, a ResourceClaim can be used exclusively by one pod at a time, by a certain maximum number of pods, or an unlimited number of pods.
Does a ResourceClaim have capacity? If they are shared, then the design needs to include an accounting of the available capacity, right?
The proposal only covers counting users of a ResourceClaim. We can't do more than that because a concept of "capacity" would be resource specific and thus out of scope of this KEP.
In practice, the common cases will probably be "a single user" (= exclusive access) and "unlimited". The API also supports a fixed number because that is trivial to support and perhaps might be useful.
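A minimal sketch of what that user counting could look like, assuming invented field names rather than the final API:

```go
// Sketch only: counting users of a (possibly shared) ResourceClaim; the
// field names are illustrative, not the final API.
package sketch

type ResourceClaimStatus struct {
	// MaxUsers encodes the sharing capability reported by the driver:
	// 1 = exclusive access, 0 = unlimited, N = at most N pods at a time.
	MaxUsers int

	// ReservedFor lists the pods currently allowed to use the claim.
	// Kubernetes only counts entries here; any resource-specific notion
	// of "capacity" stays inside the driver.
	ReservedFor []string
}

// canReserve reports whether one more pod may be added as a user.
func canReserve(s ResourceClaimStatus) bool {
	return s.MaxUsers == 0 || len(s.ReservedFor) < s.MaxUsers
}
```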
Could you add this clarification to the KEP?
Done.
Cathy pointed out that in several places, "plugin" is ambiguous. Sometimes the context determines whether it is a plugin for the scheduler, kubelet or the entire cluster. We cannot do much about scheduler and kubelet plugins; that's simply what they are called there. But we can avoid "plugin" as the name for the third-party cluster add-on. Let's do the same as for storage and call it "driver". Then we have:
"controller" is also overloaded. We have a "resource claim controller" (the in-tree component that we add to kube-controller-manager) and a "resource controller" (the control plane part of the third-party cluster add-on). Can someone think of a better name for the latter? I decided against "Allocator" earlier because that seemed too specific, but right now allocation and deallocation is exactly what that component handles. Because we don't need to bake that name into any API (i.e. it's only used in the KEP and presentations), we can use that name now and, if the purpose ever changes, pick something more generic when that happens.
Finally, do we need a more specific term instead of just "resource"? There are already "native resources" (the ones handled by kube-scheduler itself), "extended resources" (current device interface, handled by webhooks), and now "custom or generic resources"? If we pick such an adjective, should we also use it in the API (ResourceClaim -> CustomResourceClaim, PodSpec.Resources -> PodSpec.CustomResources, ContainerSpec.PodResources -> ContainerSpec.CustomResources)?
As this now seems to be heading towards "implementable" (:crossed_fingers:), let me include the additional commits that I had pending in the other PR. All of those changes are in proper additional commits (thus easy to review) and just tweak the text. The main change is the status change to "implementable".
devices must be carefully tuned to ensure that GPU resources do not go unused because some of the pre-partitioned devices are in low demand. It also puts the burden on the user to pick a particular MIG device type, rather than declaring the resource constraints more abstractly.
Nit: Picking a MIG device type (a name for a pre-configured set of resources) would be the more-abstract option; I think what we want (and get) from this KEP is being able to declare our resource constraints more "explicitly".
I guess it depends on how you look at it. By “more abstractly” I meant you might only care about making sure you get at least 5GB of memory on your GPU and then the plugin can either create a MIG device that satisfies that for you, or give you a full GPU if that’s all that’s available and it’s OK with that. With pre-defined MIG devices a user has to explicitly request that device type rather than providing a more abstract set of resource constraints that the plugin can work with to find the best GPU for the job.
That's fair. I read the case we're trying to enable not as "choosing the best GPU", but as subdividing a GPU on demand to match your explicit resource constraints, e.g. "at least 5GB", vs. having to specify only a pre-defined GPU division (I thought that's what "MIG device type" meant here) which probably matches the container's requirements, but is abstracted behind a name.
True. Looking at just the "resource sharing" / subdivision aspect, what you say makes sense. I ended up conflating two different advantages of the new proposal into one, i.e. better support for "resource sharing" and better support for matching workloads to "right sized" resources based on an abstract set of constraints.
Two minor points, more informational than blocking.
Thanks for all the hard work on this.
For SIG Autoscaling:
/lgtm
drivers can be added to the configuration file defined by the `--cloud-config` parameter, either with a common field that gets added in all cloud provider configs, or a new `--config` parameter can be added. Later, dynamically discovering deployed webhooks can be added through an autoscaler CRD.
Minor heads up that we're beginning discussions about improving the config mechanism for the CA to stop the continuing explosion in command-line flags; the current discussion is focusing on the use of ConfigMaps.
`NodeReady` is needed to solve one particular problem: when a new node first starts up, it may be ready to run pods, but the pod from a resource driver's DaemonSet may still be starting up. If the resource driver controller needs information from such a pod, then it will not be able to filter correctly. Similar to how extended resources are handled, the autoscaler then first needs to wait until the extender also considers the node to be ready.
This will likely add some complexity to advice around the unready flags in the CA for slow-to-start resource driver controllers, especially -ok-total-unready-count, though it is easily addressed.
/hold Let's make sure that @alculquicondor and @thockin confirm merging of this PR as "implementable".
// The resource driver checks this field and resets it to false
// together with clearing the Allocation field. It also sets it
// to false when the resource is not allocated.
Deallocate bool
This field is going to be something to watch out for during implementation. But good enough as a high level idea.
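To make the handshake around this field concrete, a rough sketch of the intended flow; the helper names and the simplified status type are mine, not from the KEP:

```go
// Sketch of the deallocation handshake around the Deallocate field, as I
// read the discussion; not actual controller code.
package sketch

type ClaimStatus struct {
	Allocation *string // non-nil while the resource is allocated (simplified)
	Deallocate bool    // set by Kubernetes to ask the driver to deallocate
}

// requestDeallocation is what the in-tree claim controller (or scheduler)
// would do when the current allocation is no longer usable.
func requestDeallocation(s *ClaimStatus) {
	if s.Allocation != nil {
		s.Deallocate = true
	}
}

// driverSync is the driver's side: perform the resource-specific
// deallocation, then clear both fields together, and never leave the flag
// set on a claim that is not allocated.
func driverSync(s *ClaimStatus, deallocate func() error) error {
	if !s.Deallocate {
		return nil
	}
	if s.Allocation == nil {
		s.Deallocate = false // nothing allocated, just reset the flag
		return nil
	}
	if err := deallocate(); err != nil {
		return err // retry later; Deallocate stays set
	}
	s.Allocation = nil
	s.Deallocate = false
	return nil
}
```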
/lgtm
from sig-scheduling perspective.
Just for the record, I still have my reservations on the cluster-autoscaler story.
/lgtm This whole proposal is VERY daunting, but I don't see why we should block it further - the bar was set and cleared. CDI is a new animal to me, so we should watch it carefully. Important: we should not merge any part of this until we resolve the open issues ESPECIALLY around scheduling. We don't want to get to 1.25 code freeze and have a partially built mess to deal with :)
/lgtm from SIG Node perspective. Added some comments, and some concerns remain about the complexities and security impacts for K8s vendors, but it's ok to carry on those discussions during the implementation phase or via pohly#13.
@dchen1107 since this is a sig-node KEP it needs an /approve from you as well in order to move forward.
/approve
/assign @johnbelamaric Can you do the final approval?
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dchen1107, johnbelamaric, klueska, pohly
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/sig node