Karpenter NodeClaim Disruption spec.nodeClassRef Drift Detection #337

Comments
Knative has a "duck hunter" solution for this exact use case: https://github.com/knative/pkg/blob/main/apis/duck/ABOUT.md. I don't think we should build in this complexity, though.
This could be interesting. Looks like the latest versions of
Referring to #825 due to the similarity of the problem.
Some ideas around how we might be able to handle this:
I like #1, to use the refs to dynamically drive watchers.
It's definitely the lowest CloudProvider burden. Would be interesting to see how we could handle this with the new dynamic watch construct that controller-runtime has put out. Basically, whenever you get a NodePool creation, you can check if the watch exists and add it to the controller. When you get a Delete you do the opposite. You shouldn't miss a delete while the controller is up and when the controller is down you don't care.
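A rough sketch of what that could look like, assuming the pre-v0.15 controller-runtime API (the `Watch`/`source.Kind` signatures have since changed); `WatchManager` and `EnsureWatch` are hypothetical names for illustration, not Karpenter code:

```go
// Sketch only: dynamic watch registration keyed by the nodeClassRef GVK,
// written against the pre-v0.15 controller-runtime API (source.Kind struct,
// three-argument Watch); newer releases have changed these signatures.
package nodepool

import (
	"context"
	"sync"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/source"
)

// WatchManager lazily registers a watch for each node class GVK the first time
// a NodePool referencing it is reconciled. Since watches can't be unregistered,
// each one simply lives for the remainder of the run, as noted above.
type WatchManager struct {
	mu         sync.Mutex
	controller controller.Controller // captured when the controller is built
	registered map[schema.GroupVersionKind]bool
}

func NewWatchManager(c controller.Controller) *WatchManager {
	return &WatchManager{controller: c, registered: map[schema.GroupVersionKind]bool{}}
}

// EnsureWatch is called from the NodePool reconciler with the GVK resolved from
// spec.nodeClassRef; it is a no-op if the watch already exists.
func (w *WatchManager) EnsureWatch(ctx context.Context, gvk schema.GroupVersionKind) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.registered[gvk] {
		return nil
	}
	u := &unstructured.Unstructured{}
	u.SetGroupVersionKind(gvk)
	// Enqueue the node class object itself here; a real implementation would map
	// node class events back to the NodeClaims/NodePools that reference it.
	if err := w.controller.Watch(&source.Kind{Type: u}, &handler.EnqueueRequestForObject{}); err != nil {
		return err
	}
	w.registered[gvk] = true
	return nil
}
```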
Looking into this a little deeper, it doesn't look like they offer a way to unregister the watcher; so, if we went this route, we would be implementing a watch that would be added as soon as the kind showed up and would exist for the lifetime of the run. Alternatively, we could consider using a channel here as a source for events coming from another controller; however, in that case we would probably end up maintaining our own register/deregister logic for informers so that the dynamic client keeps us informed of updates.
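For the channel route, a minimal sketch of that register/deregister bookkeeping, assuming a dynamic-client-backed informer per GVR and the classic `source.Channel` consumer on the other end; all names are illustrative and concurrency concerns are ignored for brevity:

```go
// Sketch only: hand-rolled informer lifecycle for node class GVRs, pushing
// updates onto a channel that a controller can consume via source.Channel.
// Watcher/Register/Deregister are illustrative names, not Karpenter code.
package nodeclasswatch

import (
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/tools/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/event"
)

type Watcher struct {
	client dynamic.Interface
	Events chan event.GenericEvent                       // consumed via &source.Channel{Source: w.Events}
	stops  map[schema.GroupVersionResource]chan struct{} // per-GVR stop channels for deregistration
}

func NewWatcher(c dynamic.Interface) *Watcher {
	return &Watcher{client: c, Events: make(chan event.GenericEvent), stops: map[schema.GroupVersionResource]chan struct{}{}}
}

// Register starts an informer for the given resource and forwards its update
// events; a real implementation would likely share a single informer factory.
func (w *Watcher) Register(gvr schema.GroupVersionResource) {
	if _, ok := w.stops[gvr]; ok {
		return
	}
	stop := make(chan struct{})
	informer := dynamicinformer.NewDynamicSharedInformerFactory(w.client, 0).ForResource(gvr).Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, obj interface{}) {
			// Dynamic informers hand back *unstructured.Unstructured, which satisfies client.Object.
			w.Events <- event.GenericEvent{Object: obj.(client.Object)}
		},
	})
	go informer.Run(stop)
	w.stops[gvr] = stop
}

// Deregister stops the informer for a resource no longer referenced by any NodePool.
func (w *Watcher) Deregister(gvr schema.GroupVersionResource) {
	if stop, ok := w.stops[gvr]; ok {
		close(stop)
		delete(w.stops, gvr)
	}
}
```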
More context for removing watches on the informers dynamically can be found in this proposed PR: kubernetes-sigs/controller-runtime#2159 with some of the original context for the PR found in this issue: kubernetes-sigs/controller-runtime#1884
I wrote a little POC of trying this out with a controller here: https://github.com/jonathan-innis/karpenter/blob/requeue-on-gvk/pkg/controllers/nodepool/gvk/controller.go. It surfaces a watch channel that other controllers can use and watch, as seen here: https://github.com/jonathan-innis/karpenter/blob/requeue-on-gvk/pkg/controllers/nodeclaim/disruption/controller.go#L135
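For context on the consumer side, wiring that channel into a controller could look roughly like the following with the pre-v0.15 builder API; the linked POC is the authoritative version, and the import paths, `nodeClassEvents` channel, and `requestsForNodeClass` helper here are assumptions:

```go
// Sketch only: a consumer controller (standing in for the NodeClaim disruption
// controller) watching the channel surfaced by a GVK watcher, written against
// the pre-v0.15 controller-runtime API.
package disruption

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"

	"sigs.k8s.io/karpenter/pkg/apis/v1beta1" // import path assumed for illustration
)

type Controller struct {
	kubeClient      client.Client
	nodeClassEvents chan event.GenericEvent // fed by the dynamic node class watcher
}

func (c *Controller) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	// Re-evaluate drift for the NodeClaim named in the request (elided).
	return reconcile.Result{}, nil
}

func (c *Controller) Register(ctx context.Context, mgr manager.Manager) error {
	return builder.ControllerManagedBy(mgr).
		For(&v1beta1.NodeClaim{}).
		// Node class events pushed onto the channel re-trigger drift evaluation
		// for every NodeClaim whose NodePool references that node class.
		Watches(&source.Channel{Source: c.nodeClassEvents}, handler.EnqueueRequestsFromMapFunc(
			func(o client.Object) []reconcile.Request {
				return c.requestsForNodeClass(ctx, o)
			},
		)).
		Complete(c)
}

func (c *Controller) requestsForNodeClass(ctx context.Context, o client.Object) []reconcile.Request {
	// List NodeClaims and keep those whose owning NodePool's nodeClassRef matches o (elided).
	return nil
}
```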
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Tell us about your request
Currently, the node controller which is responsible for annotating nodes as expired, drifted, or empty, watches node, provisioner, and pod events, and enqueues reconcile requests for the linked nodes.
https://github.com/aws/karpenter-core/blob/main/pkg/controllers/node/controller.go#L110-L140
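For reference, the wiring being described looks roughly like the sketch below, written against the pre-v0.15 controller-runtime API; the label key and mapping logic are simplified, and the linked controller.go is the source of truth.

```go
// Sketch only: the shape of the node controller's watch wiring described above;
// see the linked controller.go for the real implementation.
package node

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"

	"github.com/aws/karpenter-core/pkg/apis/v1alpha5"
)

func register(ctx context.Context, mgr manager.Manager, r reconcile.Reconciler, c client.Client) error {
	return builder.ControllerManagedBy(mgr).
		For(&v1.Node{}).
		// A provisioner change fans out to every node it launched.
		Watches(&source.Kind{Type: &v1alpha5.Provisioner{}}, handler.EnqueueRequestsFromMapFunc(
			func(o client.Object) []reconcile.Request {
				nodes := &v1.NodeList{}
				if err := c.List(ctx, nodes, client.MatchingLabels{"karpenter.sh/provisioner-name": o.GetName()}); err != nil {
					return nil
				}
				requests := []reconcile.Request{}
				for i := range nodes.Items {
					requests = append(requests, reconcile.Request{NamespacedName: client.ObjectKeyFromObject(&nodes.Items[i])})
				}
				return requests
			},
		)).
		// Pod events map back to the node the pod is scheduled on (for emptiness/expiry checks).
		Watches(&source.Kind{Type: &v1.Pod{}}, handler.EnqueueRequestsFromMapFunc(
			func(o client.Object) []reconcile.Request {
				if nodeName := o.(*v1.Pod).Spec.NodeName; nodeName != "" {
					return []reconcile.Request{{NamespacedName: client.ObjectKey{Name: nodeName}}}
				}
				return nil
			},
		)).
		Complete(r)
}
```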
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
Since karpenter-core isn't aware of provider-specific CRDs, it doesn't track watch events for these CRDs.
We may be able to just watch v1beta1.NodeClaimReference objects.

Are you currently working around this issue?
This only affects the drift annotation. If someone changes their AWSNodeTemplate, they'll have to rely on the node heartbeat or another object event to get the node annotated. Since the heartbeat interval is 5 minutes, this still happens at a reasonable pace, but it could definitely be faster.
Additional Context
No response
Attachments
No response