Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a max failed CNRs threshold to Nodegroups #88

Merged
merged 8 commits into from
Sep 2, 2024

Conversation

vincentportella
Copy link
Member

@vincentportella vincentportella commented Aug 23, 2024

  • Adding a new maxFailedCycleNodeRequests attribute to the NodeGroup spec which defines how many failed CNRs are allowed for a nodegroup before observer will stop generating new ones. The equivalent to the current behaviour is maxFailedCycleNodeRequests: 0. This is not a breaking change.
  • The Cyclops controller will now delete any sibling failed CNRs when a new one reaches the Successful phase.
  • The cyclops_cycle_node_requests_by_phase metric has been update to include a new nodegroup attribute which is the name of the nodegroup the CNR was generated from. Manually created CNRs that don't match a nodegroup will include the attribute with an empty string.

@vincentportella vincentportella changed the title Add a max failed CNRs threshold to a Nodegroup Add a max failed CNRs threshold to Nodegroups Aug 23, 2024
@vincentportella vincentportella self-assigned this Aug 23, 2024
awprice
awprice previously approved these changes Aug 26, 2024
…to vportella/add-max-failed-cnrs-threshold
@@ -30,6 +44,10 @@ type Transitioner struct {

CloudProviderInstances []*mock.Node
KubeNodes []*mock.Node

extrakubeObjects []client.Object
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extraKubeObjects

},
}

// Failed CNR for the same nodegroup in a different namespace
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are CNRs intended to be namespaced? I can't see much value in separating them by namespace like this but being able to act on the same set of backing nodes provided by a cloud provider.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would go back to when they were designed. I'm not particularly sure, though changing that is out of scope here.

@@ -536,3 +536,38 @@ func (t *CycleNodeRequestTransitioner) validateInstanceState(validNodeGroupInsta

return false, nil
}

// deleteFailedSiblingCNRs finds the CNRs generated for the same nodegroup as
// the one in the calling transitioner. It filters for deleted CNRs in the same
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it's not a calling transitioner, it's the parent struct technically. I would just say "the one in the transitioner"

}
}
if found {

if dropNodeGroup {
klog.Warningf("nodegroup %q has an in progress CNR.. skipping this nodegroup", nodeGroup.Name)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This log message now doesn't match reality. dropNodeGroup is being set to true when the number of failed CNRs exceeds the threshold, and not when it has an in progress CNR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a second log line there because Failed is defined as "in progress" in the existing implementation. I'm adding functionality to skip a certain number of them from counting as "in progress" so I will update for that.

mwhittington21
mwhittington21 previously approved these changes Aug 29, 2024
mwhittington21
mwhittington21 previously approved these changes Sep 2, 2024
@MinyiZ MinyiZ self-requested a review September 2, 2024 00:23
@vincentportella vincentportella merged commit dc687f3 into master Sep 2, 2024
3 checks passed
@vincentportella vincentportella deleted the vportella/add-max-failed-cnrs-threshold branch September 2, 2024 00:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants