Skip to content
This repository was archived by the owner on Aug 12, 2025. It is now read-only.

Conversation

@detiber
Copy link
Contributor

@detiber detiber commented Dec 4, 2020

What this PR does / why we need it:

  • Enables better re-entrancy for controller crashes during Machine creation and deletion

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 4, 2020
@k8s-ci-robot k8s-ci-robot added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 4, 2020
Copy link
Contributor

@cpanato cpanato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/hold for maintainers to give a final look

I did the test running locally the e2e and it runs fine
will be good to we add the account in cncf so we can test this and become more confident.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 5, 2020
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 5, 2020
@cpanato
Copy link
Contributor

cpanato commented Dec 5, 2020

/assign @gianarb

switch {
// TODO: find a better way than parsing the error messages for this.
case strings.Contains(err.Error(), " no available hardware reservations "):
case err != nil && strings.Contains(err.Error(), " no available hardware reservations "):
Copy link
Contributor

@displague displague Dec 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this non-fatal? (I could use comments)

Does the provider API return an error even though the device will be created, making this more of a provider warning than a fatal error?
Is the HTTP response a 2xx with an errors block, perhaps? If so that may be a better way to detect this condition than string matching.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updating the comment to provide more info, but if we do not special case this then Machine creation will fail after the first attempt and will not be retried (presence of ErrorMessage/ErrorReason in CAPI indicates a fatal error that should not be retried).

This was causing issues with users that wanted to run clusters on reserved hardware without having additional free reservations available during the upgrade, since attempts to create new Machines before the reservations were made available by Machines that were being deleted weren't immediately available.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now in this block, we are going to try to adopt any existing device that carries the tags we would have placed in our previous attempt to create this device.

Is this because the create call can error before returning the UUID (perhaps a timeout), somehow despite the error the device creation may succeed, and we want to try to salvage the creation?

It mainly means that we haven't recorded the providerID for any reason. That could be:

  • k8s api call to patch the resource to set the providerID failed
  • controller crashed, preventing reconciliation from completing
  • api call to EM API to create device succeeded, but returned an error for some reason (could be timeout)

Thinking through this more, I should also update reconcileDelete to account for these use cases as well to avoid orphaning resources in the same cases as above.

Copy link
Contributor

@displague displague left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect the updated reconcile logic in this PR may prevent any creates.

I also left some nitpicks about comments and the conditions used (before this PR and in this PR).

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 7, 2020
@detiber
Copy link
Contributor Author

detiber commented Dec 7, 2020

Moving the immediate fix for v0.3.6 to #212 to separate discussion about the reentrancy changes.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 7, 2020
@detiber detiber changed the title 🐛 Fix crash on Machine creation and improve re-entrancy 🐛 Improve re-entrancy of device creation and delete workflows Dec 7, 2020
@detiber detiber changed the title 🐛 Improve re-entrancy of device creation and delete workflows ✨ Improve re-entrancy of device creation and delete workflows Dec 7, 2020
@gianarb
Copy link
Contributor

gianarb commented Dec 8, 2020

/lgtm

Thanks!!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 8, 2020
}

if device != nil {
if err := r.deleteDevice(device.ID, force); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you want to put the same http.StatusNotFound check here, or make deleteDevice include that logic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it should be needed here. If by chance the device is deleted in between the list call made in GetDeviceByTags and the call to delete, then we'll just end up doing another reconciliation to remove the finalizer.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cpanato, detiber, displague, gianarb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 8, 2021
@fejta-bot
Copy link

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 7, 2021
@displague
Copy link
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 20, 2021
dev, err = r.PacketClient.GetDevice(providerID)
if err != nil {
var perr *packngo.ErrorResponse
if errors.As(err, &perr) && perr.Response != nil && perr.Response.StatusCode == http.StatusNotFound {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should also capture http.StatusForbidden for #245, unless the behavior has changed in the provider API.

@k8s-triage-robot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 3, 2021
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 2, 2021
@detiber
Copy link
Contributor Author

detiber commented Sep 8, 2021

Closing in favor of #269

@detiber
Copy link
Contributor Author

detiber commented Sep 8, 2021

/close

@k8s-ci-robot
Copy link
Contributor

@detiber: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants