✨ Improve re-entrancy of device creation and delete workflows #209
Conversation
/lgtm
/hold for maintainers to give a final look
I tested this by running the e2e suite locally and it passes.
It would be good to add the account in CNCF so we can test this and become more confident.
/assign @gianarb
```diff
 switch {
 // TODO: find a better way than parsing the error messages for this.
-case strings.Contains(err.Error(), " no available hardware reservations "):
+case err != nil && strings.Contains(err.Error(), " no available hardware reservations "):
```
Why is this non-fatal? (I could use comments)
Does the provider API return an error even though the device will be created, making this more of a provider warning than a fatal error?
Is the HTTP response a 2xx with an errors block, perhaps? If so that may be a better way to detect this condition than string matching.
Updating the comment to provide more info. If we do not special-case this, then Machine creation will fail after the first attempt and will not be retried (the presence of ErrorMessage/ErrorReason in CAPI indicates a fatal error that should not be retried).
This was causing issues for users who wanted to run clusters on reserved hardware without additional free reservations available during an upgrade: attempts to create new Machines would fail because the reservations held by Machines that were still being deleted had not yet been released.
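For illustration, here is a minimal sketch of the retry behavior being described, not the PR's actual code: the reservation-exhaustion error is treated as transient and requeued, while other errors still surface normally. `createDevice` and `machineScope` are hypothetical stand-ins for the provider's real helpers.

```go
package controllers

import (
	"context"
	"strings"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Sketch only: createDevice and machineScope are hypothetical stand-ins
// for the provider's actual helpers.
func (r *PacketMachineReconciler) reconcileCreate(ctx context.Context, scope *machineScope) (ctrl.Result, error) {
	_, err := r.createDevice(ctx, scope)
	switch {
	case err != nil && strings.Contains(err.Error(), "no available hardware reservations"):
		// Transient: a reservation may be freed once a Machine that is
		// still being deleted releases its hardware. Requeue instead of
		// setting ErrorReason/ErrorMessage, which CAPI treats as terminal.
		return ctrl.Result{RequeueAfter: time.Minute}, nil
	case err != nil:
		return ctrl.Result{}, err
	}
	// providerID bookkeeping etc. would continue here.
	return ctrl.Result{}, nil
}
```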
Now in this block, we are going to try to adopt any existing device that carries the tags we would have placed in our previous attempt to create this device.
Is this because the create call can error before returning the UUID (perhaps a timeout), and the device creation may still succeed despite the error, so we want to try to salvage the creation?
It mainly means that we haven't recorded the providerID, for whatever reason. That could be:
- the k8s API call to patch the resource to set the providerID failed
- the controller crashed, preventing reconciliation from completing
- the call to the Equinix Metal API to create the device succeeded, but an error was returned for some reason (e.g. a timeout)

Thinking through this more, I should also update reconcileDelete to account for these cases as well, to avoid orphaning resources in the same situations as above.
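As a rough sketch of the adoption path described here, assuming a `GetDeviceByTags` helper like the one referenced later in this review; the tag scheme, `machineScope`, and the exact signatures are illustrative, not the PR's identifiers:

```go
package controllers

import (
	"context"

	"github.com/packethost/packngo"
)

// Sketch only: the tag scheme, machineScope, and the GetDeviceByTags
// signature are assumptions, not the PR's actual identifiers.
func (r *PacketMachineReconciler) adoptOrCreateDevice(ctx context.Context, scope *machineScope) (*packngo.Device, error) {
	tags := []string{
		"cluster:" + scope.ClusterName(), // assumed tags written at create time
		"machine:" + scope.MachineName(),
	}
	// Look for a device left behind by a previous attempt whose providerID
	// was never recorded (failed patch, controller crash, or an error
	// response even though the device was created).
	dev, err := r.PacketClient.GetDeviceByTags(scope.ProjectID(), tags)
	if err != nil {
		return nil, err
	}
	if dev != nil {
		return dev, nil // adopt instead of creating a duplicate
	}
	return r.createDevice(ctx, scope)
}
```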
I suspect the updated reconcile logic in this PR may prevent any creates.
I also left some nitpicks about comments and the conditions used (before this PR and in this PR).
Moving the immediate fix for v0.3.6 to #212 to separate discussion about the re-entrancy changes.
/lgtm

Thanks!!
```go
}

if device != nil {
	if err := r.deleteDevice(device.ID, force); err != nil {
```
Did you want to put the same http.StatusNotFound check here, or make deleteDevice include that logic?
I don't think it should be needed here. If by chance the device is deleted in between the list call made in GetDeviceByTags and the call to delete, then we'll just end up doing another reconciliation to remove the finalizer.
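For completeness, a minimal sketch of the alternative raised above, in which deleteDevice itself treats a 404 as success so the delete path is idempotent. `packngo.ErrorResponse` is the library's real error type; the receiver and the `Devices.Delete` call as written here are assumptions.

```go
package controllers

import (
	"errors"
	"net/http"

	"github.com/packethost/packngo"
)

// Sketch only: the receiver and Delete signature are assumptions;
// packngo.ErrorResponse is the real error type.
func (r *PacketMachineReconciler) deleteDevice(deviceID string, force bool) error {
	_, err := r.PacketClient.Devices.Delete(deviceID, force)
	var perr *packngo.ErrorResponse
	if errors.As(err, &perr) && perr.Response != nil && perr.Response.StatusCode == http.StatusNotFound {
		// The device disappeared between the list call and this delete;
		// treat it as success so the finalizer can still be removed.
		return nil
	}
	return err
}
```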
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cpanato, detiber, displague, gianarb

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Approvers can indicate their approval by writing /approve in a comment.
Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
/remove-lifecycle rotten
```go
dev, err = r.PacketClient.GetDevice(providerID)
if err != nil {
	var perr *packngo.ErrorResponse
	if errors.As(err, &perr) && perr.Response != nil && perr.Response.StatusCode == http.StatusNotFound {
```
Perhaps we should also capture http.StatusForbidden for #245, unless the behavior has changed in the provider API.
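A hedged sketch of what that could look like, folding both status codes into one check. The helper name is illustrative, and whether 403 should really map to "not found" depends on how #245 shakes out.

```go
package controllers

import (
	"errors"
	"net/http"

	"github.com/packethost/packngo"
)

// isDeviceNotFound is a hypothetical helper: per the suggestion above, it
// folds both 404 and 403 from the provider API into "device not found".
func isDeviceNotFound(err error) bool {
	var perr *packngo.ErrorResponse
	if !errors.As(err, &perr) || perr.Response == nil {
		return false
	}
	code := perr.Response.StatusCode
	return code == http.StatusNotFound || code == http.StatusForbidden
}
```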
Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
Closing in favor of #269
/close
@detiber: Closed this PR.

In response to this:

> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.