-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add retry for detach azure disk #74398
add retry for detach azure disk #74398
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andyzhangx The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
add more logging info in detach disk
df2edd2
to
8c53db0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
BTW, if move 8 disks from one node to another in parellel, there will be always one disk not detched(only one), the error is like following:
The detach action finally failed, it's more like |
Then let's hold a while for root causes. /hold |
/test pull-kubernetes-e2e-aks-engine-azure |
the funny thing is |
@feiskyer The issue I could repro is related to VMSS, while I still insist on adding this retry logic only for detach disk (not necessary for attach disk since k8s controller will retry if failed) in case there is any potential issue, thus azure cloud provider could have more chance to retry, although in this case, it does not work perfectly. what's your opinion? |
Agreed. Let's wait a while for VMSS responses, in case there're still other potential issues. |
the failed error is most likely due to slow disk attach/detach on VMSS, let's merge this PR first since it would mitigate the issue a little |
…4398-upstream-release-1.13 Automated cherry pick of #74398: add retry for detach azure disk
…4398-upstream-release-1.11 Automated cherry pick of #74398: add retry for detach azure disk
…4398-upstream-release-1.12 Automated cherry pick of #74398: add retry for detach azure disk
@andyzhangx @feiskyer I see there was no final conclusion on the issue, but in some clusters we can't seem to be able to get out of that "AttachDiskWhileBeingDetached" error loop without deleting affected Pods and giving Azure enough time to clean up the mess (detach disks from their VMSS instance). |
@antoineco the |
@andyzhangx thanks for the quick answer, unfortunately I'm not familiar with the "RP" acronym sorry (😅). What I'm currently observing is a seemingly infinite loop of:
I've tried waiting for over an hour sometimes, but for some reason the retry logic is not enough. Detaching disks manually (e.g. via the |
@antoineco can you open an issue and also provide details info:
And what's the status of disk |
@antoineco you may also file an azure support ticket for this issue, and paste this github link, and azure team would analyze your issue first, pls also provide my suggested info, thanks. |
@andyzhangx I will open an issue, just let me drop that here first for future reference:
|
pls also cherry pick this PR: #74398, it fixed this issue: BTW, recently we found slow disk attach/detach issue when disk num is large, how many disks are attached to one VM? |
It's been cherry-picked but I documented it as #74581 (all references are to 1.12 cherry-picks). Maybe I lost something important in the conflict resolution and need to review again. Each node usually has about 5 disks attached, which should be fairly reasonable. |
Ref. #78172 |
What type of PR is this?
/kind bug
What this PR does / why we need it:
Current azure cloud provider would fail to detach azure disk when there is server side error, need to add retry mechanism for detach disk operation, while for attach disk operation, it's not necessary since k8s pv-controller will try if first try failed.
Which issue(s) this PR fixes:
Fixes #74396
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
/kind bug
/assign @feiskyer
/priority important-soon
/sig azure