Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add retry for detach azure disk #74398

Merged
merged 1 commit into from
Feb 26, 2019

Conversation

andyzhangx
Copy link
Member

What type of PR is this?
/kind bug

What this PR does / why we need it:
Current azure cloud provider would fail to detach azure disk when there is server side error, need to add retry mechanism for detach disk operation, while for attach disk operation, it's not necessary since k8s pv-controller will try if first try failed.

Which issue(s) this PR fixes:

Fixes #74396

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

fix issue: fail to detach azure disk when there is server side error

/kind bug
/assign @feiskyer
/priority important-soon
/sig azure

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Feb 22, 2019
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/azure cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 22, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 22, 2019
@k8s-ci-robot k8s-ci-robot added the sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. label Feb 22, 2019
add more logging info in detach disk
@andyzhangx andyzhangx force-pushed the detach-azuredisk-retry branch from df2edd2 to 8c53db0 Compare February 22, 2019 07:45
Copy link
Member

@feiskyer feiskyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 22, 2019
@andyzhangx
Copy link
Member Author

andyzhangx commented Feb 22, 2019

BTW, if move 8 disks from one node to another in parellel, there will be always one disk not detched(only one), the error is like following:

I0222 08:38:22.503087       1 azure_controller_vmss.go:140] azureDisk - update(andy-vmss1124) backing off: vm(k8s-agentpool-24194600-vmss000001) detach disk(, /subscriptions/xxx/resourceGroups/andy-vmss1124/providers/Microsoft.Compute/disks/andy-vmss1124-dynamic-pvc-66815035-35ac-11e9-921c-000d3a002125), err: compute.VirtualMachineScaleSetVMsClient#Update: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="AttachDiskWhileBeingDetached" Message="Cannot attach data disk '763c0d05-4ae5-4699-8ce8-9c8ced5283da' to VM 'k8s-agentpool-24194600-vmss_1' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again."

The detach action finally failed, it's more like attach a disk to node#1 would cause detach that disk from node#2 to fail which is not reasonable, I will contact with azure disk RP to check with this issue.

@feiskyer
Copy link
Member

Then let's hold a while for root causes.

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 22, 2019
@feiskyer
Copy link
Member

/test pull-kubernetes-e2e-aks-engine-azure

@andyzhangx
Copy link
Member Author

the funny thing is Cannot attach data disk '763c0d05-4ae5-4699-8ce8-9c8ced5283da' to VM 'k8s-agentpool-24194600-vmss_1', while I never have data disk '763c0d05-4ae5-4699-8ce8-9c8ced5283da' in my resource group, and I could repro that easily in two vmss k8s clusters.

@andyzhangx
Copy link
Member Author

@feiskyer The issue I could repro is related to VMSS, while I still insist on adding this retry logic only for detach disk (not necessary for attach disk since k8s controller will retry if failed) in case there is any potential issue, thus azure cloud provider could have more chance to retry, although in this case, it does not work perfectly. what's your opinion?

@feiskyer
Copy link
Member

The issue I could repro is related to VMSS, while I still insist on adding this retry logic only for detach disk (not necessary for attach disk since k8s controller will retry if failed) in case there is any potential issue, thus azure cloud provider could have more chance to retry, although in this case, it does not work perfectly. what's your opinion?

Agreed. Let's wait a while for VMSS responses, in case there're still other potential issues.

@andyzhangx
Copy link
Member Author

the failed error is most likely due to slow disk attach/detach on VMSS, let's merge this PR first since it would mitigate the issue a little
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 26, 2019
@k8s-ci-robot k8s-ci-robot merged commit d500740 into kubernetes:master Feb 26, 2019
k8s-ci-robot added a commit that referenced this pull request Feb 27, 2019
…4398-upstream-release-1.13

Automated cherry pick of #74398: add retry for detach azure disk
k8s-ci-robot added a commit that referenced this pull request Mar 5, 2019
…4398-upstream-release-1.11

Automated cherry pick of #74398: add retry for detach azure disk
k8s-ci-robot added a commit that referenced this pull request Mar 7, 2019
…4398-upstream-release-1.12

Automated cherry pick of #74398: add retry for detach azure disk
@antoineco
Copy link
Contributor

@andyzhangx @feiskyer I see there was no final conclusion on the issue, but in some clusters we can't seem to be able to get out of that "AttachDiskWhileBeingDetached" error loop without deleting affected Pods and giving Azure enough time to clean up the mess (detach disks from their VMSS instance).

@andyzhangx
Copy link
Member Author

@antoineco the AttachDiskWhileBeingDetached issue on VMSS is already fixed in VMSS RP in mid April. This PR fixed the issue when disk detach operation failed and the disk is never detached on the original node.

@antoineco
Copy link
Contributor

antoineco commented May 20, 2019

@andyzhangx thanks for the quick answer, unfortunately I'm not familiar with the "RP" acronym sorry (😅).

What I'm currently observing is a seemingly infinite loop of:

Code="AttachDiskWhileBeingDetached"
Message="Cannot attach data disk 'foo-bar-123' to VM 'mycluster-vmss_0' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again."

I've tried waiting for over an hour sometimes, but for some reason the retry logic is not enough. Detaching disks manually (e.g. via the az CLI) returns the same error.

@andyzhangx
Copy link
Member Author

@antoineco can you open an issue and also provide details info:

  1. k8s version
  2. region

And what's the status of disk foo-bar-123? is it attached or unattached?
if that disk is attached, you could also provide info by az vmss show -g <RESOURCE_GROUP_NAME> --name <VMSS_NAME> --instance-id <ID(number)> to show which VM is that disk attached to.

@andyzhangx
Copy link
Member Author

@antoineco you may also file an azure support ticket for this issue, and paste this github link, and azure team would analyze your issue first, pls also provide my suggested info, thanks.

@antoineco
Copy link
Contributor

antoineco commented May 20, 2019

@andyzhangx I will open an issue, just let me drop that here first for future reference:

  • Kubernetes version: it's complicated (I'll explain, but we're testing 1.12 too)
  • Regions: eastus, westeurope

"diskState": "Unattached"
(sometimes the disk is attached and impossible to detach for > 20 min)

@andyzhangx
Copy link
Member Author

andyzhangx commented May 20, 2019

@andyzhangx I will open an issue, just let me drop that here first for future reference:

  • Kubernetes version: it's complicated (I'll explain, but we're testing 1.12 too)
  • Regions: eastus, westeurope

"diskState": "Unattached"
(sometimes the disk is attached and impossible to detach for > 20 min)

pls also cherry pick this PR: #74398, it fixed this issue: (sometimes the disk is attached and impossible to detach for > 20 min), only need to patch the kube-controller-manager

BTW, recently we found slow disk attach/detach issue when disk num is large, how many disks are attached to one VM?

@antoineco
Copy link
Contributor

It's been cherry-picked but I documented it as #74581 (all references are to 1.12 cherry-picks). Maybe I lost something important in the conflict resolution and need to review again.

Each node usually has about 5 disks attached, which should be fairly reasonable.

@antoineco
Copy link
Contributor

Ref. #78172

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fail to detach azure disk when there is server side error
4 participants