
[BUG] Cannot detach the restored volume when a node goes down during restore #2103

Closed
PhanLe1010 opened this issue Dec 15, 2020 · 7 comments
Assignees
Labels
  • area/resilience (System or volume resilience)
  • area/volume-attach-detach (Volume attach & detach related)
  • area/volume-backup-restore (Volume backup restore)
  • component/longhorn-manager (Longhorn manager, control plane)
  • kind/bug
  • priority/0 (Must be implemented or fixed in this release, managed by PO)
  • require/auto-e2e-test (Require adding/updating auto e2e test cases if they can be automated)
Milestone

Comments

@PhanLe1010
Contributor

PhanLe1010 commented Dec 15, 2020

Describe the bug
Create a restore volume, restored-vol, from a backup.
During the restoring process, if a node that contains one replica of restored-vol goes down, the volume finishes restoring but remains attached forever; it cannot be manually detached.

To Reproduce
Steps to reproduce the behavior:

  1. Set up a backupstore outside of the cluster, e.g. AWS S3, so that the backupstore is not affected when we take down nodes.
  2. Create a restore volume, restored-vol, from a 1.5 GB backup (see the sketch after this list for what a restore volume looks like at the CR level).
  3. During the restoring process, turn off a node that contains a replica of restored-vol.
  4. Observe that restored-vol finishes restoring but never detaches; the volume cannot be manually detached.
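
For reference, the restore volume in step 2 is just a Longhorn Volume custom resource whose spec.fromBackup points at the backup. The sketch below only prints such a manifest; the apiVersion, the backup URL, and the exact field values are illustrative assumptions and may differ between Longhorn releases.

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// All values below are illustrative; adjust them for a real cluster.
	vol := map[string]interface{}{
		"apiVersion": "longhorn.io/v1beta1",
		"kind":       "Volume",
		"metadata": map[string]interface{}{
			"name":      "restored-vol",
			"namespace": "longhorn-system",
		},
		"spec": map[string]interface{}{
			"fromBackup":       "s3://backup-bucket@us-east-1/?backup=backup-xxxx&volume=source-vol", // hypothetical backup URL
			"numberOfReplicas": 3,
			"size":             "1610612736", // 1.5 GiB expressed in bytes, as a string
		},
	}
	out, err := json.MarshalIndent(vol, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```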

Expected behavior
restored-vol is detached after it finishes restoring.

Environment:

  • Longhorn version: v1.1.0-rc2
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: k3s v1.19.5+k3s1
  • Node config
    • OS type and version: Ubuntu 18.04
    • CPU per node: 2
    • Memory per node: 4GB
    • Disk type(e.g. SSD/NVMe): SSD
    • Network bandwidth and latency between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Digital Ocean

Additional context
Looks like we need to reconsider the logic at https://github.com/longhorn/longhorn-manager/blob/eb98fc29d8ab37ec3c0650150a75d73ed22a4f93/controller/volume_controller.go#L1854
When there is a replica that is not in e.Status.ReplicaModeMap, allScheduledReplicasIncluded is always set to false, which prevents the volume from detaching.
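
A minimal, self-contained sketch of the check described above (paraphrased, not the actual longhorn-manager source; the struct shapes are simplified stand-ins): every scheduled replica has to appear in the engine's ReplicaModeMap, so a replica whose node went down mid-restore keeps allScheduledReplicasIncluded false and blocks auto-detachment.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the longhorn-manager types; the
// real structs live in the longhorn/longhorn-manager repository.
type Replica struct {
	Name   string
	NodeID string // empty means the replica has not been scheduled
}

type EngineStatus struct {
	ReplicaModeMap map[string]string // replica name -> mode (e.g. RW/WO/ERR)
}

// allScheduledReplicasIncluded mirrors the check described above: every
// scheduled replica must be reported in the engine's ReplicaModeMap,
// otherwise the flag stays false and the restore volume is kept attached.
func allScheduledReplicasIncluded(replicas []Replica, e EngineStatus) bool {
	for _, r := range replicas {
		if r.NodeID == "" {
			continue // unscheduled replicas are ignored
		}
		if _, ok := e.ReplicaModeMap[r.Name]; !ok {
			return false // e.g. the replica's node went down mid-restore
		}
	}
	return true
}

func main() {
	replicas := []Replica{
		{Name: "r-1", NodeID: "node-1"},
		{Name: "r-2", NodeID: "node-2"},
		{Name: "r-3", NodeID: "node-3"}, // node-3 went down during the restore
	}
	e := EngineStatus{ReplicaModeMap: map[string]string{"r-1": "RW", "r-2": "RW"}}
	fmt.Println("auto-detach allowed:", allScheduledReplicasIncluded(replicas, e)) // false
}
```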

@yasker yasker added this to the v1.1.1 milestone Dec 15, 2020
@yasker yasker added the component/longhorn-manager, kind/bug, and area/volume-backup-restore labels Dec 15, 2020
@shuo-wu
Contributor

shuo-wu commented Dec 16, 2020

Since Longhorn is waiting for the failed replica to be rebuilt, auto-detachment is disabled. In other words, auto-detachment is applied only once all scheduled-but-failed replicas are cleaned up: https://github.com/longhorn/longhorn-manager/blob/eb98fc29d8ab37ec3c0650150a75d73ed22a4f93/controller/volume_controller.go#L1832

In brief, the workaround for this issue is:

  • Delete all failed replicas for the restore/DR volumes (one possible programmatic sketch follows this list).
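
One way to apply this workaround programmatically, sketched with client-go's dynamic client. This is an assumption-laden illustration: the Replica CRD API version and the spec.failedAt / spec.volumeName paths may differ between Longhorn versions, so verify them (or simply delete the failed replicas from the Longhorn UI) before relying on it.

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (path is an assumption).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Longhorn Replica custom resources; the API version may differ by release.
	gvr := schema.GroupVersionResource{Group: "longhorn.io", Version: "v1beta1", Resource: "replicas"}
	ctx := context.Background()

	list, err := client.Resource(gvr).Namespace("longhorn-system").List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, r := range list.Items {
		// Assumes the failure timestamp and owning volume are exposed at
		// spec.failedAt and spec.volumeName; verify these paths for your version.
		failedAt, _, _ := unstructured.NestedString(r.Object, "spec", "failedAt")
		volume, _, _ := unstructured.NestedString(r.Object, "spec", "volumeName")
		if failedAt != "" && volume == "restored-vol" {
			fmt.Println("deleting failed replica", r.GetName())
			if err := client.Resource(gvr).Namespace("longhorn-system").Delete(ctx, r.GetName(), metav1.DeleteOptions{}); err != nil {
				log.Fatal(err)
			}
		}
	}
}
```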

The following steps can also trigger this case:

  1. Create a restore volume
  2. During the restore process, disable scheduling on a node, then crash the replica process on that node.
  3. The restore volume will stay attached even after the other running replicas finish the restore.

Similar to the new replica replenishment delay logic, the auto-detachment delay is what we expect. Otherwise, Longhorn might need to start rebuilding immediately when users try to use restored volumes, which hurts the user experience. To avoid a long wait for restore volumes, we will apply an enhancement later: #1512

@innobead innobead modified the milestones: v1.1.1, v1.1.2 Apr 14, 2021
@innobead
Member

innobead commented Apr 29, 2021

Similar to the new replica replenishment delay logic, the auto-detachment delay is what we expect. Otherwise, Longhorn might need to start rebuilding immediately when users try to use restored volumes, which hurts the user experience. To avoid a long wait for restore volumes, we will apply an enhancement later: #1512

@shuo-wu
To confirm: in the end, after the delay, will auto-detachment be re-enabled so that the restored volume detaches successfully? If yes, #1512 should be good enough.

@shuo-wu
Contributor

shuo-wu commented Apr 29, 2021

Yes. But maybe we need to check this after the volume refactor: #2527

@innobead innobead modified the milestones: v1.1.2, v1.2.0 Apr 29, 2021
@innobead innobead modified the milestones: v1.2.0, v1.3.0 Aug 12, 2021
@innobead innobead modified the milestones: v1.3.0, v1.4.0 Mar 31, 2022
@innobead innobead modified the milestones: v1.4.0, v1.5.0 Nov 7, 2022
@innobead innobead added the priority/1 and area/volume-attach-detach labels Apr 6, 2023
@innobead innobead modified the milestones: v1.5.0, v1.6.0 May 3, 2023
@innobead innobead added the priority/0 and area/resilience labels and removed the priority/1 label Sep 14, 2023
@ejweber
Contributor

ejweber commented Oct 16, 2023

This appears to be a no-op due to #1512 (which ended up going in at about the same time as #2527).

Test 1:

Conditions:

  • master-head
  • allow-volume-creation-with-degraded-availability: true

Result:

  • Restore finishes.
  • Restore volume detaches.
  • Two replicas show as stopped but with "healthyAt" set.
  • One replica (on the failed node) shows as stopped but with "failedAt" set.
  • Degraded restore volume can be attached.
  • Degraded restore volume contains correct data.

This is working as expected: because allow-volume-creation-with-degraded-availability is enabled, the restore volume auto-detaches when the restore finishes even though it is degraded.

Test 2:

Conditions:

  • master-head
  • allow-volume-creation-with-degraded-availability: false

Initial result:

  • Restore finishes.
  • Restore volume does not detach.
  • Two replicas show as running.
  • One replica (on the failed node) shows as failed.
  • Degraded restore volume can be attached.
  • Degraded restore volume contains correct data.

Long-term result:

  • Time since replica failure exceeds replica-replenishment-wait-interval.
  • A new replica rebuilds for restore volume (there are enough disks in my cluster to allow this).
  • Restore volume detaches.
  • Three replicas show as stopped but with "healthyAt" set.
  • One replica (on the failed node) shows as stopped but with "failedAt" set.
  • Restore volume can be attached.
  • Restore volume contains correct data.
  • Failed replica can be deleted safely.

This is working as expected. As long as it is degraded, the restore volume cannot detach (or be used for a workload). We just have to wait for replica-replenishment-wait-interval for Longhorn to rebuild a new replica. Until then, it is waiting for the existing failed replica to come back online.
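
A hedged sketch of the timing being described, not the actual longhorn-manager implementation: the degraded restore volume waits for its failed replica until the failure is older than replica-replenishment-wait-interval, at which point a new replica is rebuilt and auto-detach can proceed.

```go
package main

import (
	"fmt"
	"time"
)

// shouldReplenish sketches the timing described above: a new replica is only
// rebuilt once the existing replica's failure is older than
// replica-replenishment-wait-interval; until then the degraded restore volume
// stays attached, waiting for the failed replica to come back.
func shouldReplenish(failedAt time.Time, waitInterval time.Duration, now time.Time) bool {
	return now.Sub(failedAt) >= waitInterval
}

func main() {
	failedAt := time.Now().Add(-10 * time.Minute) // replica failed 10 minutes ago
	waitInterval := 600 * time.Second             // the setting's documented default (600s)
	fmt.Println("rebuild a new replica now:", shouldReplenish(failedAt, waitInterval, time.Now()))
}
```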

@ejweber
Contributor

ejweber commented Oct 16, 2023

We probably should have a test like longhorn/longhorn-tests#1394 in order to verify this behavior. I'll work on one as the action-item for completing this ticket.

Check whether the following tests from #6061 already implement the desired behavior (probably not, because they don't appear to handle a scenario in which the failed replica cannot be immediately reused):

  • test_single_replica_restore_failure
  • test_rebuild_with_restoration

@longhorn-io-github-bot

longhorn-io-github-bot commented Oct 20, 2023

Pre Ready-For-Testing Checklist

@chriscchien chriscchien self-assigned this Nov 28, 2023
@chriscchien
Contributor

Verified as passing on longhorn master (longhorn bcd399) with the following test steps.

Precondition: restore a 3 GB volume from an actual S3 backupstore and stop a node that has a running replica while the restore is in progress.

  1. allow-volume-creation-with-degraded-availability: true
    • Volume detached after restoration
    • Data correct
  2. allow-volume-creation-with-degraded-availability: false
    • Volume degraded and not ready for a workload.
    • After starting the stopped node, the volume detached.
    • Data correct
