
[BUG] Cannot detach the restored volume when a node goes down during restore #2103

Closed
PhanLe1010 opened this issue Dec 15, 2020 · 7 comments
Assignees
Labels
  • area/resilience (System or volume resilience)
  • area/volume-attach-detach (Volume attach & detach related)
  • area/volume-backup-restore (Volume backup restore)
  • component/longhorn-manager (Longhorn manager, control plane)
  • kind/bug
  • priority/0 (Must be implemented or fixed in this release, managed by PO)
  • require/auto-e2e-test (Require adding/updating auto e2e test cases if they can be automated)
Milestone

Comments

@PhanLe1010
Contributor

PhanLe1010 commented Dec 15, 2020

Describe the bug
Create a restore volume, restored-vol, from a backup.
During the restoring process, if a node that contains one replica of restored-vol goes down, the volume finishes restoring but remains attached forever; it cannot be manually detached.

To Reproduce
Steps to reproduce the behavior:

  1. Set up a backupstore outside of the cluster, e.g. AWS S3, so that the backupstore is not affected when we take down nodes.
  2. Create a restore volume, restored-vol, from a 1.5 GB backup (see the sketch after this list for what a restore volume looks like at the CR level).
  3. During the restoring process, turn off a node that contains a replica of restored-vol.
  4. Observe that restored-vol finishes restoring but never detaches; the volume cannot be manually detached.
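
For reference, the restore volume in step 2 is just a Longhorn Volume custom resource whose spec.fromBackup points at the backup. The sketch below only prints such a manifest; the apiVersion, the backup URL, and the exact field values are illustrative assumptions and may differ between Longhorn releases.

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// All values below are illustrative; adjust them for a real cluster.
	vol := map[string]interface{}{
		"apiVersion": "longhorn.io/v1beta1",
		"kind":       "Volume",
		"metadata": map[string]interface{}{
			"name":      "restored-vol",
			"namespace": "longhorn-system",
		},
		"spec": map[string]interface{}{
			"fromBackup":       "s3://backup-bucket@us-east-1/?backup=backup-xxxx&volume=source-vol", // hypothetical backup URL
			"numberOfReplicas": 3,
			"size":             "1610612736", // 1.5 GiB expressed in bytes, as a string
		},
	}
	out, err := json.MarshalIndent(vol, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```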

Expected behavior
restored-vol is detached after it finishes restoring.

Environment:

  • Longhorn version: v1.1.0-rc2
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: k3s v1.19.5+k3s1
  • Node config
    • OS type and version: Ubuntu 18.04
    • CPU per node: 2
    • Memory per node: 4GB
    • Disk type(e.g. SSD/NVMe): SSD
    • Network bandwidth and latency between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Digital Ocean

Additional context
Looks like we need to reconsider the logic at https://github.com/longhorn/longhorn-manager/blob/eb98fc29d8ab37ec3c0650150a75d73ed22a4f93/controller/volume_controller.go#L1854
When there is a replica that is not in e.Status.ReplicaModeMap, allScheduledReplicasIncluded is always set to false, which prevents the volume from detaching.
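
A minimal, self-contained sketch of the check described above (paraphrased, not the actual longhorn-manager source; the struct shapes are simplified stand-ins): every scheduled replica has to appear in the engine's ReplicaModeMap, so a replica whose node went down mid-restore keeps allScheduledReplicasIncluded false and blocks auto-detachment.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the longhorn-manager types; the
// real structs live in the longhorn/longhorn-manager repository.
type Replica struct {
	Name   string
	NodeID string // empty means the replica has not been scheduled
}

type EngineStatus struct {
	ReplicaModeMap map[string]string // replica name -> mode (e.g. RW/WO/ERR)
}

// allScheduledReplicasIncluded mirrors the check described above: every
// scheduled replica must be reported in the engine's ReplicaModeMap,
// otherwise the flag stays false and the restore volume is kept attached.
func allScheduledReplicasIncluded(replicas []Replica, e EngineStatus) bool {
	for _, r := range replicas {
		if r.NodeID == "" {
			continue // unscheduled replicas are ignored
		}
		if _, ok := e.ReplicaModeMap[r.Name]; !ok {
			return false // e.g. the replica's node went down mid-restore
		}
	}
	return true
}

func main() {
	replicas := []Replica{
		{Name: "r-1", NodeID: "node-1"},
		{Name: "r-2", NodeID: "node-2"},
		{Name: "r-3", NodeID: "node-3"}, // node-3 went down during the restore
	}
	e := EngineStatus{ReplicaModeMap: map[string]string{"r-1": "RW", "r-2": "RW"}}
	fmt.Println("auto-detach allowed:", allScheduledReplicasIncluded(replicas, e)) // false
}
```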

@yasker yasker added this to the v1.1.1 milestone Dec 15, 2020
@yasker yasker added the component/longhorn-manager, kind/bug, and area/volume-backup-restore labels Dec 15, 2020
@shuo-wu
Contributor

shuo-wu commented Dec 16, 2020

Since Longhorn is waiting for the failed replica to be rebuilt, auto-detachment is disabled. In other words, auto-detachment is applied only once all scheduled-but-failed replicas are cleaned up: https://github.com/longhorn/longhorn-manager/blob/eb98fc29d8ab37ec3c0650150a75d73ed22a4f93/controller/volume_controller.go#L1832

In brief, the workaround for this issue is:

  • Delete all failed replicas for the restore/DR volumes (one possible programmatic sketch follows this list).
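
One way to apply this workaround programmatically, sketched with client-go's dynamic client. This is an assumption-laden illustration: the Replica CRD API version and the spec.failedAt / spec.volumeName paths may differ between Longhorn versions, so verify them (or simply delete the failed replicas from the Longhorn UI) before relying on it.

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (path is an assumption).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Longhorn Replica custom resources; the API version may differ by release.
	gvr := schema.GroupVersionResource{Group: "longhorn.io", Version: "v1beta1", Resource: "replicas"}
	ctx := context.Background()

	list, err := client.Resource(gvr).Namespace("longhorn-system").List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, r := range list.Items {
		// Assumes the failure timestamp and owning volume are exposed at
		// spec.failedAt and spec.volumeName; verify these paths for your version.
		failedAt, _, _ := unstructured.NestedString(r.Object, "spec", "failedAt")
		volume, _, _ := unstructured.NestedString(r.Object, "spec", "volumeName")
		if failedAt != "" && volume == "restored-vol" {
			fmt.Println("deleting failed replica", r.GetName())
			if err := client.Resource(gvr).Namespace("longhorn-system").Delete(ctx, r.GetName(), metav1.DeleteOptions{}); err != nil {
				log.Fatal(err)
			}
		}
	}
}
```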

The following steps can also trigger this case:

  1. Create a restore volume
  2. During the restore process, disable scheduling on a node, then crash the replica process on that node.
  3. The restore volume will stay attached even after the other running replicas finish the restore.

Similar to the new replica replenishment delay logic, the auto-detachment delay is what we expect. Otherwise, Longhorn might need to start rebuilding immediately when users try to use restored volumes, which hurts the user experience. To avoid a long wait for restore volumes, we will apply an enhancement later: #1512

@innobead innobead modified the milestones: v1.1.1, v1.1.2 Apr 14, 2021
@innobead
Member

innobead commented Apr 29, 2021

Similar to the new replica replenishment delay logic, the auto-detachment delay is what we expect. Otherwise, Longhorn might need to start rebuilding immediately when users try to use restored volumes, which hurts the user experience. To avoid a long wait for restore volumes, we will apply an enhancement later: #1512

@shuo-wu
To confirm: in the end, after the delay, will auto-detachment be re-enabled so that the restored volume detaches successfully? If yes, #1512 should be good enough.

@shuo-wu
Contributor

shuo-wu commented Apr 29, 2021

Yes. But maybe we need to check this after the volume refactor: #2527

@innobead innobead modified the milestones: v1.1.2, v1.2.0 Apr 29, 2021
@innobead innobead modified the milestones: v1.2.0, v1.3.0 Aug 12, 2021
@innobead innobead modified the milestones: v1.3.0, v1.4.0 Mar 31, 2022
@innobead innobead modified the milestones: v1.4.0, v1.5.0 Nov 7, 2022
@innobead innobead added the priority/1 and area/volume-attach-detach labels Apr 6, 2023
@innobead innobead modified the milestones: v1.5.0, v1.6.0 May 3, 2023
@innobead innobead added the priority/0 and area/resilience labels and removed the priority/1 label Sep 14, 2023
@ejweber
Contributor

ejweber commented Oct 16, 2023

This appears to be a no-op due to #1512 (which ended up going in at about the same time as #2527).

Test 1:

Conditions:

  • master-head
  • allow-volume-creation-with-degraded-availability: true

Result:

  • Restore finishes.
  • Restore volume detaches.
  • Two replicas show as stopped but with "healthyAt" set.
  • One replica (on the failed node) shows as stopped but with "failedAt" set.
  • Degraded restore volume can be attached.
  • Degraded restore volume contains correct data.

This is working as expected: because allow-volume-creation-with-degraded-availability is enabled, the restore volume auto-detaches when the restore finishes even though it is degraded.

Test 2:

Conditions:

  • master-head
  • allow-volume-creation-with-degraded-availability: false

Initial result:

  • Restore finishes.
  • Restore volume does not detach.
  • Two replicas show as running.
  • One replica (on the failed node) shows as failed.
  • Degraded restore volume can be attached.
  • Degraded restore volume contains correct data.

Long-term result:

  • Time since replica failure exceeds replica-replenishment-wait-interval.
  • A new replica rebuilds for restore volume (there are enough disks in my cluster to allow this).
  • Restore volume detaches.
  • Three replicas show as stopped but with "healthyAt" set.
  • One replica (on the failed node) shows as stopped but with "failedAt" set.
  • Restore volume can be attached.
  • Restore volume contains correct data.
  • Failed replica can be deleted safely.

This is working as expected. As long as it is degraded, the restore volume cannot detach (or be used for a workload). We just have to wait for replica-replenishment-wait-interval for Longhorn to rebuild a new replica. Until then, it is waiting for the existing failed replica to come back online.
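
A hedged sketch of the timing being described, not the actual longhorn-manager implementation: the degraded restore volume waits for its failed replica until the failure is older than replica-replenishment-wait-interval, at which point a new replica is rebuilt and auto-detach can proceed.

```go
package main

import (
	"fmt"
	"time"
)

// shouldReplenish sketches the timing described above: a new replica is only
// rebuilt once the existing replica's failure is older than
// replica-replenishment-wait-interval; until then the degraded restore volume
// stays attached, waiting for the failed replica to come back.
func shouldReplenish(failedAt time.Time, waitInterval time.Duration, now time.Time) bool {
	return now.Sub(failedAt) >= waitInterval
}

func main() {
	failedAt := time.Now().Add(-10 * time.Minute) // replica failed 10 minutes ago
	waitInterval := 600 * time.Second             // the setting's documented default (600s)
	fmt.Println("rebuild a new replica now:", shouldReplenish(failedAt, waitInterval, time.Now()))
}
```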

@ejweber
Contributor

ejweber commented Oct 16, 2023

We probably should have a test like longhorn/longhorn-tests#1394 in order to verify this behavior. I'll work on one as the action-item for completing this ticket.

Check whether the following tests from #6061 already implement the desired behavior (probably not, because they don't appear to handle a scenario in which the failed replica cannot be immediately reused):

  • test_single_replica_restore_failure
  • test_rebuild_with_restoration

@longhorn-io-github-bot

longhorn-io-github-bot commented Oct 20, 2023

Pre Ready-For-Testing Checklist

@chriscchien chriscchien self-assigned this Nov 28, 2023
@chriscchien
Contributor

Verified as passing on longhorn master (longhorn bcd399) with the following test steps.

Precondition: restore a 3 GB volume from an actual S3 backupstore and stop a node that has a running replica while the restore is in progress.

  1. allow-volume-creation-with-degraded-availability: true
    • Volume detached after restoration
    • Data correct
  2. allow-volume-creation-with-degraded-availability: false
    • Volume degraded and not ready for a workload.
    • After starting the stopped node, the volume detached.
    • Data correct
