[BUG] Cannot detach the restored volume when a node goes down during restoring #2103
Comments
Since Longhorn is waiting for the failed replica to be rebuilt, the auto detachment is disabled. In other words, the auto detachment is applied only once all scheduled but failed replicas are cleaned up: https://github.com/longhorn/longhorn-manager/blob/eb98fc29d8ab37ec3c0650150a75d73ed22a4f93/controller/volume_controller.go#L1832

In brief, the workaround for this issue is:
The following steps can also trigger this case:
Similar to the new replica replenishment delay logic, the auto detachment delay is what we expect. Otherwise, Longhorn may need to start rebuilding immediately when users try to use restored volumes, which hurts the user experience. To avoid a long wait for restored volumes, we will apply an enhancement later: #1512
@shuo-wu
Yes. But maybe we need to check this after the volume refactor: #2527
This appears to be a no-op due to #1512 (which ended up going in at about the same time as #2527).

Test 1
Conditions:
Result:
This is working as expected. The restore volume should auto-detach when the restore finishes, even though it is degraded, because of the added setting.

Test 2
Conditions:
Initial result:
Long-term result:
This is working as expected. As long as it is degraded, the restore volume cannot detach (or be used for a workload). We just have to wait for replica-replenishment-wait-interval for Longhorn to rebuild a new replica. Until then, it is waiting for the existing failed replica to come back online.
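To make the waiting behavior described above concrete, here is a minimal Go sketch of that decision, using hypothetical names (failedReplica, shouldReplenishNow) rather than the real longhorn-manager types:

```go
package sketch

import "time"

// Sketch only: hypothetical types and helper, not the real
// longhorn-manager implementation.
type failedReplica struct {
	failedAt time.Time
	reusable bool // the replica may come back, e.g. when its node returns
}

// shouldReplenishNow decides whether to start rebuilding a brand-new replica
// now, or keep waiting (up to replica-replenishment-wait-interval) for the
// existing failed replica to come back online and be reused.
func shouldReplenishNow(r failedReplica, waitInterval time.Duration, now time.Time) bool {
	if !r.reusable {
		// Nothing to wait for; replenish a new replica immediately.
		return true
	}
	// Within the wait interval, prefer reusing the failed replica.
	return now.Sub(r.failedAt) >= waitInterval
}
```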
We probably should have a test like longhorn/longhorn-tests#1394 in order to verify this behavior. I'll work on one as the action item for completing this ticket. Check if the following changes from #6061 already implement the desired behavior (probably not, because it doesn't look like they handle a scenario in which the failed replica cannot be immediately reused).
Pre Ready-For-Testing Checklist
Verified pass on longhorn master. Precondition: restore a 3 GB volume from actual S3 and stop a node that has a replica running while the restore is in progress.
Describe the bug
Create a restore volume, restored-vol, from a backup. During the restoring process, if a node that contains one replica of restored-vol goes down, the volume finishes restoring but remains attached forever. The volume cannot be manually detached.

To Reproduce
Steps to reproduce the behavior:
1. Create a restore volume, restored-vol, from a 1.5GB backup.
2. While the restore is in progress, bring down a node that contains one replica of restored-vol.
3. restored-vol finishes restoring but never detaches. The volume cannot be manually detached.

Expected behavior
restored-vol is detached after the restore finishes.

Environment:
Additional context
Looks like we need to reconsider the logic at https://github.com/longhorn/longhorn-manager/blob/eb98fc29d8ab37ec3c0650150a75d73ed22a4f93/controller/volume_controller.go#L1854
When there is a replica that is not in e.Status.ReplicaModeMap, allScheduledReplicasIncluded is always set to false, which prevents the volume from detaching.
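For illustration, here is a minimal Go sketch of that check, using simplified stand-in types rather than the actual longhorn-manager structures; it shows why a scheduled replica that is missing from e.Status.ReplicaModeMap keeps allScheduledReplicasIncluded false and therefore blocks the detach:

```go
// Sketch only: simplified stand-in types, not the actual
// longhorn-manager CRD structures.
package sketch

type replica struct {
	name   string
	nodeID string // non-empty means the replica has been scheduled
}

type engineStatus struct {
	// replicaModeMap maps replica name -> mode for the replicas the
	// engine currently knows about; a replica whose node went down
	// disappears from this map.
	replicaModeMap map[string]string
}

// allScheduledReplicasIncluded reports whether every scheduled replica is
// present in the engine's replica mode map. A scheduled replica that failed
// because its node went down is missing from the map, so this stays false
// and the restored volume is never auto-detached.
func allScheduledReplicasIncluded(replicas []*replica, es engineStatus) bool {
	for _, r := range replicas {
		if r.nodeID == "" {
			continue // not scheduled yet; ignore
		}
		if _, ok := es.replicaModeMap[r.name]; !ok {
			return false
		}
	}
	return true
}
```

Under this kind of logic, a failed-but-still-scheduled replica has to be cleaned up or rebuilt before the restored volume can auto-detach, which matches the behavior reported in this issue.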