
Fix engine migration crash #2275

Merged: 4 commits into longhorn:master from the 6961-fix-migration-engine-crash branch on Nov 12, 2023

Conversation

@ejweber (Contributor) commented Oct 30, 2023

longhorn/longhorn#6961

Just be6aca6 is sufficient to prevent the engine from crashing under the circumstances described in longhorn/longhorn#6961 (comment). However, this commit on its own leads to orphaned data once the migration is complete (the replica we refuse to add to the engine becomes inactive without an active counterpart).

I experimented with making use of this newly inactive replica after the migration (to avoid unnecessary rebuilding), but ultimately decided that the migration logic refactor it required was outside the scope of the ticket.

The other commits are an attempt to ensure we don't orphan the newly inactive replica's directory by providing a path to cleaning it up.
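
Roughly, the guard described above amounts to the following minimal sketch. The stand-in types and the canAddReplica name are placeholders for illustration, not the actual diff in be6aca6; only the behavior (hold a newly running replica out of the engine while a migration is in progress) comes from this PR.

package main

import "fmt"

// Minimal stand-ins for the real longhorn-manager types; illustration only.
type VolumeSpec struct{ MigrationNodeID string }
type Volume struct{ Spec VolumeSpec }
type Replica struct{ Name string }

// canAddReplica sketches the guard: a replica that becomes running while a
// live migration is in progress is held out rather than hot-added to the
// engine (hot-adding it is what crashed the engine in longhorn/longhorn#6961).
func canAddReplica(v *Volume, r *Replica) bool {
	if v.Spec.MigrationNodeID != "" {
		fmt.Printf("Replica %s is running, but can't be added while migration is ongoing\n", r.Name)
		return false
	}
	return true
}

func main() {
	v := &Volume{Spec: VolumeSpec{MigrationNodeID: "eweber-v125-worker-e472db53-xjmzr"}}
	r := &Replica{Name: "pvc-3428a4f1-463c-4f27-a536-06050ef6e828-r-fd71a95d"}
	fmt.Println(canAddReplica(v, r)) // false: the replica is held out until the migration finishes
}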

I need more time to test this: for some reason, the download of the backing image to the uncordoned node is consistently failing with checksum issues.

@ejweber force-pushed the 6961-fix-migration-engine-crash branch from be6aca6 to c471422 on October 31, 2023 at 18:29
@ejweber (Contributor, Author) commented Oct 31, 2023

With this PR, the behavior in steps 12 and 13 from longhorn/longhorn#6961 (comment) changes as follows:

# The pod never experiences an I/O error.
eweber@laptop:~/longhorn> k get pod
NAME       READY   STATUS    RESTARTS   AGE
test-pod   1/1     Running   0          10m

eweber@laptop:~/longhorn> k logs test-pod
1+0 records in
1+0 records out
...

# The migration process ISN'T confused by the state of the waiting replica.
[longhorn-manager-jrzpp] time="2023-10-31T18:27:35Z" level=warning msg="Running replica pvc-3428a4f1-463c-4f27-a536-06050ef6e828-r-fd71a95d wasn't added to engine, will ignore it and continue migration" func="controller.(*VolumeController).prepareReplicasAndEngineForMigration" file="volume_controller.go:4000" accessMode=rwx controller=longhorn-volume frontend=blockdev migratable=true migrationEngine=pvc-3428a4f1-463c-4f27-a536-06050ef6e828-e-1 migrationNodeID=eweber-v125-worker-e472db53-xjmzr node=eweber-v125-worker-e472db53-9kz5b owner=eweber-v125-worker-e472db53-9kz5b shareEndpoint= shareState= state=attached volume=pvc-3428a4f1-463c-4f27-a536-06050ef6e828
[longhorn-manager-jrzpp] time="2023-10-31T18:27:35Z" level=warning msg="Replica is running, but can't be added while migration is ongoing" func="controller.(*VolumeController).openVolumeDependentResources" file="volume_controller.go:1763" accessMode=rwx controller=longhorn-volume frontend=blockdev migratable=true node=eweber-v125-worker-e472db53-9kz5b owner=eweber-v125-worker-e472db53-9kz5b replica=pvc-3428a4f1-463c-4f27-a536-06050ef6e828-r-fd71a95d shareEndpoint= shareState= state=attached volume=pvc-3428a4f1-463c-4f27-a536-06050ef6e828

When the manually created volume attachment is deleted (kubectl delete test-va-1), the migration completes without any further disruption. A replica rebuilds to replace the one being held out, and neither the pod nor the engine ever crashes.

# The waiting replica is cleaned up when migration is complete.
[longhorn-manager-jrzpp] time="2023-10-31T18:28:30Z" level=info msg="Cleaning up replica" func="controller.(*ReplicaController).syncReplica" file="replica_controller.go:321" controller=longhorn-replica node=eweber-v125-worker-e472db53-9kz5b nodeID=eweber-v125-worker-e472db53-9kz5b ownerID=eweber-v125-worker-e472db53-9kz5b replica=pvc-3428a4f1-463c-4f27-a536-06050ef6e828-r-fd71a95d
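
The cleanup decision behind this log entry can be sketched roughly as below. This is not the literal replica-controller code; the types, field names, and the needsCleanup helper are assumptions. The rule itself follows the PR description: an inactive replica whose data directory has no active counterpart on its node would otherwise be orphaned, so it is safe to clean up once the migration is done with it.

package main

import "fmt"

// Stand-in for the real longhorn-manager replica type; illustration only.
type Replica struct {
	Name          string
	NodeID        string
	Active        bool
	DataDirectory string
}

// needsCleanup sketches the decision: an inactive replica can have its data
// directory removed only if no active replica on the same node still uses
// that directory. Without this path, the replica held out of the migration
// would leave an orphaned directory behind.
func needsCleanup(r *Replica, replicasOnNode []*Replica) bool {
	if r.Active || r.NodeID == "" {
		return false
	}
	for _, other := range replicasOnNode {
		if other.Name != r.Name && other.Active && other.DataDirectory == r.DataDirectory {
			return false // an active counterpart still owns the on-disk data
		}
	}
	return true
}

func main() {
	heldOut := &Replica{Name: "r-fd71a95d", NodeID: "node-1", Active: false, DataDirectory: "pvc-example-dir"}
	fmt.Println(needsCleanup(heldOut, []*Replica{heldOut})) // true: no active counterpart, safe to clean up
}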

The good:

  • No I/O errors.
  • No confusing logs.
  • No orphaned replica directories. (Previous versions of the PR left them behind.)

The bad:

  • A replica must be completely rebuilt on the affected node. (Previously, the migration engine would crash, but we would not lose that replica.)

@ejweber force-pushed the 6961-fix-migration-engine-crash branch from c471422 to 5f98944 on October 31, 2023 at 18:45
Longhorn 6961

Signed-off-by: Eric Weber <eric.weber@suse.com>
Longhorn 6961

Signed-off-by: Eric Weber <eric.weber@suse.com>
@ejweber (Contributor, Author) commented Oct 31, 2023

I also backported and tested in Harvester. I was not able to reproduce I/O errors with the fix (though the stuck migration still persists).

@ejweber (Contributor, Author) commented Nov 1, 2023

Passed end-to-end tests: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/5195/.

@shuo-wu (Contributor) previously approved these changes on Nov 2, 2023 and left a comment:

In general LGTM

Review threads:
  • controller/volume_controller.go (resolved)
  • controller/replica_controller.go (outdated; resolved)
…erpart

Longhorn 6961

Signed-off-by: Eric Weber <eric.weber@suse.com>
@innobead self-requested a review on November 12, 2023 at 16:17
@@ -303,11 +303,18 @@ func (rc *ReplicaController) syncReplica(key string) (err error) {
return errors.Wrapf(err, "failed to cleanup the related replica instance before deleting replica %v", replica.Name)
}

rs, err := rc.ds.ListReplicasByNodeRO(replica.Spec.NodeID)
Member review comment on this hunk:

nit: rs can be initialized later in the following else if replica.Spec.NodeID != "" condition.
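
A compilable sketch of the structure the nit suggests; only ListReplicasByNodeRO and replica.Spec.NodeID are taken from the hunk and comment above, while the stand-in types, the cleanupSketch name, and the leading condition are placeholders.

package sketch

// Stand-ins so the sketch compiles on its own; the real types live in
// longhorn-manager and differ in detail.
type ReplicaSpec struct {
	NodeID string
	Active bool
}
type Replica struct {
	Name string
	Spec ReplicaSpec
}
type datastore interface {
	ListReplicasByNodeRO(nodeID string) ([]*Replica, error)
}
type ReplicaController struct{ ds datastore }

func (rc *ReplicaController) cleanupSketch(replica *Replica) error {
	if replica.Spec.Active {
		// ... handle an active replica's data directory directly ...
	} else if replica.Spec.NodeID != "" {
		// Per the nit: fetch the node's replicas only in the branch that
		// needs them, instead of unconditionally before the if/else chain.
		rs, err := rc.ds.ListReplicasByNodeRO(replica.Spec.NodeID)
		if err != nil {
			return err
		}
		_ = rs // ... decide whether an active counterpart still uses the directory ...
	}
	return nil
}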

@innobead merged commit d3673df into longhorn:master on Nov 12, 2023
5 checks passed
@innobead (Member) commented:

@mergify backport v1.5.x v1.4.x

mergify bot commented Nov 12, 2023

backport v1.5.x v1.4.x

✅ Backports have been created
