
Fix engine migration crash #2275

Merged: 4 commits into longhorn:master from the 6961-fix-migration-engine-crash branch on Nov 12, 2023

Conversation

@ejweber (Contributor) commented Oct 30, 2023

longhorn/longhorn#6961

Just be6aca6 is sufficient to prevent the engine from crashing under the circumstances described in longhorn/longhorn#6961 (comment). However, this commit on its own leads to orphaned data once the migration is complete (the replica we refuse to add to the engine becomes inactive without an active counterpart).

I experimented with making use of this newly inactive replica after the migration (to avoid unnecessary rebuilding), but ultimately decided that the migration logic refactor it required was outside the scope of the ticket.

The other commits are an attempt to ensure we don't orphan the newly inactive replica's directory by providing a path to cleaning it up.
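
Roughly, the guard described above amounts to the following minimal sketch. The stand-in types and the canAddReplica name are placeholders for illustration, not the actual diff in be6aca6; only the behavior (hold a newly running replica out of the engine while a migration is in progress) comes from this PR.

package main

import "fmt"

// Minimal stand-ins for the real longhorn-manager types; illustration only.
type VolumeSpec struct{ MigrationNodeID string }
type Volume struct{ Spec VolumeSpec }
type Replica struct{ Name string }

// canAddReplica sketches the guard: a replica that becomes running while a
// live migration is in progress is held out rather than hot-added to the
// engine (hot-adding it is what crashed the engine in longhorn/longhorn#6961).
func canAddReplica(v *Volume, r *Replica) bool {
	if v.Spec.MigrationNodeID != "" {
		fmt.Printf("Replica %s is running, but can't be added while migration is ongoing\n", r.Name)
		return false
	}
	return true
}

func main() {
	v := &Volume{Spec: VolumeSpec{MigrationNodeID: "eweber-v125-worker-e472db53-xjmzr"}}
	r := &Replica{Name: "pvc-3428a4f1-463c-4f27-a536-06050ef6e828-r-fd71a95d"}
	fmt.Println(canAddReplica(v, r)) // false: the replica is held out until the migration finishes
}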

I need more time to test this: for some reason, the download of the backing image to the uncordoned node is consistently failing with checksum issues.

@ejweber force-pushed the 6961-fix-migration-engine-crash branch from be6aca6 to c471422 on October 31, 2023 at 18:29
@ejweber (Contributor, Author) commented Oct 31, 2023

With this PR, the behavior in steps 12 and 13 from longhorn/longhorn#6961 (comment) changes as follows:

# The pod never experiences an I/O error.
eweber@laptop:~/longhorn> k get pod
NAME       READY   STATUS    RESTARTS   AGE
test-pod   1/1     Running   0          10m

eweber@laptop:~/longhorn> k logs test-pod
1+0 records in
1+0 records out
...

# The migration process ISN'T confused by the state of the waiting replica.
[longhorn-manager-jrzpp] time="2023-10-31T18:27:35Z" level=warning msg="Running replica pvc-3428a4f1-463c-4f27-a536-06050ef6e828-r-fd71a95d wasn't added to engine, will ignore it and continue migration" func="controller.(*VolumeController).prepareReplicasAndEngineForMigration" file="volume_controller.go:4000" accessMode=rwx controller=longhorn-volume frontend=blockdev migratable=true migrationEngine=pvc-3428a4f1-463c-4f27-a536-06050ef6e828-e-1 migrationNodeID=eweber-v125-worker-e472db53-xjmzr node=eweber-v125-worker-e472db53-9kz5b owner=eweber-v125-worker-e472db53-9kz5b shareEndpoint= shareState= state=attached volume=pvc-3428a4f1-463c-4f27-a536-06050ef6e828
[longhorn-manager-jrzpp] time="2023-10-31T18:27:35Z" level=warning msg="Replica is running, but can't be added while migration is ongoing" func="controller.(*VolumeController).openVolumeDependentResources" file="volume_controller.go:1763" accessMode=rwx controller=longhorn-volume frontend=blockdev migratable=true node=eweber-v125-worker-e472db53-9kz5b owner=eweber-v125-worker-e472db53-9kz5b replica=pvc-3428a4f1-463c-4f27-a536-06050ef6e828-r-fd71a95d shareEndpoint= shareState= state=attached volume=pvc-3428a4f1-463c-4f27-a536-06050ef6e828

When the manually created volume attachment is deleted (kubectl delete test-va-1), the migration completes without any further disruption. A replica rebuilds to replace the one being held out, and neither the pod nor the engine ever crashes.

# The waiting replica is cleaned up when migration is complete.
[longhorn-manager-jrzpp] time="2023-10-31T18:28:30Z" level=info msg="Cleaning up replica" func="controller.(*ReplicaController).syncReplica" file="replica_controller.go:321" controller=longhorn-replica node=eweber-v125-worker-e472db53-9kz5b nodeID=eweber-v125-worker-e472db53-9kz5b ownerID=eweber-v125-worker-e472db53-9kz5b replica=pvc-3428a4f1-463c-4f27-a536-06050ef6e828-r-fd71a95d
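
The cleanup decision behind this log entry can be sketched roughly as below. This is not the literal replica-controller code; the types, field names, and the needsCleanup helper are assumptions. The rule itself follows the PR description: an inactive replica whose data directory has no active counterpart on its node would otherwise be orphaned, so it is safe to clean up once the migration is done with it.

package main

import "fmt"

// Stand-in for the real longhorn-manager replica type; illustration only.
type Replica struct {
	Name          string
	NodeID        string
	Active        bool
	DataDirectory string
}

// needsCleanup sketches the decision: an inactive replica can have its data
// directory removed only if no active replica on the same node still uses
// that directory. Without this path, the replica held out of the migration
// would leave an orphaned directory behind.
func needsCleanup(r *Replica, replicasOnNode []*Replica) bool {
	if r.Active || r.NodeID == "" {
		return false
	}
	for _, other := range replicasOnNode {
		if other.Name != r.Name && other.Active && other.DataDirectory == r.DataDirectory {
			return false // an active counterpart still owns the on-disk data
		}
	}
	return true
}

func main() {
	heldOut := &Replica{Name: "r-fd71a95d", NodeID: "node-1", Active: false, DataDirectory: "pvc-example-dir"}
	fmt.Println(needsCleanup(heldOut, []*Replica{heldOut})) // true: no active counterpart, safe to clean up
}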

The good:

  • No I/O errors.
  • No confusing logs.
  • No orphaned replica directories. (Previous versions of the PR left them behind.)

The bad:

  • A replica must be completely rebuilt on the affected node. (Previously, the migration engine would crash, but we would not lose that replica.)

@ejweber force-pushed the 6961-fix-migration-engine-crash branch from c471422 to 5f98944 on October 31, 2023 at 18:45
Longhorn 6961

Signed-off-by: Eric Weber <eric.weber@suse.com>
Longhorn 6961

Signed-off-by: Eric Weber <eric.weber@suse.com>
@ejweber (Contributor, Author) commented Oct 31, 2023

I also backported and tested in Harvester. I was not able to reproduce I/O errors with the fix (though the stuck migration still persists).

@ejweber (Contributor, Author) commented Nov 1, 2023

Passed end-to-end tests: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/5195/.

@shuo-wu (Contributor) previously approved these changes on Nov 2, 2023 and left a comment:

In general LGTM

Review threads:
  • controller/volume_controller.go (resolved)
  • controller/replica_controller.go (outdated; resolved)
…erpart

Longhorn 6961

Signed-off-by: Eric Weber <eric.weber@suse.com>
@innobead self-requested a review on November 12, 2023 at 16:17
@@ -303,11 +303,18 @@ func (rc *ReplicaController) syncReplica(key string) (err error) {
return errors.Wrapf(err, "failed to cleanup the related replica instance before deleting replica %v", replica.Name)
}

rs, err := rc.ds.ListReplicasByNodeRO(replica.Spec.NodeID)
Member review comment on this hunk:

nit: rs can be initialized later in the following else if replica.Spec.NodeID != "" condition.
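
A compilable sketch of the structure the nit suggests; only ListReplicasByNodeRO and replica.Spec.NodeID are taken from the hunk and comment above, while the stand-in types, the cleanupSketch name, and the leading condition are placeholders.

package sketch

// Stand-ins so the sketch compiles on its own; the real types live in
// longhorn-manager and differ in detail.
type ReplicaSpec struct {
	NodeID string
	Active bool
}
type Replica struct {
	Name string
	Spec ReplicaSpec
}
type datastore interface {
	ListReplicasByNodeRO(nodeID string) ([]*Replica, error)
}
type ReplicaController struct{ ds datastore }

func (rc *ReplicaController) cleanupSketch(replica *Replica) error {
	if replica.Spec.Active {
		// ... handle an active replica's data directory directly ...
	} else if replica.Spec.NodeID != "" {
		// Per the nit: fetch the node's replicas only in the branch that
		// needs them, instead of unconditionally before the if/else chain.
		rs, err := rc.ds.ListReplicasByNodeRO(replica.Spec.NodeID)
		if err != nil {
			return err
		}
		_ = rs // ... decide whether an active counterpart still uses the directory ...
	}
	return nil
}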

@innobead merged commit d3673df into longhorn:master on Nov 12, 2023
5 checks passed
@innobead (Member) commented:

@mergify backport v1.5.x v1.4.x

mergify bot commented Nov 12, 2023

backport v1.5.x v1.4.x

✅ Backports have been created
