Fix engine migration crash #2275
Conversation
Force-pushed from be6aca6 to c471422.
With this PR, I observed the following changes in behavior at steps 12 and 13 of longhorn/longhorn#6961 (comment):
When the manually created volume attachment is deleted (
The good:
The bad:
Force-pushed from c471422 to 5f98944.
Longhorn 6961 Signed-off-by: Eric Weber <eric.weber@suse.com>
Force-pushed from 5f98944 to ffc8a93.
I also backported the fix and tested it in Harvester. I was not able to reproduce the I/O errors with the fix applied (though the stuck migration still persists).
Passed end-to-end tests: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/5195/.
In general LGTM
…erpart Longhorn 6961 Signed-off-by: Eric Weber <eric.weber@suse.com>
Force-pushed from ffc8a93 to 5326131.
```diff
@@ -303,11 +303,18 @@ func (rc *ReplicaController) syncReplica(key string) (err error) {
 		return errors.Wrapf(err, "failed to cleanup the related replica instance before deleting replica %v", replica.Name)
 	}

+	rs, err := rc.ds.ListReplicasByNodeRO(replica.Spec.NodeID)
```
nit: `rs` can be initialized later, inside the following `else if replica.Spec.NodeID != ""` condition.
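As a minimal sketch of that suggestion (the surrounding branch structure, the `rc.controllerID` stand-in, and the code consuming `rs` are assumptions; only the `ListReplicasByNodeRO` call and the `replica.Spec.NodeID` check come from the diff above):

```go
// Sketch only: move the initialization of rs out of the top of the block
// and into the branch that actually consumes it.
if replica.Spec.NodeID == rc.controllerID { // stand-in for the preceding branch
	// ... (assumed) handling for replicas owned by this controller ...
} else if replica.Spec.NodeID != "" {
	// Initialize rs here, per the review comment, so other paths through
	// syncReplica skip the list call entirely.
	rs, err := rc.ds.ListReplicasByNodeRO(replica.Spec.NodeID)
	if err != nil {
		return errors.Wrapf(err, "failed to list replicas on node %v", replica.Spec.NodeID)
	}
	_ = rs // ... (assumed) logic that uses rs ...
}
```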
@mergify backport v1.5.x v1.4.x
✅ Backports have been created
longhorn/longhorn#6961
Just be6aca6 is sufficient to prevent the engine from crashing under the circumstances described in longhorn/longhorn#6961 (comment). However, this commit alone leads to orphaned data once the migration is complete, because the replica we refuse to add to the engine becomes inactive without an active counterpart.
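Purely as an illustration of that guard (this is not the PR's actual code; `candidates` and the `replicaAddress` callback are hypothetical, though `Spec.Active` and `Spec.ReplicaAddressMap` are real longhorn-manager fields):

```go
// Hypothetical sketch: during a live migration, refuse to wire an inactive
// replica into the engine rather than letting the engine crash on it.
func addReplicasToEngine(engine *longhorn.Engine, candidates map[string]*longhorn.Replica,
	replicaAddress func(*longhorn.Replica) string) {
	for name, r := range candidates {
		if !r.Spec.Active {
			// Skipping prevents the crash from longhorn/longhorn#6961, but
			// after the migration this replica stays inactive with no active
			// counterpart, so its data directory is orphaned.
			continue
		}
		engine.Spec.ReplicaAddressMap[name] = replicaAddress(r)
	}
}
```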
I experimented with a path to making use of this newly inactive replica after the migration (to avoid unnecessary rebuilding), but I ultimately decided that the migration logic refactor this required was outside the scope of the ticket.
The other commits attempt to ensure we don't orphan the newly inactive replica's directory by providing a path to clean it up.
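One plausible shape for that cleanup path, strictly as a sketch (the counterpart check and the `deleteReplica` callback are assumptions, not the PR's actual diff):

```go
// Hypothetical sketch: an inactive replica whose volume has no remaining
// active counterpart can be deleted, letting the normal replica deletion
// flow reclaim its on-disk directory instead of orphaning it.
func cleanupOrphanedInactiveReplica(replica *longhorn.Replica,
	replicasOnNode []*longhorn.Replica, deleteReplica func(name string) error) error {
	if replica.Spec.Active {
		return nil // active replicas are not orphaned by migration
	}
	for _, r := range replicasOnNode {
		if r.Name != replica.Name && r.Spec.Active && r.Spec.VolumeName == replica.Spec.VolumeName {
			return nil // an active counterpart still owns this data path
		}
	}
	return deleteReplica(replica.Name)
}
```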
I need more time to test this: for some reason, the download of the backing image to the uncordoned node consistently fails with checksum errors.