-
Notifications
You must be signed in to change notification settings - Fork 147
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(backup): set backup error if backup target invalid #2086
Conversation
Related to (and possibly fixes) longhorn/longhorn#6358. cc @PhanLe1010 |
Yeah, this is one way to fix it. I am fine with either this approach or longhorn/longhorn#6358 (comment). Leaving it to @mantissahz and @ejweber to make the call :D |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we do this, the backup can never complete, even if the backup target becomes valid later.
longhorn-manager/controller/backup_controller.go
Lines 347 to 354 in ce99ddf
// Perform backup snapshot to the remote backup target | |
// If the Backup CR is created by the user/API layer (spec.snapshotName != "") and has not been synced (status.lastSyncedAt == ""), | |
// it means creating a backup from a volume snapshot is required. | |
// Hence the source of truth is the engine/replica and the controller needs to sync the status with it. | |
// Otherwise, the Backup CR is created by the backup volume controller, which means the backup already | |
// exists in the remote backup target before the CR creation. | |
// What the controller needs to do for this case is retrieve the info from the remote backup target. | |
if backup.Status.LastSyncedAt.IsZero() && backup.Spec.SnapshotName != "" { |
That was fine for me in #6358 because the snapshot didn't exist and the situation was not recoverable.
Do we need to distinguish between recoverable and unrecoverable situations?
It is fine to me |
I am taking it back. May be this backup target not setup error should be recoverable and be retried |
@PhanLe1010 Do you mean that we should have some retries for the backup when encountering a backup target is invalid to check if the backup target is recoverable? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR itself LGTM (without considering longhorn/longhorn#6358 (comment)).
Without this fix, the backup CR can be never handled. The reconcile loop will directly error out rather than updating the CR state to Error
.
When starting a backup monitor, it will start to take a snapshot backup procedure and it will be failed if the s3 backup target is invalid (nfs backup target will be not be failed because of timeout and test the backup when it is in progress.) Set the backup state as `Error` if starting the backup procedure returns an error. Ref: 1249 Signed-off-by: James Lu <james.lu@suse.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving this and will test it as the fix to longhorn/longhorn#6358 when it merges.
I am still thinking it would be better to have the ability to retry if the backup target becomes valid. But to @shuo-wu's point, without making this change, we can never know a backup is in error and we will keep silently trying over and over to do an impossible thing (this is the underlying complaint in longhorn/longhorn#6358). Maybe in the future we can consider an addition to backup monitor reconcile loop that can recognize if a backup's error is related to an invalid backup target and retrying if the backup target is now valid.
@mergify backport v1.5.x v1.4.x v1.3.x |
✅ Backports have been created
|
Agreed with this point. @ejweber please create a ticket for this. This should be part of how resilient the system/function we want to make. Any operation should be considered with retry, backoff, etc if suitable/could. Surely, the audit should be clear during the resilience period. |
When starting a backup monitor, it will start to take a snapshot backup procedure and it will be failed if the s3 backup target is invalid (nfs backup target will be not be failed because of timeout and test the backup when it is in progress.)
Set the backup state as
Error
if starting the backup procedure returns an error.Ref: longhorn/longhorn#1249