
fix(backup): set backup error if backup target invalid #2086

Merged
1 commit merged into longhorn:master on Jul 27, 2023

Conversation

@mantissahz (Contributor)

When starting a backup monitor, the controller starts the snapshot backup procedure, which fails if the S3 backup target is invalid. (An NFS backup target will not fail this way because of the timeout; the backup is instead tested while it is in progress.)
Set the backup state to Error if starting the backup procedure returns an error.

Ref: longhorn/longhorn#1249
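
For context, a minimal, self-contained Go sketch of the behavior this PR describes. The real fix lives in longhorn-manager's backup controller; all of the types and helper names below are stand-ins for illustration, not the upstream code.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in types for the Longhorn Backup CRD; the real structs live in
// longhorn-manager. Everything here is an assumption for illustration only.
type BackupState string

const BackupStateError BackupState = "Error"

type BackupStatus struct {
	State BackupState
	Error string
}

type Backup struct {
	Name   string
	Status BackupStatus
}

// startBackupMonitor stands in for the controller's monitor startup,
// which fails fast when the S3 backup target is invalid.
func startBackupMonitor(b *Backup) error {
	return errors.New("backup target is invalid")
}

// reconcile shows the shape of the fix: propagate a monitor-startup
// failure into the CR status instead of only returning the error.
func reconcile(b *Backup) {
	if err := startBackupMonitor(b); err != nil {
		b.Status.State = BackupStateError
		b.Status.Error = err.Error()
	}
}

func main() {
	b := &Backup{Name: "backup-1"}
	reconcile(b)
	fmt.Printf("%s: state=%s error=%q\n", b.Name, b.Status.State, b.Status.Error)
}
```

Running it prints `backup-1: state=Error error="backup target is invalid"`; the point is that the failure becomes visible on the CR instead of being lost inside the reconcile loop.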

@ejweber (Contributor) commented Jul 21, 2023

Related to (and possibly fixes) longhorn/longhorn#6358.

cc @PhanLe1010

@PhanLe1010 (Contributor) commented Jul 21, 2023

Yeah, this is one way to fix it. I am fine with either this approach or longhorn/longhorn#6358 (comment).

Leaving it to @mantissahz and @ejweber to make the call :D

@ejweber (Contributor) left a comment

I think if we do this, the backup can never complete, even if the backup target becomes valid later.

```go
// Perform backup snapshot to the remote backup target
// If the Backup CR is created by the user/API layer (spec.snapshotName != "") and has not been synced (status.lastSyncedAt == ""),
// it means creating a backup from a volume snapshot is required.
// Hence the source of truth is the engine/replica and the controller needs to sync the status with it.
// Otherwise, the Backup CR is created by the backup volume controller, which means the backup already
// exists in the remote backup target before the CR creation.
// What the controller needs to do for this case is retrieve the info from the remote backup target.
if backup.Status.LastSyncedAt.IsZero() && backup.Spec.SnapshotName != "" {
```

That was fine for me in #6358 because the snapshot didn't exist and the situation was not recoverable.

Do we need to distinguish between recoverable and unrecoverable situations?
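
As a hypothetical sketch of that distinction, one could classify failures before deciding whether to leave the CR in a terminal Error state. The sentinel errors and the classifier below are illustrations, not longhorn-manager APIs.

```go
package backup

import "errors"

// Sentinel errors standing in for the two failure modes discussed here.
var (
	ErrTargetInvalid   = errors.New("backup target is invalid")  // may heal once the target is fixed
	ErrSnapshotMissing = errors.New("snapshot no longer exists") // can never succeed again
)

// IsRecoverable reports whether a failed backup is worth retrying:
// an invalid target can become valid later, but a deleted snapshot cannot return.
func IsRecoverable(err error) bool {
	return errors.Is(err, ErrTargetInvalid)
}
```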

@PhanLe1010 (Contributor)

> I think if we do this, the backup can never complete, even if the backup target becomes valid later.

It is fine with me.

@PhanLe1010 (Contributor) commented Jul 21, 2023

I am taking it back. Maybe this backup-target-not-set-up error should be recoverable and be retried.

@mantissahz (Contributor, Author)

@PhanLe1010 Do you mean that we should retry the backup when the backup target is invalid, to check whether the backup target has become valid again?

@shuo-wu (Contributor) left a comment

The PR itself LGTM (without considering longhorn/longhorn#6358 (comment)).

Without this fix, the backup CR can never be handled. The reconcile loop will directly error out rather than updating the CR state to Error.

Commit message:

> When starting a backup monitor, the controller starts the snapshot backup procedure, which fails if the S3 backup target is invalid. (An NFS backup target will not fail this way because of the timeout; the backup is instead tested while it is in progress.) Set the backup state to `Error` if starting the backup procedure returns an error.
>
> Ref: longhorn/longhorn#1249
>
> Signed-off-by: James Lu <james.lu@suse.com>
@ejweber (Contributor) left a comment

Approving this, and I will test it as the fix to longhorn/longhorn#6358 when it merges.

I still think it would be better to have the ability to retry if the backup target becomes valid. But to @shuo-wu's point, without this change we can never know a backup is in error, and we will keep silently trying, over and over, to do an impossible thing (this is the underlying complaint in longhorn/longhorn#6358). Maybe in the future we can add something to the backup monitor reconcile loop that recognizes when a backup's error is related to an invalid backup target and retries once the backup target is valid.

@innobead self-requested a review on July 27, 2023.
@innobead merged commit e278ba0 into longhorn:master on Jul 27, 2023.
4 checks passed
@innobead (Member)

@mergify backport v1.5.x v1.4.x v1.3.x

@mergify bot commented Jul 27, 2023

> backport v1.5.x v1.4.x v1.3.x

✅ Backports have been created

@innobead (Member)

> Approving this, and I will test it as the fix to longhorn/longhorn#6358 when it merges.
>
> I still think it would be better to have the ability to retry if the backup target becomes valid. But to @shuo-wu's point, without this change we can never know a backup is in error, and we will keep silently trying, over and over, to do an impossible thing (this is the underlying complaint in longhorn/longhorn#6358). Maybe in the future we can add something to the backup monitor reconcile loop that recognizes when a backup's error is related to an invalid backup target and retries once the backup target is valid.

Agreed with this point. @ejweber, please create a ticket for this. It should be part of how resilient we want to make the system: any operation should be considered for retry, backoff, etc. where suitable. Of course, the audit trail should remain clear throughout the retry period.
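
As a rough illustration of the retry/backoff idea, a generic sketch (not longhorn-manager code) could look like this:

```go
package backup

import "time"

// RetryWithBackoff retries op with exponential backoff, capped at maxDelay.
// A reconcile loop could use something like this for recoverable failures
// (e.g. an invalid backup target that may be fixed later) while still
// recording each failure on the Backup CR so the history stays auditable.
func RetryWithBackoff(op func() error, attempts int, base, maxDelay time.Duration) error {
	delay := base
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		time.Sleep(delay)
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
	return err // last error; the caller would surface it in the Backup CR status
}
```

In practice, Kubernetes controllers usually get similar behavior by requeueing through a rate-limited workqueue rather than sleeping inline; the sketch only shows that recoverable failures deserve bounded, recorded retries.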
