oadp-1.3: OADP-4265 Mark InProgress backup/restore as failed upon requeuing #315

kaovilai · 2024-06-11T20:58:41Z

Signed-off-by: Tiger Kaovilai tkaovila@redhat.com


Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

remove uuid, return err to requeue instead of requeue: true

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

kaovilai · 2024-06-11T21:00:45Z

from prior feedback need to do finalizer (operations in 1.3) controller also

kaovilai · 2024-06-11T21:35:23Z

from prior feedback need to do finalizer (operations in 1.3) controller also

For restore operations controller we're covered by this returning error:

velero/pkg/controller/restore_operations_controller.go

Line 194 in c06b138

return ctrl.Result{}, errors.Wrap(err, "error updating Restore")

backup operations controller:

velero/pkg/controller/backup_operations_controller.go

Line 207 in c06b138

return ctrl.Result{}, errors.Wrap(err, "error updating Backup")
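
A minimal sketch of why returning the error is enough here: controller-runtime requeues a request whenever Reconcile returns a non-nil error. The `Result` type, the `runUntilDone` loop, and `flakyUpdate` below are hypothetical stand-ins for illustration, not controller-runtime or Velero code, and the real work queue also applies rate-limited backoff that this toy loop omits.

```go
package main

import (
	"errors"
	"fmt"
)

// Result is a hypothetical stand-in for ctrl.Result.
type Result struct{ Requeue bool }

// reconcileFunc mirrors the shape of a Reconcile method for this sketch.
type reconcileFunc func(attempt int) (Result, error)

// runUntilDone is a toy work loop: it retries whenever reconcile returns an
// error or asks for a requeue, and returns the number of attempts made.
func runUntilDone(r reconcileFunc, maxAttempts int) int {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		res, err := r(attempt)
		if err == nil && !res.Requeue {
			return attempt
		}
	}
	return maxAttempts
}

// flakyUpdate fails on the first two attempts, simulating an API server
// that is briefly unreachable, then succeeds.
func flakyUpdate(attempt int) (Result, error) {
	if attempt < 3 {
		return Result{}, fmt.Errorf("error updating Backup: %w", errors.New("connection refused"))
	}
	return Result{}, nil
}

func main() {
	fmt.Println("attempts:", runUntilDone(flakyUpdate, 5))
}
```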

sseago · 2024-06-11T21:47:27Z

Yes, I think we're fine leaving the backup/restore_operations_controller returns alone, since those already requeue. I don't think we have restore finalizer in 1.3, so we just need backup, restore, and backup finalizer.

kaovilai · 2024-06-11T21:59:18Z

If patch fails on finalizer controller we wanna retry patch as fail? It's kinda difficult there since it's currently using defer func.

I would have to break the patch call out of defer to return reconciler err on patch fail. Is that ok with you?

sseago

A couple of minor comments on error messages and the change of variable name. The bigger issue is we need the change for the backup finalizer controller as well (no restore finalizer controller in Velero 1.12, so that's not a concern here).

sseago · 2024-06-11T22:13:42Z

pkg/controller/backup_controller.go

+		log.Debug("Backup has in progress status from prior reconcile, marking it as failed")
+		failedCopy := original.DeepCopy()
+		failedCopy.Status.Phase = velerov1api.BackupPhaseFailed
+		failedCopy.Status.FailureReason = "Backup from previous reconcile still in progress"


We may want to suggest an APIServer failure here.
"Backup from previous reconcile still in progress. The API Server may have been down."

sseago · 2024-06-11T22:14:26Z

pkg/controller/backup_controller.go

@@ -249,7 +263,6 @@ func (b *backupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctr
 		request.Status.Phase = velerov1api.BackupPhaseInProgress
 		request.Status.StartTimestamp = &metav1.Time{Time: b.clock.Now()}
 	}
-


To minimize changes from upstream, since this is in our fork, let's not include whitespace changes like this.

sseago · 2024-06-11T22:16:41Z

pkg/controller/restore_controller.go

+		log.Debug("Restore has in progress status from prior reconcile, marking it as failed")
+		failedCopy := original.DeepCopy()
+		failedCopy.Status.Phase = api.RestorePhaseFailed
+		failedCopy.Status.FailureReason = "Restore from previous reconcile still in progress"


We may want to suggest an APIServer failure here.
"Restore from previous reconcile still in progress. The API Server may have been down."

sseago · 2024-06-11T22:17:41Z

pkg/controller/restore_controller.go

@@ -162,8 +162,8 @@ func (r *restoreReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ct
 	// the controller.
 	log := r.logger.WithField("Restore", req.NamespacedName.String())

-	restore := &api.Restore{}
-	err := r.kbClient.Get(ctx, client.ObjectKey{Namespace: req.Namespace, Name: req.Name}, restore)
+	original := &api.Restore{}


Is the variable name change necessary here? This makes the diff larger and increases the possibility of rebase conflicts, since we're carrying this commit in our fork.

I did this so restore_controller and backup_controller have the same var name pattern, that's all.

Not necessary, I agree; it just makes the logic more pastable across both.

@kaovilai I figured that's why you did that. I reverted that part here in a later commit since we're carrying the commit in our fork right now, and it removed a lot of lines from the diff, making rebase conflicts less likely. Does that seem reasonable to you?

yes. That's reasonable

sseago · 2024-06-11T23:00:10Z

If patch fails on finalizer controller we wanna retry patch as fail? It's kinda difficult there since it's currently using defer func.

I would have to break the patch call out of defer to return reconciler err on patch fail. Is that ok with you?

Hmm. I think we need to do this in some way. If we can't do it with defer, then we'll need to eliminate the defer call and include this with all return statements.
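
One possible way to keep the `defer` and still surface a patch failure is a named return value that the deferred function can overwrite. This is only a sketch under assumptions: `patchFunc`, `failingPatch`, `okPatch`, and `finalize` are hypothetical names, and the approach actually merged in this PR ("Only run patch once for all backup finalizer return scenarios") may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// patchFunc is a hypothetical stand-in for the status patch call.
type patchFunc func() error

// failingPatch simulates a patch that cannot reach the API server.
func failingPatch() error { return errors.New("connection refused") }

// okPatch simulates a successful patch.
func okPatch() error { return nil }

// finalize uses a named return value so the deferred patch can report its
// own failure to the caller, which would cause the request to be requeued.
func finalize(patch patchFunc) (err error) {
	defer func() {
		// Only overwrite err if the finalize steps themselves succeeded.
		if perr := patch(); perr != nil && err == nil {
			err = fmt.Errorf("error updating Backup: %w", perr)
		}
	}()
	// Finalize steps would run here; they can be repeated safely on requeue.
	return nil
}

func main() {
	fmt.Println(finalize(failingPatch))
	fmt.Println(finalize(okPatch))
}
```

The trade-off is that a named return makes the deferred write easy to miss when reading the function, which is one reason to prefer an explicit patch call before each return instead.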


kaovilai · 2024-06-16T00:06:26Z

lgtm

weshayutin · 2024-06-17T17:57:41Z

ok.. testing update:

Without the patch, the backup stayed in progress. The Velero server was restarted while the DPA was being updated.

backup: westest-vsphere-apidown-1

status:
  completionTimestamp: "2024-06-17T17:25:58Z"
  expiration: "2024-07-17T17:12:21Z"
  failureReason: found a backup with status "InProgress" during the server starting,
    mark it as "Failed"

On 1.3.0, the CSV has to be changed to pick up the test image:

oc get csv oadp-operator.v1.3.0 -o yaml | grep sseago
                  value: quay.io/sseago/velero:1.3-requeue

Once patched, I initiated a second backup and then took down the api server for roughly 1 minute:

backup: westest-vsphere-apidown-2

status:
  expiration: "2024-07-17T17:38:01Z"
  failureReason: Backup from previous reconcile still in progress. The API Server
    may have been down.
  formatVersion: 1.1.0
  phase: Failed
  startTimestamp: "2024-06-17T17:38:01Z"
  version: 1
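
The `failureReason` in the second status block is the message written by the new in-progress check. A minimal sketch of that check follows, using hypothetical local types (`Backup`, `BackupStatus`, `Phase`) in place of Velero's `velerov1api` structs; `failIfStale` is an illustrative helper name, not a function from the PR.

```go
package main

import "fmt"

// Phase and the types below are hypothetical stand-ins for velerov1api types.
type Phase string

const (
	PhaseNew        Phase = "New"
	PhaseInProgress Phase = "InProgress"
	PhaseFailed     Phase = "Failed"
)

type BackupStatus struct {
	Phase         Phase
	FailureReason string
}

type Backup struct {
	Name   string
	Status BackupStatus
}

// DeepCopy returns an independent copy; the struct has no pointer fields.
func (b *Backup) DeepCopy() *Backup {
	c := *b
	return &c
}

// failIfStale returns a failed copy when the backup is already InProgress at
// the start of a reconcile, which means a prior reconcile never finished.
// It returns nil when the backup should be processed normally.
func failIfStale(original *Backup) *Backup {
	if original.Status.Phase != PhaseInProgress {
		return nil
	}
	failedCopy := original.DeepCopy()
	failedCopy.Status.Phase = PhaseFailed
	failedCopy.Status.FailureReason = "Backup from previous reconcile still in progress. The API Server may have been down."
	return failedCopy
}

func main() {
	stale := &Backup{Name: "westest-vsphere-apidown-2", Status: BackupStatus{Phase: PhaseInProgress}}
	if failed := failIfStale(stale); failed != nil {
		fmt.Println(failed.Status.Phase, "-", failed.Status.FailureReason)
	}
}
```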

weshayutin

/LGTM

sseago

Reviewing to remove my prior "changes requested", but don't count it as an ack, since my own changes are in here as well.

kaovilai · 2024-06-17T21:12:35Z

Just noting that this may never upstream based on comments at vmware-tanzu#7863 (comment)

openshift-ci · 2024-06-17T21:16:42Z

New changes are detected. LGTM label has been removed.


kaovilai · 2024-06-17T23:25:32Z

retest after #320

kaovilai · 2024-06-17T23:35:53Z

/retest

openshift-ci · 2024-06-18T13:51:02Z

@kaovilai: all tests passed!


openshift-ci · 2024-06-18T13:55:45Z

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: kaovilai, sseago, weshayutin


kaovilai · 2024-06-25T18:58:28Z

follow on bugfix: #324

…330)

* oadp-1.4: OADP-3227: Mark InProgress backup/restore as failed upon requeuing (#315)
  * Mark InProgress backup/restore as failed upon requeuing
    Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
    remove uuid, return err to requeue instead of requeue: true
    Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
  * cleanup to minimize diff from upstream
    Signed-off-by: Scott Seago <sseago@redhat.com>
  * error message update
    Signed-off-by: Scott Seago <sseago@redhat.com>
  * requeue on finalize status update. Unlike the InProgress transition, there's no need to fail here, since the Finalize steps can be repeated.
  * Only run patch once for all backup finalizer return scenarios
    Signed-off-by: Scott Seago <sseago@redhat.com>
  ---------
  Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
  Signed-off-by: Scott Seago <sseago@redhat.com>
  Co-authored-by: Scott Seago <sseago@redhat.com>
* oadp-1.4: OADP-3227: Reconcile To Fail: Add backup/restore trackers (#324)
  * OADP-4265: Reconcile To Fail: Add backup/restore trackers
    Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
  * Apply suggestions from code review: backupTracker
  * Address restoreTracker feedback
    Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
  * s/delete from/add to/ in the comment
  * unit test fix
    Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
  * backup_controller unit test
    Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
  * restore_controller unit test
    Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
  * `make update`
    Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
  * mock patch to fail failure due to connection refused
    Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
  ---------
  Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
* regenerate mocks
  Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

---------

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Signed-off-by: Scott Seago <sseago@redhat.com>
Co-authored-by: Scott Seago <sseago@redhat.com>


Mark InProgress backup/restore as failed upon requeuing

c06b138

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

remove uuid, return err to requeue instead of requeue: true

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

kaovilai changed the title ~~Mark InProgress backup/restore as failed upon requeuing~~ oadp-1.3: Mark InProgress backup/restore as failed upon requeuing Jun 11, 2024

sseago requested changes Jun 11, 2024


sseago added 3 commits June 12, 2024 11:52

cleanup to minimize diff from upstream

dfb6fb1

Signed-off-by: Scott Seago <sseago@redhat.com>

error message update

212ccfd

Signed-off-by: Scott Seago <sseago@redhat.com>

requeue on finalize status update.

d212b4f

Unlike the InProgress transition, there's no need to fail here, since the Finalize steps can be repeated.

weshayutin changed the title ~~oadp-1.3: Mark InProgress backup/restore as failed upon requeuing~~ oadp-1.3: OADP-4265 Mark InProgress backup/restore as failed upon requeuing Jun 12, 2024

Only run patch once for all backup finalizer return scenarios

a51ef63

Signed-off-by: Scott Seago <sseago@redhat.com>

mateusoliveira43 reviewed Jun 13, 2024

pkg/controller/backup_finalizer_controller.go (resolved)

mateusoliveira43 reviewed Jun 13, 2024

pkg/controller/backup_finalizer_controller.go (resolved)

weshayutin approved these changes Jun 17, 2024


openshift-ci bot assigned weshayutin Jun 17, 2024

openshift-ci bot added the lgtm label Jun 17, 2024

sseago approved these changes Jun 17, 2024


openshift-ci bot removed the lgtm label Jun 17, 2024

kaovilai force-pushed the requeue&fail-oadp-1.3 branch from 07bab34 to dd74100 on June 17, 2024 21:32

sseago reviewed Jun 17, 2024

Makefile.prow (outdated, resolved)

kaovilai force-pushed the requeue&fail-oadp-1.3 branch from dd74100 to a51ef63 on June 17, 2024 23:24

weshayutin added the approved and lgtm labels Jun 18, 2024

openshift-merge-bot bot merged commit 4b5cf07 into openshift:oadp-1.3 Jun 18, 2024
3 checks passed

kaovilai mentioned this pull request Jun 21, 2024

[oadp-1.3] bump kubevirt to 0.6.2 openshift/oadp-operator#1436

Merged

kaovilai mentioned this pull request Jul 22, 2024

oadp-1.4: OADP-3227: Reconcile to fail on restore stuck in-progress #330

Merged
