fix(recurring-job): avoid intermediate volume detachment during backup #2629

c3y1huang · 2024-02-22T06:36:08Z

Which issue(s) this PR fixes:

What this PR does / why we need it:

This was initially created to resolve the issue in longhorn/longhorn#7937, and @PhanLe1010 proposed an alternative solution to tackle the problem.

However, Longhorn could still benefit by avoiding the unnecessary volume detachment during the recurring job backup. Currently, the recurring job runs the snapshot followed by the backup, resulting in an intermediate volume detachment.

Special notes for your reviewer:

None

Additional documentation or context

longhorn/longhorn-7937 Signed-off-by: Chin-Ya Huang <chin-ya.huang@suse.com>

Signed-off-by: Chin-Ya Huang <chin-ya.huang@suse.com>

PhanLe1010 · 2024-02-23T00:34:00Z

Thank you for the idea! I have some feedbacks as:

We identified the root cause and fix the root cause. The test is passing 100/100 times now. So the benefit of this approach has become limited as the root cause has been fixed.
This approach might hide the potential races that should be been caught by e2e tests. Then other clients other than the recurring job might still hit the races.
If we are creating the VA ticket for recurring jobs there are a few concerns:
1. This PR currently set the VA ticket type as longhorn.AttacherTypeBackupController. The VA ticket is not created by the backup controller.
2. If we change the PR to use a different type AttacherTypeRecurringJobApp we will encounter other issues:
  1. If the recurring job created the VA ticket then it is deleted. Now, no one is going to cleanup the VA ticket and volume stuck in attached state forever.
  2. Even worse, the AttacherTypeRecurringJobApp is not interuptable so the workload pod cannot attach the volume. We can certainly make it interruptable but the logic will become more complicated.

Given the small benefit and the potential downside, I suggest that we don't go with this approach. However, if the team strongly agrees with this approach, it would be understandable for me.

c3y1huang · 2024-02-23T01:14:41Z

We identified the root cause and fix the root cause. The test is passing 100/100 times now. So the benefit of this approach has become limited as the root cause has been fixed./

@PhanLe1010 , the intention of this PR is not to overtake your approach, but to avoid unnecessary intermediate detachment when recurring job runs backup.

This PR currently set the VA ticket type as longhorn.AttacherTypeBackupController. The backup controller does not create the VA ticket.

Yes noticed, I am aiming for AttacherTypeRecurringJobApp and fixed it locally but haven't pushed it yet.

If we change the PR to use a different type AttacherTypeRecurringJobApp we will encounter other issues:
If the recurring job created the VA ticket then it is deleted. Now, no one is going to cleanup the VA ticket and volume stuck in attached state forever.

The recurring job controller should handle the case (fixed locally but hasn't been pushed). https://github.com/longhorn/longhorn-manager/blob/master/controller/recurring_job_controller.go#L355

Even worse, the AttacherTypeRecurringJobApp is not interuptable so the workload pod cannot attach the volume. We can certainly make it interruptable but the logic will become more complicated.

Could you elaborate?

This approach might hide the potential races that should be been caught by e2e tests. Then other clients other than the recurring job might still hit the races.

I agree with your point here even though this case was specifically for the recurring job.

I've closed this PR so we are not confused by its origin. This isn't a fix for longhorn/longhorn#7937; rather, it's an improvement stemming from it. We can reopen this if others think this would be beneficial.

PhanLe1010 · 2024-02-23T17:54:31Z

Even worse, the AttacherTypeRecurringJobApp is not interuptable so the workload pod cannot attach the volume. We can certainly make it interruptable but the logic will become more complicated.

Could you elaborate?

Hi @c3y1huang , the situation I was mentioning is that while the recurring job attaches the volume to take a backup, if the workload pod is scaled up and the pod is scheduled to a new node, the pod cannot attach the volume because the recurring job is still holding the volume on different node. If backup take 1 hour, the pod will not be able to start for 1 hour.

Thank you for the explanation about other points! Yeah, we can see the feedback from others. If the team strongly agrees with this approach, it would be understandable for me.

c3y1huang self-assigned this Feb 22, 2024

c3y1huang force-pushed the fix-snapshot-backup-rpc-unavailable branch from 439cf81 to c1d4266 Compare February 22, 2024 06:36

c3y1huang changed the title ~~fix(recurring-job): avoid intermediate volume detachment~~ fix(recurring-job): avoid intermediate volume detachment during backup Feb 22, 2024

c3y1huang force-pushed the fix-snapshot-backup-rpc-unavailable branch from c1d4266 to 04e9213 Compare February 22, 2024 06:37

innobead requested review from mantissahz and PhanLe1010 February 22, 2024 06:49

c3y1huang force-pushed the fix-snapshot-backup-rpc-unavailable branch 2 times, most recently from aa5ea8e to 08e8564 Compare February 22, 2024 07:05

fix(recurring-job): avoid intermediate detachment during backup

3bc6b5c

longhorn/longhorn-7937 Signed-off-by: Chin-Ya Huang <chin-ya.huang@suse.com>

c3y1huang force-pushed the fix-snapshot-backup-rpc-unavailable branch from 08e8564 to e3d4e35 Compare February 22, 2024 07:12

refactor(recurring-job): remove unused methods

5bec953

Signed-off-by: Chin-Ya Huang <chin-ya.huang@suse.com>

c3y1huang force-pushed the fix-snapshot-backup-rpc-unavailable branch from e3d4e35 to 5bec953 Compare February 22, 2024 07:13

c3y1huang closed this Feb 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(recurring-job): avoid intermediate volume detachment during backup #2629

fix(recurring-job): avoid intermediate volume detachment during backup #2629

c3y1huang commented Feb 22, 2024

PhanLe1010 commented Feb 23, 2024 •

edited

Loading

c3y1huang commented Feb 23, 2024 •

edited

Loading

PhanLe1010 commented Feb 23, 2024

fix(recurring-job): avoid intermediate volume detachment during backup #2629

fix(recurring-job): avoid intermediate volume detachment during backup #2629

Conversation

c3y1huang commented Feb 22, 2024

Which issue(s) this PR fixes:

What this PR does / why we need it:

Special notes for your reviewer:

Additional documentation or context

PhanLe1010 commented Feb 23, 2024 • edited Loading

c3y1huang commented Feb 23, 2024 • edited Loading

PhanLe1010 commented Feb 23, 2024

PhanLe1010 commented Feb 23, 2024 •

edited

Loading

c3y1huang commented Feb 23, 2024 •

edited

Loading