Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(recurring-job): avoid intermediate volume detachment during backup #2629

Closed

Conversation

c3y1huang
Copy link
Contributor

Which issue(s) this PR fixes:

Issue longhorn/longhorn#7937

What this PR does / why we need it:

This was initially created to resolve the issue in longhorn/longhorn#7937, and @PhanLe1010 proposed an alternative solution to tackle the problem.

However, Longhorn could still benefit by avoiding the unnecessary volume detachment during the recurring job backup. Currently, the recurring job runs the snapshot followed by the backup, resulting in an intermediate volume detachment.

Special notes for your reviewer:

None

Additional documentation or context

@c3y1huang c3y1huang self-assigned this Feb 22, 2024
@c3y1huang c3y1huang force-pushed the fix-snapshot-backup-rpc-unavailable branch from 439cf81 to c1d4266 Compare February 22, 2024 06:36
@c3y1huang c3y1huang changed the title fix(recurring-job): avoid intermediate volume detachment fix(recurring-job): avoid intermediate volume detachment during backup Feb 22, 2024
@c3y1huang c3y1huang force-pushed the fix-snapshot-backup-rpc-unavailable branch from c1d4266 to 04e9213 Compare February 22, 2024 06:37
@c3y1huang c3y1huang force-pushed the fix-snapshot-backup-rpc-unavailable branch 2 times, most recently from aa5ea8e to 08e8564 Compare February 22, 2024 07:05
longhorn/longhorn-7937

Signed-off-by: Chin-Ya Huang <chin-ya.huang@suse.com>
@c3y1huang c3y1huang force-pushed the fix-snapshot-backup-rpc-unavailable branch from 08e8564 to e3d4e35 Compare February 22, 2024 07:12
Signed-off-by: Chin-Ya Huang <chin-ya.huang@suse.com>
@c3y1huang c3y1huang force-pushed the fix-snapshot-backup-rpc-unavailable branch from e3d4e35 to 5bec953 Compare February 22, 2024 07:13
@PhanLe1010
Copy link
Contributor

PhanLe1010 commented Feb 23, 2024

Thank you for the idea! I have some feedbacks as:

  1. We identified the root cause and fix the root cause. The test is passing 100/100 times now. So the benefit of this approach has become limited as the root cause has been fixed.
  2. This approach might hide the potential races that should be been caught by e2e tests. Then other clients other than the recurring job might still hit the races.
  3. If we are creating the VA ticket for recurring jobs there are a few concerns:
    1. This PR currently set the VA ticket type as longhorn.AttacherTypeBackupController. The VA ticket is not created by the backup controller.
    2. If we change the PR to use a different type AttacherTypeRecurringJobApp we will encounter other issues:
      1. If the recurring job created the VA ticket then it is deleted. Now, no one is going to cleanup the VA ticket and volume stuck in attached state forever.
      2. Even worse, the AttacherTypeRecurringJobApp is not interuptable so the workload pod cannot attach the volume. We can certainly make it interruptable but the logic will become more complicated.

Given the small benefit and the potential downside, I suggest that we don't go with this approach. However, if the team strongly agrees with this approach, it would be understandable for me.

@c3y1huang c3y1huang closed this Feb 23, 2024
@c3y1huang
Copy link
Contributor Author

c3y1huang commented Feb 23, 2024

We identified the root cause and fix the root cause. The test is passing 100/100 times now. So the benefit of this approach has become limited as the root cause has been fixed./

@PhanLe1010 , the intention of this PR is not to overtake your approach, but to avoid unnecessary intermediate detachment when recurring job runs backup.

This PR currently set the VA ticket type as longhorn.AttacherTypeBackupController. The backup controller does not create the VA ticket.

Yes noticed, I am aiming for AttacherTypeRecurringJobApp and fixed it locally but haven't pushed it yet.

If we change the PR to use a different type AttacherTypeRecurringJobApp we will encounter other issues:
If the recurring job created the VA ticket then it is deleted. Now, no one is going to cleanup the VA ticket and volume stuck in attached state forever.

The recurring job controller should handle the case (fixed locally but hasn't been pushed). https://github.com/longhorn/longhorn-manager/blob/master/controller/recurring_job_controller.go#L355

Even worse, the AttacherTypeRecurringJobApp is not interuptable so the workload pod cannot attach the volume. We can certainly make it interruptable but the logic will become more complicated.

Could you elaborate?

This approach might hide the potential races that should be been caught by e2e tests. Then other clients other than the recurring job might still hit the races.

I agree with your point here even though this case was specifically for the recurring job.

I've closed this PR so we are not confused by its origin. This isn't a fix for longhorn/longhorn#7937; rather, it's an improvement stemming from it. We can reopen this if others think this would be beneficial.

@PhanLe1010
Copy link
Contributor

Even worse, the AttacherTypeRecurringJobApp is not interuptable so the workload pod cannot attach the volume. We can certainly make it interruptable but the logic will become more complicated.

Could you elaborate?

Hi @c3y1huang , the situation I was mentioning is that while the recurring job attaches the volume to take a backup, if the workload pod is scaled up and the pod is scheduled to a new node, the pod cannot attach the volume because the recurring job is still holding the volume on different node. If backup take 1 hour, the pod will not be able to start for 1 hour.

Thank you for the explanation about other points! Yeah, we can see the feedback from others. If the team strongly agrees with this approach, it would be understandable for me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants