
[Feature][GCS FT] Redis cleanup improvement #1557

Closed
5 of 6 tasks
Tracked by #1033
kevin85421 opened this issue Oct 22, 2023 · 5 comments
Assignees
Labels
1.0 enhancement New feature or request

Comments

@kevin85421
Member

kevin85421 commented Oct 22, 2023

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Use case

No response

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@kevin85421 kevin85421 added enhancement New feature or request 1.0 labels Oct 22, 2023
@kevin85421
Member Author

@rueian @evalaiyc98 will take this issue.

@rueian
Contributor

rueian commented Oct 29, 2023

In terms of issue 3, there are two job configurations we can leverage:

  1. .spec.backoffLimit
  2. .spec.activeDeadlineSeconds

When setting .spec.backoffLimit to 5, the Job is terminated after 5+1 attempts, which takes about 310 seconds of accumulated exponential back-off, provided no pod is stuck in the Pending phase.

In contrast, when setting .spec.activeDeadlineSeconds, the Job is terminated once the deadline passes no matter what, even if the Job's pod is stuck in the Pending phase.

I would recommend setting .spec.activeDeadlineSeconds if we want to guarantee that the cleanup Job terminates.
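On the cleanup Job manifest, the two knobs would look roughly like this (a hedged sketch; the field names come from the batch/v1 API, the values are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
spec:
  # Option 1: retry cap; ~310s of back-off, but only if the pod actually
  # runs and fails. A pod stuck in Pending never counts as a failure.
  backoffLimit: 5
  # Option 2: hard wall-clock limit; terminates the Job even if its pod
  # is stuck in Pending. The 300s value here is illustrative.
  activeDeadlineSeconds: 300
```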

Either way, after setting one of these options, we can detect whether the Job has terminated by checking its .status.conditions, between these two existing lines of the controller (L283 and L284):

}
if redisCleanupJob.Status.Failed > 0 {

And then remove the finalizer. For example:

for _, condition := range redisCleanupJob.Status.Conditions {
	if condition.Status == corev1.ConditionTrue && condition.Type == batchv1.JobFailed {
		// The cleanup Job has permanently failed; remove the finalizer so
		// the RayCluster CR can be deleted.
		controllerutil.RemoveFinalizer(instance, common.GCSFaultToleranceRedisCleanupFinalizer)
		if err := r.Update(ctx, instance); err != nil {
			return ctrl.Result{RequeueAfter: DefaultRequeueDuration}, err
		}
		return ctrl.Result{}, nil
	}
}
if redisCleanupJob.Status.Failed > 0 {
	r.Log.Info("The Redis cleanup Job has failed; requeue the RayCluster CR after 1 minute.")
	return ctrl.Result{RequeueAfter: 1 * time.Minute}, nil
}

@kevin85421
Member Author

#1557 (comment)

This makes sense.

  1. activeDeadlineSeconds already exists in Kubernetes 1.19, so it will not introduce any compatibility issues.
  2. Maybe we can also consider using the utility function from Kubernetes if we do not need to import new packages.

@kevin85421
Member Author

I will take issue 3, and @evalaiyc98 will take issue 2.

@kevin85421
Member Author

kevin85421 commented Nov 1, 2023

I've reconsidered issue 3. I've decided not to adopt the best-effort strategy in v1.0.0. The reason is that the Redis cleanup feature isn't stable enough, and adopting this strategy might cause us to miss some opportunities to enhance the feature.

When we are confident that the Redis cleanup feature works well in most scenarios, we can adopt the best-effort strategy to enhance the user experience during certain edge cases, such as network connection issues.

In v1.0.0, I will minimize the UX friction when Redis cleanup fails by (1) minimizing the computing resources requested by the K8s Job, and (2) setting backoffLimit to 0 to avoid some issues related to ResourceQuota.
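The two mitigations could look roughly like this on the cleanup Job (a hedged sketch; the container name, image, and resource values are illustrative, not necessarily what KubeRay ships):

```yaml
apiVersion: batch/v1
kind: Job
spec:
  backoffLimit: 0          # fail fast: no retries that could pile up pods under a ResourceQuota
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: redis-cleanup          # illustrative container name
        resources:                   # small, explicit requests/limits so the
          requests:                  # Job is cheap and admissible under quotas
            cpu: 100m
            memory: 256Mi
          limits:
            cpu: 100m
            memory: 256Mi
```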
