Put tasks that get resource quota error in pending #56
Conversation
```diff
@@ -455,20 +455,19 @@ func shouldRetry(app *v1beta2.SparkApplication) bool {
 	}
 	case v1beta2.PendingSubmissionState:
 		var interval int64 = 257
-		if app.Spec.Mode == v1beta2.ClientMode && hasRetryIntervalPassed(&interval, app.Status.SubmissionAttempts, app.CreationTimestamp) && app.Status.SubmissionAttempts < 14 {
+		if app.Spec.Mode != v1beta2.ClusterMode && hasRetryIntervalPassed(&interval, app.Status.SubmissionAttempts, app.CreationTimestamp) && app.Status.SubmissionAttempts <= 14 {
```
Let's not hardcode 14; use the field set as part of the CRD spec. Defaulting if it's not set in the CRD is fine.
Okay, I can set it to submissionRetries, but if it's not in the CRD it will break the operator.
You can set a default in https://github.com/lyft/spark-on-k8s-operator/blob/master/pkg/apis/sparkoperator.k8s.io/v1beta2/defaults.go to be safe?
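For illustration, a minimal sketch of that defaulting, assuming a pointer-valued SubmissionRetries field on the spec; the actual field name and type in this fork may differ:

```go
// pkg/apis/sparkoperator.k8s.io/v1beta2/defaults.go (sketch, field name assumed)

// setSubmissionRetriesDefault keeps the previously hardcoded value (14) as the
// fallback when the CRD spec does not set a retry budget, so shouldRetry
// always has a concrete limit to compare against.
func setSubmissionRetriesDefault(app *SparkApplication) {
	if app.Spec.SubmissionRetries == nil {
		defaultRetries := int32(14)
		app.Spec.SubmissionRetries = &defaultRetries
	}
}
```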
+1
Okay, sounds good. Added.
Going to test in Flyte staging before merging this.
Before this change there was an exponential retry for failures related to resource quota: the submission was marked as failed and the Spark application was resubmitted, so the Flyte UI just showed the task as failed. The request was that the task not fail and instead retry all 14 submissions using the same entry. This is now done by keeping the Spark application in the pending-submission state until all 14 retries are done, so on the Flyte UI it is shown as still running (see the sketch after the ticket links below).
https://jira.lyft.net/browse/DATAHELP-4360
https://jira.lyft.net/browse/DATAHELP-4388
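For reference, a sketch of the pending-submission branch of shouldRetry with the review feedback applied; the SubmissionRetries field and its defaulting are assumptions based on the discussion above, not necessarily the exact code merged in this PR:

```go
// Sketch only: the retry budget comes from the (defaulted) spec field rather
// than the hardcoded 14, and the application stays in PendingSubmissionState
// until that budget is exhausted.
case v1beta2.PendingSubmissionState:
	var interval int64 = 257
	maxRetries := *app.Spec.SubmissionRetries // assumed field, defaulted in defaults.go when unset
	if app.Spec.Mode != v1beta2.ClusterMode &&
		hasRetryIntervalPassed(&interval, app.Status.SubmissionAttempts, app.CreationTimestamp) &&
		app.Status.SubmissionAttempts <= maxRetries {
		return true
	}
```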
Before and after on the Flyte UI using the functional test:
before: https://flyte-staging.lyft.net/console/projects/flytefunctionaltests/domains/development/executions/vejslkh4vp
after: https://flyte-staging.lyft.net/console/projects/flytefunctionaltests/domains/development/executions/t7gaw9zu5i
Screenshots from running on sonata-staging:
![Screen Shot 2021-05-05 at 2 52 18 PM](https://user-images.githubusercontent.com/55162494/117193997-8eb56000-adb1-11eb-9f0b-0e6857b34e61.png)
![Screen Shot 2021-05-05 at 2 51 47 PM](https://user-images.githubusercontent.com/55162494/117194186-c8866680-adb1-11eb-89ea-3119d3a87706.png)
Run #13
After 14 Runs