Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Put tasks that get resource quota error in pending #56

Merged
merged 7 commits into from
May 5, 2021

Conversation

michaelawilkins
Copy link

@michaelawilkins michaelawilkins commented May 4, 2021

Before this change there is an expoential retry for failures regarding resource quota, they are failed and then the spark application is resubmitted so the flyte ui just shows task as failed. It is requested that the task not fail and retry all 14 submissions using the same entry. This now done by keeping the spark application in pending submission until all 14 retries are done so on the flyte ui it is seen as still running.

https://jira.lyft.net/browse/DATAHELP-4360
https://jira.lyft.net/browse/DATAHELP-4388

Before and After on flyte UI using functional test
before: https://flyte-staging.lyft.net/console/projects/flytefunctionaltests/domains/development/executions/vejslkh4vp
after: https://flyte-staging.lyft.net/console/projects/flytefunctionaltests/domains/development/executions/t7gaw9zu5i

Screenshots from running on sonata-staging--
Run #13
Screen Shot 2021-05-05 at 2 52 18 PM
After 14 Runs
Screen Shot 2021-05-05 at 2 51 47 PM

@michaelawilkins michaelawilkins changed the base branch from multi-v1beta2 to master May 5, 2021 18:47
@michaelawilkins
Copy link
Author

@michaelawilkins michaelawilkins changed the title Put tasks that resource quota in pending Put tasks that get resource quota error in pending May 5, 2021
@@ -455,20 +455,19 @@ func shouldRetry(app *v1beta2.SparkApplication) bool {
}
case v1beta2.PendingSubmissionState:
var interval int64 = 257
if app.Spec.Mode == v1beta2.ClientMode && hasRetryIntervalPassed(&interval, app.Status.SubmissionAttempts, app.CreationTimestamp) && app.Status.SubmissionAttempts < 14 {
if app.Spec.Mode != v1beta2.ClusterMode && hasRetryIntervalPassed(&interval, app.Status.SubmissionAttempts, app.CreationTimestamp) && app.Status.SubmissionAttempts <= 14 {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's not hardcode 14 and use the field set as part of CRD Spec. Defaulting if not set in CRD is fine.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay i can set to submissionRetries but if its not in crd it will break the operator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay sounds good, added

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

going to test in flyte staging before merging this

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants