Conversation

@gaurav7261
Contributor

[FEAT] databricks repair run with reason match and appropriate new settings in DatabricksRunNowOperator
This is useful in scenarios where we get a Databricks exception such as AWS_INSUFFICIENT_INSTANCE_CAPACITY_FAILURE (CLIENT_ERROR) or AWS_MAX_SPOT_INSTANCE_COUNT_EXCEEDED_FAILURE, and the user wants to repair the run with, say, a different availability zone in the region, or with 100% on-demand instances.
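For illustration, the new argument can be wired up roughly as below. The job ID and the settings payloads are made-up placeholders, and the exact shape of each settings value follows the Databricks Jobs update API; treat this as a configuration sketch, not a verbatim example from the PR.

```python
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

# Hypothetical usage sketch: job_id and the settings payloads are placeholders.
run_job = DatabricksRunNowOperator(
    task_id="run_databricks_job",
    job_id=1234,
    repair_run=True,
    # When the failure reason contains one of these keys, the job is updated
    # with the mapped settings before the repair-run API call is made.
    databricks_repair_reason_new_settings={
        # e.g. move the job clusters to a different availability zone
        "AWS_INSUFFICIENT_INSTANCE_CAPACITY_FAILURE": {...},
        # e.g. switch the job clusters to 100% on-demand instances
        "AWS_MAX_SPOT_INSTANCE_COUNT_EXCEEDED_FAILURE": {...},
    },
)
```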


@gaurav7261 gaurav7261 force-pushed the feat/databricks_repair_reason_with_new_settings branch 5 times, most recently from b1e0418 to 4ad4885 Compare August 12, 2024 21:00
@gaurav7261 gaurav7261 force-pushed the feat/databricks_repair_reason_with_new_settings branch from 4ad4885 to a21faf3 Compare August 12, 2024 21:31
@Lee-W Lee-W requested a review from pankajkoti August 14, 2024 06:24
@gaurav7261 gaurav7261 force-pushed the feat/databricks_repair_reason_with_new_settings branch 7 times, most recently from 98efbd6 to 3869e83 Compare August 23, 2024 17:13
@gaurav7261 gaurav7261 force-pushed the feat/databricks_repair_reason_with_new_settings branch from 3869e83 to 67fb037 Compare August 23, 2024 18:04
@gaurav7261
Contributor Author

@Lee-W @shahar1 could you please review now?

Contributor

@shahar1 shahar1 left a comment

LGTM

Member

@Lee-W Lee-W left a comment

no huge issue, just a few nitpicks

@gaurav7261
Contributor Author

@Lee-W the requested changes are done; I also added an assert_called_with check for the update_job method.

@gaurav7261
Contributor Author

gaurav7261 commented Aug 25, 2024

@Lee-W should we add one more configuration option so that the user can repair only when the reason matches a key in the databricks_repair_reason_new_settings dict, and otherwise not repair at all (since it is then a legitimate error, such as a job issue, that will most probably not be resolved by one more repair run)? What do you think, and how could we enable that: with one more argument, with an empty databricks_repair_reason_new_settings meaning "don't repair", or something else you have in mind?

@Lee-W
Member

Lee-W commented Aug 25, 2024

> @Lee-W should we add one more configuration option so that the user can repair only when the reason matches a key in the databricks_repair_reason_new_settings dict, and otherwise not repair at all (since it is then a legitimate error, such as a job issue, that will most probably not be resolved by one more repair run)? What do you think, and how could we enable that: with one more argument, with an empty databricks_repair_reason_new_settings meaning "don't repair", or something else you have in mind?

I thought we already have the repair_run config?

@gaurav7261
Contributor Author

gaurav7261 commented Aug 25, 2024

> > @Lee-W should we add one more configuration option so that the user can repair only when the reason matches a key in the databricks_repair_reason_new_settings dict, and otherwise not repair at all (since it is then a legitimate error, such as a job issue, that will most probably not be resolved by one more repair run)? What do you think, and how could we enable that: with one more argument, with an empty databricks_repair_reason_new_settings meaning "don't repair", or something else you have in mind?
>
> I thought we already have the repair_run config?

repair_run is just a bool currently

Currently the flow is: if the repair_run bool is true, a repair is always triggered; if the repair reason matches a key in databricks_repair_reason_new_settings, we just update the Databricks job before calling the repair-run API.
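The flow described here can be sketched in plain Python. The function and variable names below are illustrative, not the provider's actual internals:

```python
def choose_repair_settings(repair_reason, reason_new_settings):
    """Return the settings whose key appears in the failure reason, or None.

    Illustrative sketch: a matching reason only adds a job update before the
    repair; when repair_run is True the repair itself is always triggered.
    """
    for reason_key, new_settings in reason_new_settings.items():
        if reason_key in repair_reason:
            return new_settings
    return None
```

For example, a reason string containing `AWS_INSUFFICIENT_INSTANCE_CAPACITY_FAILURE` would select the settings mapped to that key, while any other failure would leave the job unchanged before the repair call.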

@Lee-W
Member

Lee-W commented Aug 27, 2024

> > > @Lee-W should we add one more configuration option so that the user can repair only when the reason matches a key in the databricks_repair_reason_new_settings dict, and otherwise not repair at all (since it is then a legitimate error, such as a job issue, that will most probably not be resolved by one more repair run)? What do you think, and how could we enable that: with one more argument, with an empty databricks_repair_reason_new_settings meaning "don't repair", or something else you have in mind?
> >
> > I thought we already have the repair_run config?
>
> repair_run is just a bool currently
>
> Currently the flow is: if the repair_run bool is true, a repair is always triggered; if the repair reason matches a key in databricks_repair_reason_new_settings, we just update the Databricks job before calling the repair-run API.

Yep, that sounds good. Is anything missing?

@gaurav7261
Contributor Author

@Lee-W I'm thinking of an option not to trigger a repair when the reason matches nothing in the databricks_repair_reason_new_settings dict. What are your thoughts on this?

@Lee-W
Member

Lee-W commented Aug 28, 2024

> @Lee-W I'm thinking of an option not to trigger a repair when the reason matches nothing in the databricks_repair_reason_new_settings dict. What are your thoughts on this?

I don't think we need a new option. If the reason dict is provided and nothing is matched, then we should not repair; this should be the default behavior. If the user wants to repair in all cases, then the reason dict should not be passed.
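The default behaviour agreed here can be summarised as a small decision function; the name and signature below are mine for illustration, not the provider's API:

```python
def should_repair(repair_run, reason_new_settings, failure_reason):
    """Sketch of the agreed default: without a reason dict, repair_run alone
    decides; with a reason dict, repair only when some key appears in the
    failure reason (names here are illustrative)."""
    if not repair_run:
        return False
    if not reason_new_settings:
        # No dict passed: repair in all cases.
        return True
    return any(key in failure_reason for key in reason_new_settings)
```

So passing an empty or absent dict preserves the old "always repair when repair_run is true" behaviour, while a populated dict restricts repairs to matched reasons.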

@gaurav7261
Contributor Author

Thanks @Lee-W, making the changes. Yes, I agree with that default behaviour as well.

@gaurav7261
Contributor Author

@Lee-W the changes are done, could you please review?

@Lee-W Lee-W self-requested a review August 28, 2024 11:20
@Lee-W
Member

Lee-W commented Aug 30, 2024

LGTM 👍

@Lee-W Lee-W merged commit 365b42f into apache:main Aug 30, 2024
@gaurav7261
Contributor Author

@Lee-W currently we are checking only the root state_message; should we also check the error key in each entry of the failed tasks array?

@Lee-W
Member

Lee-W commented Sep 4, 2024

> @Lee-W currently we are checking only the root state_message; should we also check the error key in each entry of the failed tasks array?

Could you provide an example to illustrate this? Thanks!

@gaurav7261
Contributor Author

For example, sometimes in the case of a custom exception, when the user wants to retry with another job config, the root state_message contains only the generic text below, while the main exception is inside the tasks array:
```
Job run failed with terminal state: {'life_cycle_state': 'INTERNAL_ERROR', 'result_state': 'FAILED', 'state_message': 'Task t1 failed with message: Workload failed, see run output for details. This caused all downstream tasks to get skipped.'} and with the errors [{'task_key': 't1', 'run_id': 293420302889009, 'error': '<span class=\'ansi-red-fg\'>APIError</span>: <!DOCTYPE html>\n<html lang=en>\n <meta charset=utf-8>\n <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">\n <title>Error 502 (Server Error)!!1</title>\n <style>\n {margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px} > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}\n </style>\n <a href=//www.google.com/><span id=logo aria-label=Google></span></a>\n <p><b>502.</b> <ins>That’s an error.</ins>\n <p>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds. <ins>That’s all we know.</ins>\n'}]
```

@gaurav7261
Contributor Author

@Lee-W any suggestions?

@Lee-W
Member

Lee-W commented Sep 10, 2024

It looks like the error is part of the state_message. Can the user still match it in the current implementation, or is there anything I missed?

@gaurav7261
Contributor Author

Yeah, but not always. Sometimes the root state_message is just "Task t1, t2... failed with message: Workload failed, see run output for details. This caused all downstream tasks to get skipped.", with no further information. In that case, can we also allow matching against the error key inside the tasks array? What do you think, @Lee-W? From what I have seen, for AWS-side failures we get a meaningful root state_message, but when the user throws a custom exception, or for some Databricks errors, the state_message is just the generic "Task failed: Workload failed, see run output for details..." line, with nothing about the main exception string.
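The follow-up idea can be sketched as collecting every candidate string to match against, not just the root message. The function and key names below are assumptions for illustration, based on the run payload shown earlier, not the provider's actual code:

```python
def reasons_to_search(run_state, run_info):
    """Collect the root state_message plus each failed task's 'error' string,
    so a reason key can be matched against any of them.

    Illustrative sketch: run_state and run_info mirror the dicts in the
    example payload above (state_message, tasks[].error).
    """
    candidates = [run_state.get("state_message", "")]
    for task in run_info.get("tasks", []):
        error = task.get("error")
        if error:
            candidates.append(error)
    return candidates
```

With this, a reason key like `APIError` would match the per-task error even when the root state_message is only the generic "Workload failed" text.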

@Lee-W
Member

Lee-W commented Sep 12, 2024

If that's the case, it makes sense to have one.

@gaurav7261
Contributor Author

Cool, I will create a new PR for that. Thanks!
