Conversation

@gaurav7261
Contributor

[FEAT] databricks repair run with reason match and appropriate new settings in DatabricksRunNowOperator
This is useful in scenarios where we get a Databricks exception such as AWS_INSUFFICIENT_INSTANCE_CAPACITY_FAILURE (CLIENT_ERROR) or AWS_MAX_SPOT_INSTANCE_COUNT_EXCEEDED_FAILURE, and the user wants to repair the run with, say, a different availability zone in the region, or with 100% on-demand instances.
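For illustration, the new argument can be wired up roughly as below. The job ID and the settings payloads are made-up placeholders, and the exact shape of each settings value follows the Databricks Jobs update API; treat this as a configuration sketch, not a verbatim example from the PR.

```python
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

# Hypothetical usage sketch: job_id and the settings payloads are placeholders.
run_job = DatabricksRunNowOperator(
    task_id="run_databricks_job",
    job_id=1234,
    repair_run=True,
    # When the failure reason contains one of these keys, the job is updated
    # with the mapped settings before the repair-run API call is made.
    databricks_repair_reason_new_settings={
        # e.g. move the job clusters to a different availability zone
        "AWS_INSUFFICIENT_INSTANCE_CAPACITY_FAILURE": {...},
        # e.g. switch the job clusters to 100% on-demand instances
        "AWS_MAX_SPOT_INSTANCE_COUNT_EXCEEDED_FAILURE": {...},
    },
)
```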


@gaurav7261 gaurav7261 force-pushed the feat/databricks_repair_reason_with_new_settings branch 5 times, most recently from b1e0418 to 4ad4885 Compare August 12, 2024 21:00
@gaurav7261 gaurav7261 force-pushed the feat/databricks_repair_reason_with_new_settings branch from 4ad4885 to a21faf3 Compare August 12, 2024 21:31
@Lee-W Lee-W requested a review from pankajkoti August 14, 2024 06:24
@gaurav7261 gaurav7261 force-pushed the feat/databricks_repair_reason_with_new_settings branch 7 times, most recently from 98efbd6 to 3869e83 Compare August 23, 2024 17:13
@gaurav7261 gaurav7261 force-pushed the feat/databricks_repair_reason_with_new_settings branch from 3869e83 to 67fb037 Compare August 23, 2024 18:04
@gaurav7261
Contributor Author

@Lee-W @shahar1 could you please review now?

Contributor

@shahar1 shahar1 left a comment

LGTM

Member

@Lee-W Lee-W left a comment

no huge issue, just a few nitpicks

@gaurav7261
Contributor Author

@Lee-W the requested changes are done; I also added an assert_called_with check for the update_job method.

@gaurav7261
Contributor Author

gaurav7261 commented Aug 25, 2024

@Lee-W should we add one more configuration option so that the user can repair only when the reason matches a key in the databricks_repair_reason_new_settings dict, and otherwise not repair at all (since it is then a legitimate error, such as a job issue, that will most probably not be resolved by one more repair run)? What do you think, and how could we enable that: with one more argument, with an empty databricks_repair_reason_new_settings meaning "don't repair", or something else you have in mind?

@Lee-W
Member

Lee-W commented Aug 25, 2024

> @Lee-W should we add one more configuration option so that the user can repair only when the reason matches a key in the databricks_repair_reason_new_settings dict, and otherwise not repair at all (since it is then a legitimate error, such as a job issue, that will most probably not be resolved by one more repair run)? What do you think, and how could we enable that: with one more argument, with an empty databricks_repair_reason_new_settings meaning "don't repair", or something else you have in mind?

I thought we already have the repair_run config?

@gaurav7261
Contributor Author

gaurav7261 commented Aug 25, 2024

> > @Lee-W should we add one more configuration option so that the user can repair only when the reason matches a key in the databricks_repair_reason_new_settings dict, and otherwise not repair at all (since it is then a legitimate error, such as a job issue, that will most probably not be resolved by one more repair run)? What do you think, and how could we enable that: with one more argument, with an empty databricks_repair_reason_new_settings meaning "don't repair", or something else you have in mind?
>
> I thought we already have the repair_run config?

repair_run is just a bool currently

Currently the flow is: if the repair_run bool is true, a repair is always triggered; if the repair reason matches a key in databricks_repair_reason_new_settings, we just update the Databricks job before calling the repair-run API.
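The flow described here can be sketched in plain Python. The function and variable names below are illustrative, not the provider's actual internals:

```python
def choose_repair_settings(repair_reason, reason_new_settings):
    """Return the settings whose key appears in the failure reason, or None.

    Illustrative sketch: a matching reason only adds a job update before the
    repair; when repair_run is True the repair itself is always triggered.
    """
    for reason_key, new_settings in reason_new_settings.items():
        if reason_key in repair_reason:
            return new_settings
    return None
```

For example, a reason string containing `AWS_INSUFFICIENT_INSTANCE_CAPACITY_FAILURE` would select the settings mapped to that key, while any other failure would leave the job unchanged before the repair call.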

@Lee-W
Member

Lee-W commented Aug 27, 2024

> > > @Lee-W should we add one more configuration option so that the user can repair only when the reason matches a key in the databricks_repair_reason_new_settings dict, and otherwise not repair at all (since it is then a legitimate error, such as a job issue, that will most probably not be resolved by one more repair run)? What do you think, and how could we enable that: with one more argument, with an empty databricks_repair_reason_new_settings meaning "don't repair", or something else you have in mind?
> >
> > I thought we already have the repair_run config?
>
> repair_run is just a bool currently
>
> Currently the flow is: if the repair_run bool is true, a repair is always triggered; if the repair reason matches a key in databricks_repair_reason_new_settings, we just update the Databricks job before calling the repair-run API.

Yep, that sounds good. Is anything missing?

@gaurav7261
Contributor Author

@Lee-W I'm thinking of an option not to trigger a repair when the reason matches nothing in the databricks_repair_reason_new_settings dict. What are your thoughts on this?

@Lee-W
Member

Lee-W commented Aug 28, 2024

> @Lee-W I'm thinking of an option not to trigger a repair when the reason matches nothing in the databricks_repair_reason_new_settings dict. What are your thoughts on this?

I don't think we need a new option. If the reason dict is provided and nothing is matched, then we should not repair; this should be the default behavior. If the user wants to repair in all cases, then the reason dict should not be passed.
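The default behaviour agreed here can be summarised as a small decision function; the name and signature below are mine for illustration, not the provider's API:

```python
def should_repair(repair_run, reason_new_settings, failure_reason):
    """Sketch of the agreed default: without a reason dict, repair_run alone
    decides; with a reason dict, repair only when some key appears in the
    failure reason (names here are illustrative)."""
    if not repair_run:
        return False
    if not reason_new_settings:
        # No dict passed: repair in all cases.
        return True
    return any(key in failure_reason for key in reason_new_settings)
```

So passing an empty or absent dict preserves the old "always repair when repair_run is true" behaviour, while a populated dict restricts repairs to matched reasons.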

@gaurav7261
Contributor Author

Thanks @Lee-W, making the changes. Yes, I agree with that default behaviour as well.

@gaurav7261
Contributor Author

@Lee-W the changes are done, could you please review?

@Lee-W Lee-W self-requested a review August 28, 2024 11:20
@Lee-W
Member

Lee-W commented Aug 30, 2024

LGTM 👍

@Lee-W Lee-W merged commit 365b42f into apache:main Aug 30, 2024
@gaurav7261
Contributor Author

@Lee-W currently we are checking only the root state_message; should we also check the error key in each entry of the failed tasks array?

@Lee-W
Member

Lee-W commented Sep 4, 2024

> @Lee-W currently we are checking only the root state_message; should we also check the error key in each entry of the failed tasks array?

Could you provide an example to illustrate this? Thanks!

@gaurav7261
Contributor Author

For example, sometimes in the case of a custom exception, when the user wants to retry with another job config, the root state_message contains only the generic text below, while the main exception is inside the tasks array:
```
Job run failed with terminal state: {'life_cycle_state': 'INTERNAL_ERROR', 'result_state': 'FAILED', 'state_message': 'Task t1 failed with message: Workload failed, see run output for details. This caused all downstream tasks to get skipped.'} and with the errors [{'task_key': 't1', 'run_id': 293420302889009, 'error': '<span class=\'ansi-red-fg\'>APIError</span>: <!DOCTYPE html>\n<html lang=en>\n <meta charset=utf-8>\n <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">\n <title>Error 502 (Server Error)!!1</title>\n <style>\n {margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px} > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}\n </style>\n <a href=//www.google.com/><span id=logo aria-label=Google></span></a>\n <p><b>502.</b> <ins>That’s an error.</ins>\n <p>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds. <ins>That’s all we know.</ins>\n'}]
```

@gaurav7261
Contributor Author

@Lee-W any suggestions?

@Lee-W
Member

Lee-W commented Sep 10, 2024

It looks like the error is part of the state_message. Can the user still match it in the current implementation, or is there anything I missed?

@gaurav7261
Contributor Author

Yeah, but not always. Sometimes the root state_message is just "Task t1, t2... failed with message: Workload failed, see run output for details. This caused all downstream tasks to get skipped.", with no further information. In that case, can we also allow matching against the error key inside the tasks array? What do you think, @Lee-W? From what I have seen, for AWS-side failures we get a meaningful root state_message, but when the user throws a custom exception, or for some Databricks errors, the state_message is just the generic "Task failed: Workload failed, see run output for details..." line, with nothing about the main exception string.
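The follow-up idea can be sketched as collecting every candidate string to match against, not just the root message. The function and key names below are assumptions for illustration, based on the run payload shown earlier, not the provider's actual code:

```python
def reasons_to_search(run_state, run_info):
    """Collect the root state_message plus each failed task's 'error' string,
    so a reason key can be matched against any of them.

    Illustrative sketch: run_state and run_info mirror the dicts in the
    example payload above (state_message, tasks[].error).
    """
    candidates = [run_state.get("state_message", "")]
    for task in run_info.get("tasks", []):
        error = task.get("error")
        if error:
            candidates.append(error)
    return candidates
```

With this, a reason key like `APIError` would match the per-task error even when the root state_message is only the generic "Workload failed" text.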

@Lee-W
Member

Lee-W commented Sep 12, 2024

If that's the case, it makes sense to have one.

@gaurav7261
Contributor Author

Cool, I will create a new PR for that. Thanks!
