-
Notifications
You must be signed in to change notification settings - Fork 16.4k
HITL: Resolve Conflict 409 in API server when user actions at nearly timeout #55243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi @Lee-W and @vatsrahul1001 , I think I figure out the root cause and propose the method to add a check for user response before running the timeout fallback logic. Let me know if you have any feedback. I will add the test cases by tomorrow. thanks! |
Lee-W
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good. should be able to merge once test added
ad2413c to
81ff143
Compare
Lee-W
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
|
Hi @Lee-W , I've added the test cases and notice a small issue when running the test cases for timeout. From my understanding, the intention of using Method Atrigger_task = asyncio.create_task(gen.__anext__())
await asyncio.sleep(0.3)I put some logging commands in the trigger code and observe the following situation. In the terminal output, right under
Why the existing test case can still jump into the timeout handle?I ran multiple experiments, and I think it is related to how the mock is set. If either mock_supervisor_comms.send.return_value = HITLDetailResponse(
response_received=False,
responded_user_id=None,
responded_user_name=None,
response_at=None,
chosen_options=None,
params_input={},
)Method Bawait asyncio.sleep(0.3)
trigger_task = asyncio.create_task(gen.__anext__())By using this approach, the above issue will be fixed. Since we wait for a short moment and start executing the trigger. The |
If it's assigned to a variable ahead, it should be a fix value 🤔 but the new change still makes sense to me. the underline logic of checking the timeout is correct |


Close: #55054
Reproduce the error
The task failed due to the conflict error.

The user action is taken, so the issue is likely in the trigger timeout fallback.

The root cause for the issue
The timeout logic in the HITL trigger did not check whether a response had already been received before applying the default fallback. As a result, if a user submitted a response right before the timeout, both the user action and the timeout handler could attempt to update the HITL detail simultaneously, causing the conflict error.
Approach
Added a check in the timeout handler to fetch the HITL detail and verify if a response has already been received. If a response exists, the timeout fallback is skipped.
Result
The timeout fallback will be skipped if an action has been performed by an user right before the timeout.

Would appreciate any feedback if the code structure and the trigger loop can be further optimized. Working on adding test cases.
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named
{pr_number}.significant.rstor{issue_number}.significant.rst, in airflow-core/newsfragments.