Skip to content

Conversation

@sjyangkevin
Copy link
Contributor

@sjyangkevin sjyangkevin commented Sep 4, 2025

Close: #55054

Reproduce the error

The task failed due to the conflict error.
Screenshot from 2025-09-03 23-44-51

The user action is taken, so the issue is likely in the trigger timeout fallback.
Screenshot from 2025-09-04 00-03-06

The root cause for the issue

The timeout logic in the HITL trigger did not check whether a response had already been received before applying the default fallback. As a result, if a user submitted a response right before the timeout, both the user action and the timeout handler could attempt to update the HITL detail simultaneously, causing the conflict error.

Approach

Added a check in the timeout handler to fetch the HITL detail and verify if a response has already been received. If a response exists, the timeout fallback is skipped.

Result

The timeout fallback will be skipped if an action has been performed by an user right before the timeout.
Screenshot from 2025-09-04 00-19-32

Screenshot from 2025-09-04 00-28-18

Would appreciate any feedback if the code structure and the trigger loop can be further optimized. Working on adding test cases.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in airflow-core/newsfragments.

@sjyangkevin
Copy link
Contributor Author

Hi @Lee-W and @vatsrahul1001 , I think I figure out the root cause and propose the method to add a check for user response before running the timeout fallback logic. Let me know if you have any feedback. I will add the test cases by tomorrow. thanks!

Copy link
Member

@Lee-W Lee-W left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. should be able to merge once test added

@sjyangkevin sjyangkevin force-pushed the issues/55054/hitl-409-conflict branch from ad2413c to 81ff143 Compare September 5, 2025 03:08
@sjyangkevin sjyangkevin changed the title [WIP] HITL: Resolve Conflict 409 in API server when user actions at nearly timeout HITL: Resolve Conflict 409 in API server when user actions at nearly timeout Sep 5, 2025
Copy link
Member

@Lee-W Lee-W left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@sjyangkevin
Copy link
Contributor Author

Hi @Lee-W , I've added the test cases and notice a small issue when running the test cases for timeout. From my understanding, the intention of using await asyncio.sleep(0.3) is to move the timestamp forward. Then, in the trigger, we can compare the timeout_datetime with the current time (i.e., utcnow()) for testing the timeout. However, the behavior could be different when setting the time sleep in the following ways. Let me know if it make sense, or if I need to simplify the description. Thanks!

Method A

trigger_task = asyncio.create_task(gen.__anext__())
await asyncio.sleep(0.3)

I put some logging commands in the trigger code and observe the following situation. In the terminal output, right under ##1 flag. The first timestamp is the timeout_datetime and the second timestamp is the utcnow() in the run method. It looks like asyncio.create_task(gen.__anext__()) will run the task right the way, so the utcnow() in the run method is slightly earlier than the timeout_datetime=utcnow() + timedelta(seconds=0.1). Therefore, the timeout handle block is not triggered because comparing utcnow() and timeout_datetime returns False.

Screenshot from 2025-09-04 22-44-24

Why the existing test case can still jump into the timeout handle?

I ran multiple experiments, and I think it is related to how the mock is set. If either response_received is false or chosen_options is None. The if resp.response_received and resp.chosen_options: will be evaluated as false and continue on the while loop (because the above scenario, it doesn't jump into the timeout handle and the first time). Then, the utcnow() evaluated again, and this time is greater than the set timeout_datetime. So it jumps into timeout handle.

mock_supervisor_comms.send.return_value = HITLDetailResponse(
    response_received=False,
    responded_user_id=None,
    responded_user_name=None,
    response_at=None,
    chosen_options=None,
    params_input={},
)

Method B

await asyncio.sleep(0.3)
trigger_task = asyncio.create_task(gen.__anext__())

By using this approach, the above issue will be fixed. Since we wait for a short moment and start executing the trigger. The utcnow() in the run method will be greater than the timeout_datetime when the run method is invoked.
Screenshot from 2025-09-04 22-44-00

@Lee-W
Copy link
Member

Lee-W commented Sep 5, 2025

the utcnow() evaluated again, and this time is greater than the set timeout_datetime.

If it's assigned to a variable ahead, it should be a fix value 🤔 but the new change still makes sense to me. the underline logic of checking the timeout is correct

@Lee-W Lee-W merged commit fe1edd4 into apache:main Sep 5, 2025
107 checks passed
@github-project-automation github-project-automation bot moved this from In progress to Done in AIP-90 - Human in the loop Sep 5, 2025
RoyLee1224 pushed a commit to RoyLee1224/airflow that referenced this pull request Sep 8, 2025
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Sep 30, 2025
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 1, 2025
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 2, 2025
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 3, 2025
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 4, 2025
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 5, 2025
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 5, 2025
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 7, 2025
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 8, 2025
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 9, 2025
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 10, 2025
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 11, 2025
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 12, 2025
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 14, 2025
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 15, 2025
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 17, 2025
abdulrahman305 bot pushed a commit to abdulrahman305/airflow that referenced this pull request Oct 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

HITL: Getting 409 in API server

2 participants