Fix scheduler heartbeat misses caused by slow reschedule dependency check #61983
…heck When many task instances enter UP_FOR_RESCHEDULE state, the query to fetch the latest reschedule date becomes slow due to a missing composite index. This causes the scheduler to miss heartbeats. Previously only sensors used reschedule mode, but since fddf4a7, non-sensor tasks can also be rescheduled, significantly increasing the number of rows per task instance in the task_reschedule table. Add a composite (ti_id, id DESC) index to the task_reschedule table, replacing the single-column (ti_id) index.
Backport failed to create: v3-1-test.
Note: see "Merging PRs targeted for Airflow 3.X". If in doubt, please ask in the #release-management Slack channel.
You can attempt to backport this manually by running:
cherry_picker 9880716 v3-1-test
This should apply the commit to the v3-1-test branch and leave it in a conflict state. After you have resolved the conflicts, you can continue the backport process by running:
cherry_picker --continue
If you don't have cherry-picker installed, see the installation guide.
…dependency check (#61983) (#62068)
* Add index on task_reschedule ti_id (#60931) (cherry picked from commit 14e811c)
* Fix scheduler heartbeat misses caused by slow reschedule dependency check (#61983)
When many task instances enter UP_FOR_RESCHEDULE state, the query to fetch the latest reschedule date becomes slow due to a missing composite index. This causes the scheduler to miss heartbeats. Previously only sensors used reschedule mode, but since fddf4a7, non-sensor tasks can also be rescheduled, significantly increasing the number of rows per task instance in the task_reschedule table. Add a composite (ti_id, id DESC) index to the task_reschedule table, replacing the single-column (ti_id) index.
(cherry picked from commit 9880716)
---------
Co-authored-by: Guan-Ming (Wesley) Chiu <105915352+guan404ming@users.noreply.github.com>
Thank you very much for this contribution. I was having headaches understanding why my scheduler was stuck for 30s, which was slowing down my whole pipeline. This PR solves everything.
When many task instances enter UP_FOR_RESCHEDULE state, the query to fetch the latest reschedule date becomes slow due to a missing composite index. This causes the scheduler to miss heartbeats.
Previously only sensors used reschedule mode, but since fddf4a7, non-sensor tasks can also be rescheduled, significantly increasing the number of rows per task instance in the task_reschedule table.
Add a composite (ti_id, id DESC) index to the task_reschedule table, replacing the single-column (ti_id) index.
The reschedule query:
airflow/airflow-core/src/airflow/ti_deps/deps/ready_to_reschedule.py (line 73 in baa8c72)
Other places that can benefit from the index:
airflow/airflow-core/src/airflow/models/taskinstance.py (line 1171 in baa8c72)
airflow/airflow-core/src/airflow/api_fastapi/execution_api/routes/task_reschedules.py (line 40 in baa8c72)
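To illustrate the access pattern the composite (ti_id, id DESC) index targets, here is a minimal sketch using Python's built-in sqlite3 module. It is not the Airflow code: the table is reduced to three columns, the index name idx_task_reschedule_ti_id_id_desc is hypothetical, and the query shape (filter on ti_id, order by id descending, take one row) is an assumption inferred from the description "fetch the latest reschedule date". SQLite's planner also differs from PostgreSQL's and MySQL's, so this only shows that such a query can be served by the index, not how much faster it becomes.

```python
import sqlite3

# In-memory database with a simplified stand-in for task_reschedule.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE task_reschedule (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        ti_id TEXT NOT NULL,
        reschedule_date TEXT NOT NULL
    );
    -- Hypothetical index name; the column order mirrors (ti_id, id DESC).
    CREATE INDEX idx_task_reschedule_ti_id_id_desc
        ON task_reschedule (ti_id, id DESC);
    """
)

# 50 task instances with 20 reschedule rows each, inserted in time order.
rows = [(f"ti-{n % 50}", f"2025-01-01T00:{n // 50:02d}:00") for n in range(1000)]
conn.executemany(
    "INSERT INTO task_reschedule (ti_id, reschedule_date) VALUES (?, ?)", rows
)

# Assumed query shape: newest row for a single task instance.
LATEST = (
    "SELECT reschedule_date FROM task_reschedule "
    "WHERE ti_id = ? ORDER BY id DESC LIMIT 1"
)

latest = conn.execute(LATEST, ("ti-7",)).fetchone()[0]

# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + LATEST, ("ti-7",))]

print(latest)
print(plan)
```

With ti_id as the leading column and id stored in descending order, the database can seek to the first index entry for a task instance and stop, rather than reading and sorting every reschedule row for that instance. To check the effect on a real deployment, run the database's own EXPLAIN on the actual query before and after the migration.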