Description
Problem
This issue applies to tasks on Airflow that interact with postgres (e.g. PythonOperator
s or Postgres*Operator
s).
Presently when a failure or execution timeout occurs, Airflow will raise an AirflowException
within the running task. For most cases operating on the python level, this is sufficient to stop the task. In cases where the task does not stop when asked kindly, Airflow will then send a SIGKILL
to the task and stop it.
When this happens to a task that involves postgres queries, the python script running on Airflow will stop but the postgres query running on the postgres backend will not. In some cases this can even cause the task to look like it's still running long after its execution timeout, because the Airflow task doesn't respond at all to the signals Airflow is sending and hangs until the query is complete (at least, to my recollection, we may need to confirm this). In cases where Airflow is able to advance, there are knock-on effects to the query still running.
Example: We had one instance where a load was occurring that had passed its execution timeout. I don't remember if we manually failed that task or if it failed on its own, but either way the DAG continued onto the next step which was to delete the intermediate load table. However, since the load to said intermediate table was still running on the postgres backend, the delete of the table hung until the former completed (or in this case, until we explicitly killed that query on the postgres backend).
Description
We should add a custom on_failure_callback
(or similar cleanup method) to certain postgres steps that we know will be longer running which will actually terminate the postgres query on the backend upon failure. This will involve looking up the query that is being run, then issuing a pg_terminate_backend
on the query ID.
-- Get running queries
SELECT pid, age(clock_timestamp(), query_start), usename, query
FROM pg_stat_activity
WHERE query != '<IDLE>' AND query NOT ILIKE '%pg_stat_activity%'
ORDER BY query_start desc;
-- Kill query
SELECT pg_terminate_backend(<pid>);
Additional context
This came up specifically in discussions around WordPress/openverse-catalog#549, since it is a longer-running DAG. However other DAGs (Finnish museums in particular) may also suffer from similar circumstances.
Resolution
- 🙋 I would be interested in resolving this bug.
Metadata
Assignees
Labels
Type
Projects
Status
✅ Done