Skip to content

Ensure PG backend query is terminated on failure #1455

Closed
WordPress/openverse-catalog
#717

Description

Problem

This issue applies to tasks on Airflow that interact with postgres (e.g. PythonOperators or Postgres*Operators).

Presently when a failure or execution timeout occurs, Airflow will raise an AirflowException within the running task. For most cases operating on the python level, this is sufficient to stop the task. In cases where the task does not stop when asked kindly, Airflow will then send a SIGKILL to the task and stop it.

When this happens to a task that involves postgres queries, the python script running on Airflow will stop but the postgres query running on the postgres backend will not. In some cases this can even cause the task to look like it's still running long after its execution timeout, because the Airflow task doesn't respond at all to the signals Airflow is sending and hangs until the query is complete (at least, to my recollection, we may need to confirm this). In cases where Airflow is able to advance, there are knock-on effects to the query still running.

Example: We had one instance where a load was occurring that had passed its execution timeout. I don't remember if we manually failed that task or if it failed on its own, but either way the DAG continued onto the next step which was to delete the intermediate load table. However, since the load to said intermediate table was still running on the postgres backend, the delete of the table hung until the former completed (or in this case, until we explicitly killed that query on the postgres backend).

Description

We should add a custom on_failure_callback (or similar cleanup method) to certain postgres steps that we know will be longer running which will actually terminate the postgres query on the backend upon failure. This will involve looking up the query that is being run, then issuing a pg_terminate_backend on the query ID.

-- Get running queries
SELECT pid, age(clock_timestamp(), query_start), usename, query 
FROM pg_stat_activity 
WHERE query != '<IDLE>' AND query NOT ILIKE '%pg_stat_activity%' 
ORDER BY query_start desc;

-- Kill query
SELECT pg_terminate_backend(<pid>);

Additional context

This came up specifically in discussions around WordPress/openverse-catalog#549, since it is a longer-running DAG. However other DAGs (Finnish museums in particular) may also suffer from similar circumstances.

Resolution

  • 🙋 I would be interested in resolving this bug.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

Type

No type

Projects

  • Status

    ✅ Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions