Description
Context
In a recent conversation about unplanned EC2 instance restarts and how they might affect our infrastructure, we began discussing how our current DAGs might handle a random dropout. Here's that conversation for posterity:
From @sarayourfriend:
The main one that comes to mind is Airflow. The big question (from Sara) is what the risks of an Airflow instance fully shutting down at a random time are. For example, what happens if a data refresh is shut down in the middle? Does the data refresh DAG track the progress and state of the indexer workers/ingestion server in such a way that it can pick it back up if it's restarted in the middle? What about the staging database refresh? That DAG mostly uses AWS operators to implement the full cycle. Is it safe if interrupted? How about other DAGs that interact with infrastructure pieces, like the catalog DB snapshot DAG? Does the batched update DAG, if cancelled in the middle of an update task, track that fact, and is it able to recover? Maybe we need to make sure that the batched update DAG's update queries are written in an idempotent way. Existing ones may be (we should audit and confirm to be certain), but if it could be an issue then we should add some kind of safeguard to ensure it (e.g., transaction handling if we're not already doing that, some way for the DAG to know exactly what part of the update task it failed at, where it's safe to automatically recover, and where manual intervention might be needed, etc.). Multiply this line of questioning by all critical DAGs. Most should be okay, by the nature of how Airflow is designed to work, but any long-running individual tasks may be problematic if they do not persist state outside of RAM.
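Regarding the last point about long-running tasks holding state only in RAM, here is a minimal sketch of one possible safeguard: checkpointing progress to an Airflow Variable so that a retried task can resume where it left off. The variable key, batch helper, and sizes are hypothetical, not existing catalog code.

```python
from airflow.decorators import task
from airflow.models import Variable


def run_one_batch(offset: int, batch_size: int) -> None:
    # Placeholder for the real per-batch work (e.g., a bounded UPDATE query).
    ...


@task(retries=3)
def process_batches(total_rows: int, batch_size: int = 10_000) -> None:
    """Process rows in batches, checkpointing progress outside of RAM.

    If the worker is killed mid-run, the retried task reads the last
    completed offset from the Airflow metadata DB and resumes from there
    instead of starting over.
    """
    checkpoint_key = "example_dag_last_completed_offset"  # hypothetical key
    start = int(Variable.get(checkpoint_key, default_var=0))

    for offset in range(start, total_rows, batch_size):
        run_one_batch(offset, batch_size)
        # Persist progress after each batch so a restart loses at most one batch.
        Variable.set(checkpoint_key, offset + batch_size)

    # Reset the checkpoint so the next DAG run starts from the beginning.
    Variable.set(checkpoint_key, 0)
```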
From @AetherUnbound:
Does the data refresh DAG track the progress and state of the indexer workers/ingestion server in such a way that it can pick it back up if it's restarted in the middle?
Currently, no. Usually the easiest way to address this is just to start the refresh process over again. This will certainly change once the ingestion server removal project is complete. With that, we should be able to restart an individual step that failed, rather than the entire refresh.
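For context on what restarting an individual step could look like, here is a minimal sketch of a refresh split into discrete tasks (the DAG and task names are hypothetical placeholders, not the actual data refresh DAG): when each step is its own task, clearing only the failed task re-runs it without repeating the steps before it.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
def example_data_refresh():
    """Hypothetical refresh split into individually restartable steps."""

    @task(retries=2)
    def copy_data():
        ...  # placeholder for the real copy step

    @task(retries=2)
    def reindex():
        ...  # placeholder for the real indexing step

    @task(retries=2)
    def promote_index():
        ...  # placeholder for promoting the new index

    # Clearing any one of these tasks re-runs only that step and its downstream.
    copy_data() >> reindex() >> promote_index()


example_data_refresh()
```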
What about the staging database refresh?
You're right that most of this DAG uses AWS operators to perform the steps required. The DAG is supposed to be able to recover on failure, but that recovery can't happen if the orchestrator itself is down! That said, in this case what would likely happen is that the task in Airflow would "fail" while the actual operation would succeed. Example: Airflow issues a command to spin up a new database based on a snapshot, and is waiting for that database to be live. The EC2 instance is replaced, which kills this task. The database still comes up because AWS is still processing the command; the task just failed to finish waiting. In this case we'd want to verify the operation itself had actually completed, then manually mark the task as succeeded so the rest of the DAG could continue.
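One way to make that recovery automatic rather than manual would be to keep "issue the command" and "wait for it to finish" as separate tasks, so the waiting task only polls and is safe to retry after a worker loss. A minimal sketch using boto3 directly (the instance identifier, timeout, and polling interval are hypothetical; the real DAG uses the AWS provider operators):

```python
import time

import boto3
from airflow.decorators import task


@task(retries=3)
def wait_for_db_available(db_instance_id: str, timeout: int = 3600) -> None:
    """Poll RDS until the restored instance is available.

    This task issues no commands of its own, so if the Airflow worker is
    replaced mid-wait, retrying (or clearing) it simply resumes polling and
    succeeds once AWS finishes the work it was already doing.
    """
    rds = boto3.client("rds")
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        response = rds.describe_db_instances(DBInstanceIdentifier=db_instance_id)
        status = response["DBInstances"][0]["DBInstanceStatus"]
        if status == "available":
            return
        time.sleep(60)
    raise TimeoutError(f"{db_instance_id} not available after {timeout} seconds")
```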
Does the batched update DAG, if cancelled in the middle of an update task, track that fact, and is it able to recover?
I believe this exact case has actually already happened, and it should be trivial to restart the DAG from the point it reached in updating the tables.
The Postgres operations may behave much like the AWS ones described above: once the command has been issued, the operation might complete just fine while the Airflow task itself shows as failed. This one might need some experimentation, but ideally those operations would be idempotent too (as you mention), so that we could simply restart them in the case where Airflow goes down.
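On the idempotency point, here is a minimal sketch of the kind of query shape that makes a re-run safe, assuming a hypothetical cutoff timestamp and an `updated_on` column. This is illustrative only, not the actual batched update DAG implementation, and the table, columns, and connection ID are placeholders.

```python
from airflow.decorators import task
from airflow.providers.postgres.hooks.postgres import PostgresHook

# The WHERE clause excludes rows already touched after the cutoff, so
# re-running the same batch after an interruption updates nothing twice.
IDEMPOTENT_BATCH_UPDATE = """
UPDATE image
SET license = 'cc0', updated_on = now()
WHERE identifier IN (
    SELECT identifier FROM image
    WHERE license = 'pdm'
      AND updated_on < %(cutoff)s
    LIMIT %(batch_size)s
)
"""


@task(retries=3)
def run_update_batch(cutoff: str, batch_size: int = 10_000) -> None:
    """Run one batch of a hypothetical license-normalization update.

    Each call runs as a single transaction, and the predicate on updated_on
    means an interrupted or repeated run only picks up rows that have not
    been updated yet.
    """
    hook = PostgresHook(postgres_conn_id="postgres_openledger")  # hypothetical conn id
    hook.run(
        IDEMPOTENT_BATCH_UPDATE,
        parameters={"cutoff": cutoff, "batch_size": batch_size},
    )
```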
Description
We should audit our current set of DAGs to make sure their behavior is consistent with what's described above, and that all DAGs (critical or not) can be restarted easily if the worker is terminated during the course of a run.
Metadata
Status: 📋 Backlog