Implement local distributed reindexing

## Problem


Blocked by #4146, #4147, https://github.com/WordPress/openverse-infrastructure/issues/849

This issue tracks adding the orchestration steps for the distributed reindex to the new data refresh DAGs.

## Description



In this step we will add tasks to the data refresh DAGs to orchestrate the distributed reindex. At the end of this step, it will be possible to run a distributed reindex locally. It cannot be run in production yet, because the infrastructure work to create the ASGs is not complete. The logic for the following tasks can all be refactored from [distributed_reindex_scheduler.py](https://github.com/WordPress/openverse/blob/main/ingestion_server/ingestion_server/distributed_reindex_scheduler.py).

- Use [describe_auto_scaling_groups](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/autoscaling/client/describe_auto_scaling_groups.html) and filter by tags to select the appropriate ASG for the desired environment. (Skips in local env.)
- Use [set_desired_capacity](https://boto3.amazonaws.com/v1/documentation/api/1.26.86/reference/services/autoscaling/client/set_desired_capacity.html) to increase the desired capacity of the ASG to the desired number of workers, depending on the environment. This will cause the ASG to begin spinning up instances. (Skips in local env.)
- Use [describe_auto_scaling_groups](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/autoscaling/client/describe_auto_scaling_groups.html) to poll the ASG until all instances have been started, and get the EC2 instance IDs. (Skips in local env.)
- Use dynamic task mapping to distribute reindexing across the indexer workers by first calculating start and end indices that will split the records in the media table into even portions, depending on the number of workers available in the given environment. Then:
  - POST to each worker’s `reindexing_task` endpoint the `start_index` and `end_index` it should handle
  - Use a Sensor to ping the worker’s `task/{task_id}` endpoint until the task is complete, logging the progress as it goes
- Use [terminate_instance_in_auto_scaling_group](https://boto3.amazonaws.com/v1/documentation/api/1.26.86/reference/services/autoscaling/client/terminate_instance_in_auto_scaling_group.html) to terminate the instance. Make sure to set `ShouldDecrementDesiredCapacity` to `True` to ensure that the ASG does not try to replace the instance. This task should use the [NONE_SKIPPED TriggerRule](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html#trigger-rules) to ensure that the instances are terminated, even if there are upstream failures. (Skips in local env.)
- Finally, after all tasks have finished (regardless of success/failure), we should have a cleanup task that calls `set_desired_capacity` to set the desired capacity of the ASG to 0. Generally this should be a no-op, but if an instance crashes during reindexing (rather than simply failing during reindexing), the ASG will spin up a replacement, and Airflow will not automatically clean it up. This task ensures that any dangling instances are terminated.
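
As a rough illustration of two of the steps above, here is a minimal sketch of the index-splitting calculation and the teardown calls. It is not the code from `distributed_reindex_scheduler.py`. The function names (`get_index_ranges`, `terminate_worker`, `scale_to_zero`) are hypothetical, and the sketch assumes record indices start at 0 and that `end_index` is exclusive, neither of which is specified in this issue. The `asg_client` argument is assumed to be a boto3 `autoscaling` client; it is passed in so the helpers can be exercised without AWS access.

```python
from typing import Any


def get_index_ranges(record_count: int, worker_count: int) -> list[tuple[int, int]]:
    """Split [0, record_count) into contiguous, near-even (start, end) ranges.

    Assumption: indices are 0-based and `end` is exclusive.
    """
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    base, remainder = divmod(record_count, worker_count)
    ranges = []
    start = 0
    for worker in range(worker_count):
        # The first `remainder` workers each take one extra record.
        end = start + base + (1 if worker < remainder else 0)
        ranges.append((start, end))
        start = end
    return ranges


def terminate_worker(asg_client: Any, instance_id: str) -> None:
    """Terminate one indexer worker without letting the ASG replace it."""
    asg_client.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id,
        # Lower the desired capacity so no replacement instance is launched.
        ShouldDecrementDesiredCapacity=True,
    )


def scale_to_zero(asg_client: Any, asg_name: str) -> None:
    """Final cleanup: request zero instances so dangling workers are removed."""
    asg_client.set_desired_capacity(
        AutoScalingGroupName=asg_name,
        DesiredCapacity=0,
    )
```

In a DAG, the list returned by `get_index_ranges` could be fed to a dynamically mapped task (for example via `.expand()`), so that each mapped task instance handles one worker's `start_index` and `end_index`.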

## Additional context



See [this section](https://docs.openverse.org/projects/proposals/ingestion_server_removal/20240328-implementation_plan_ingestion_server_removal.html#implement-distributed-reindexing-locally) of the IP.

Implement local distributed reindexing #4148

stacimc opened on Apr 17, 2024
