Description
We want to make reindex resilient to node restarts and failures, so that a reindex operation can continue to run across such events.
There are two primary problems to solve:
- Data node resiliency. Reindex relies on scroll queries which are not resilient.
- Coordinator node resiliency. Reindex runs on the host receiving the request and cannot survive if that node dies or is restarted.
Search resiliency:
- Order the search by seq_no and handle query failures by retrying from the last seq_no (inclusive)
- Support reindex from remote when the source version is 6.6+
- Add support for an alternative numeric ordering attribute, particularly useful for reindex from remote against a pre-6.5 source
- Back-off strategy on repeated failures
- Verify overhead of seq_no ordering
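The retry-from-last-seq_no and back-off items above can be sketched roughly as follows. This is a minimal illustration, not the planned implementation: `search`, `index_batch`, `TransientSearchError`, and all parameters are hypothetical stand-ins. It assumes the search returns documents with `seq_no >= from_seq_no` in ascending `seq_no` order. Because sequence numbers are assigned per shard, several documents can share one value, which is why the retry is inclusive and the ids already copied at the last value are tracked.

```python
import time


class TransientSearchError(Exception):
    """Hypothetical stand-in for a retryable search failure."""


def reindex_by_seq_no(search, index_batch, batch_size=100, max_retries=5,
                      base_delay=0.5, sleep=time.sleep):
    """Copy documents in seq_no order, resuming from the last seq_no on failure.

    search(from_seq_no, size) is assumed to return a list of dicts with
    "_id" and "seq_no", sorted by seq_no, covering seq_no >= from_seq_no.
    Returns the number of documents copied.
    """
    last_seq_no = 0
    seen_at_last = set()  # ids already copied whose seq_no == last_seq_no
    failures = 0
    copied = 0
    while True:
        try:
            # Ask for extra hits to make room for the ones we will skip.
            hits = search(last_seq_no, batch_size + len(seen_at_last))
        except TransientSearchError:
            failures += 1
            if failures > max_retries:
                raise
            # Exponential back-off; state is unchanged, so the next attempt
            # restarts from last_seq_no (inclusive).
            sleep(base_delay * 2 ** (failures - 1))
            continue
        failures = 0
        fresh = [h for h in hits if h["_id"] not in seen_at_last]
        if not fresh:
            return copied
        index_batch(fresh)
        copied += len(fresh)
        new_last = fresh[-1]["seq_no"]
        if new_last != last_seq_no:
            last_seq_no = new_last
            seen_at_last = set()
        seen_at_last.update(h["_id"] for h in fresh if h["seq_no"] == last_seq_no)
```

Tracking the ids seen at the last sequence number avoids rewriting documents after a retry. An alternative would be to rely on idempotent writes to the destination and accept the duplicates.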
Coordinator node resiliency:
- POC to clarify this subject more (Make reindexing managed by a persistent task #43382)
- Decide on start reindex job action name
  - indices:data/write/start_reindex
  - indices:admin/reindex/start_reindex
  - cluster:admin/reindex/start_reindex
  - indices:data/reindex/start_reindex
- Decide on persistent reindex task name
- Evaluate how we want to do timeouts for waiting on initial task creation or reindex task completion
- Refactor common parts from data frames and roll-up
- Add reindex persistent task and remove it when done (Make reindexing managed by a persistent task #43382)
- Allocation of reindex persistent task (Make reindexing managed by a persistent task #43382)
- Store progress information periodically into .tasks index
- Resume from existing progress information when allocated to new node
- Make updates to persistent tasks resilient against master failovers
- Support async durability on destination, ensuring data in checkpoint is fsync'ed into destination
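The progress-tracking and resume items above could look roughly like the following sketch. Everything here is an assumption made for illustration: `InMemoryCheckpointStore` stands in for progress documents in the .tasks index, `flush` stands in for fsync'ing the destination, and `fail_after` only simulates a node dying. For simplicity it assumes a single shard with unique, ascending `seq_no` values and idempotent writes keyed by `_id`.

```python
class InMemoryCheckpointStore:
    """Hypothetical stand-in for progress documents stored in the .tasks index."""

    def __init__(self):
        self._docs = {}

    def load(self, task_id):
        # A task with no stored progress starts from seq_no 0.
        return self._docs.get(task_id, {"last_seq_no": 0})["last_seq_no"]

    def save(self, task_id, last_seq_no):
        self._docs[task_id] = {"last_seq_no": last_seq_no}


def run_reindex(task_id, source, dest, store, checkpoint_every=2,
                flush=lambda: None, fail_after=None):
    """Copy docs from source to dest, checkpointing progress periodically.

    Resumes from the stored checkpoint (inclusive), so a task reallocated to
    another node rewrites at most the documents since the last checkpoint.
    Returns the number of documents written in this run.
    """
    start = store.load(task_id)
    processed = 0
    last_seq_no = None
    for doc in source:
        if doc["seq_no"] < start:
            continue  # already covered by an earlier checkpoint
        if fail_after is not None and processed == fail_after:
            raise RuntimeError("simulated node failure")
        dest[doc["_id"]] = doc  # idempotent write keyed by id
        processed += 1
        last_seq_no = doc["seq_no"]
        if processed % checkpoint_every == 0:
            flush()  # make destination writes durable before recording progress
            store.save(task_id, last_seq_no)
    if last_seq_no is not None:
        flush()
        store.save(task_id, last_seq_no)
    return processed
```

Calling `flush` before `store.save` reflects the ordering the async-durability item asks for: a checkpoint should never claim progress that is not yet durable in the destination.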
Slicing:
- Investigate having multiple in flight search and bulk requests as an alternative
Benchmarking:
- Compare original indexing to reindex using Rally
- Overhead of scripting and ingest pipelines
Misc:
- Handle write failures by retrying when appropriate
- Refined error handling, filter out known/retryable errors
- HLRC support for the new persistent task ID
- Examine whether the transport client in 7.x can call resilient reindex (workaround)
- Add serialization tests for get reindex request
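For the write-failure and error-filtering items in this list, one possible shape is to split failed bulk items into retryable and fatal groups by status code. This is only a sketch: the helper name and the set of retryable statuses (429 for rejected execution, 503 for unavailability) are assumptions, not a decided policy. The item layout follows the bulk response format, where each item has a single key naming the operation.

```python
# Assumption: which statuses count as retryable is a policy choice, not settled here.
RETRYABLE_STATUSES = frozenset({429, 503})


def split_bulk_failures(items):
    """Return (retryable_ids, fatal_ids) for the failed items of a bulk response."""
    retryable, fatal = [], []
    for item in items:
        # Each item has exactly one key: the operation type (index, create, ...).
        (_op, result), = item.items()
        status = result.get("status", 200)
        if status < 300:
            continue  # successful write
        if status in RETRYABLE_STATUSES:
            retryable.append(result["_id"])
        else:
            fatal.append(result["_id"])
    return retryable, fatal
```

Retryable items would be resent after a back-off, while fatal ones (for example a 409 version conflict) would be reported according to the conflict-handling settings of the request.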
Docs:
- Clarify how to use resilient reindex in reference docs (conflict handling, parameters)