Reindex resiliency

We want to make reindex resilient to node restarts and failures, so that a running reindex operation can continue across such events.

There are two primary problems to solve:
* Data node resiliency. Reindex relies on scroll queries which are not resilient.
* Coordinator node resiliency. Reindex runs on the host receiving the request and cannot survive if that node dies or is restarted.

Search resiliency:
- [ ] Order the search by seq_no and handle query failures by retrying from the last seq_no (inclusive)
- [ ] Support reindex from remote when the source version is 6.6+
- [ ] Add support for an alternative numeric ordering attribute, particularly useful for reindex from remote against a pre-6.5 source
- [ ] Back-off strategy on repeated failures
- [ ] Verify overhead of seq_no ordering
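
As a rough illustration of the first and fourth items above, the sketch below pages through documents ordered by seq_no, restarts a failed query from the last seq_no seen (inclusive), and backs off exponentially on repeated failures. It is not the Elasticsearch implementation: `resilient_scan`, `search_page`, and `TransientSearchError` are hypothetical names, and the sketch assumes seq_no values are unique within the source, which does not hold across shards.

```python
import time
from typing import Callable, Dict, Iterator, List

Doc = Dict[str, int]


class TransientSearchError(Exception):
    """Hypothetical stand-in for a retryable search failure."""


def resilient_scan(
    search_page: Callable[[int, int], List[Doc]],
    batch_size: int = 100,
    max_retries: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Doc]:
    """Yield docs in seq_no order, surviving transient query failures.

    search_page(from_seq_no, size) is assumed to return up to `size` docs
    with seq_no >= from_seq_no, sorted by seq_no ascending.
    """
    if batch_size < 2:
        # The restart point is inclusive, so every page after the first
        # repeats one doc; a page of size 1 could never make progress.
        raise ValueError("batch_size must be at least 2")
    last_seq_no = -1  # highest seq_no already yielded
    failures = 0      # consecutive failures of the current query
    while True:
        try:
            hits = search_page(max(last_seq_no, 0), batch_size)
        except TransientSearchError:
            failures += 1
            if failures > max_retries:
                raise
            # Exponential back-off: base, 2*base, 4*base, ...
            sleep(base_delay * 2 ** (failures - 1))
            continue
        failures = 0
        # Drop the doc(s) re-read because the restart point is inclusive.
        fresh = [h for h in hits if h["seq_no"] > last_seq_no]
        if not fresh:
            return
        for hit in fresh:
            last_seq_no = hit["seq_no"]
            yield hit
```

Because no scroll context is held, a retry only needs the last seq_no, which is also the piece of state a relocated task would need in order to resume.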

Coordinator node resiliency:
- [ ] POC to clarify this subject more (#43382)
- [ ] Decide on start reindex job action name
    - `indices:data/write/start_reindex`
    - `indices:admin/reindex/start_reindex`
    - `cluster:admin/reindex/start_reindex`
    - `indices:data/reindex/start_reindex`
- [ ] Decide on persistent reindex task name
- [ ] Evaluate how we want to do timeouts for waiting on initial task creation or reindex task completion
- [ ] Refactor common parts from data frames and roll-up
- [ ] Add reindex persistent task and remove it when done (#43382)
- [ ] Allocation of reindex persistent task (#43382)
- [ ] Store progress information periodically into .tasks index
- [ ] Resume from existing progress information when allocated to new node
- [ ] Make updates to persistent tasks resilient against master failovers
- [ ] Support async durability on destination, ensuring data in checkpoint is fsync'ed into destination
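
The "store progress" and "resume" items above could fit together roughly as in the following sketch. It uses an in-memory dictionary as a stand-in for the .tasks index and simulates a node failure. All names (`Checkpoint`, `InMemoryCheckpointStore`, `run_reindex`, `fail_after`) are hypothetical. The sketch assumes unique seq_no values and idempotent writes keyed by document id, and it does not model fsync or durability on the destination.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class Checkpoint:
    last_seq_no: int  # highest seq_no known to be copied
    docs_copied: int  # number of docs copied up to last_seq_no


class InMemoryCheckpointStore:
    """Hypothetical stand-in for progress documents in the .tasks index."""

    def __init__(self) -> None:
        self._docs: Dict[str, dict] = {}

    def save(self, task_id: str, cp: Checkpoint) -> None:
        self._docs[task_id] = asdict(cp)

    def load(self, task_id: str) -> Optional[Checkpoint]:
        raw = self._docs.get(task_id)
        return Checkpoint(**raw) if raw is not None else None


def run_reindex(
    task_id: str,
    source: List[dict],
    dest: Dict[str, dict],
    store: InMemoryCheckpointStore,
    checkpoint_every: int = 2,
    fail_after: Optional[int] = None,
) -> Checkpoint:
    """Copy source docs into dest, resuming from a stored checkpoint if any."""
    cp = store.load(task_id) or Checkpoint(last_seq_no=-1, docs_copied=0)
    handled = 0           # docs copied during this run
    since_checkpoint = 0  # docs copied since the last stored checkpoint
    for doc in sorted(source, key=lambda d: d["seq_no"]):
        if doc["seq_no"] <= cp.last_seq_no:
            continue  # already covered by the checkpoint
        if fail_after is not None and handled == fail_after:
            raise RuntimeError("simulated node failure")
        dest[doc["id"]] = doc  # idempotent: re-copying overwrites the same id
        handled += 1
        cp = Checkpoint(doc["seq_no"], cp.docs_copied + 1)
        since_checkpoint += 1
        if since_checkpoint == checkpoint_every:
            store.save(task_id, cp)
            since_checkpoint = 0
    store.save(task_id, cp)
    return cp
```

Documents copied after the last stored checkpoint are copied again on resume, so the checkpoint interval trades write overhead against the amount of repeated work after a failure.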

Slicing:
- [ ] Investigate having multiple in flight search and bulk requests as an alternative

Benchmarking:
- [ ] Compare Rally original indexing to reindex
- [ ] Overhead of scripting and ingest pipelines

Misc:
- [ ] Handle write failures by retrying when appropriate
- [ ] Refined error handling, filter out known/retryable errors
- [ ] HLRC support for new persistent task id.
- [ ] Examine if transport client in 7.x can call resilient reindex (workaround).
- [ ] Add serialization tests for get reindex request

Docs:
- [ ] Clarify how to use resilient reindex in reference docs (conflict handling, parameters)


Reindex resiliency #42612
