Reindex resiliency #42612

Open
@henningandersen

Description

We want to make reindex resilient to node restarts and failures, such that reindex can continue to run across such events.

There are two primary problems to solve:

  • Data node resiliency. Reindex relies on scroll queries, which are not resilient to node restarts or failures.
  • Coordinator node resiliency. Reindex runs on the host receiving the request and cannot survive if that node dies or is restarted.

Search resiliency

  • Order the search by seq_no and handle query failures by retrying from the last seq_no seen (inclusive)
  • Support reindex from remote when the source version is 6.6+
  • Add support for an alternative numeric ordering attribute, particularly useful for reindex from remote against a pre-6.5 source
  • Back-off strategy on repeated failures
  • Verify overhead of seq_no ordering
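
The retry-from-last-seq_no idea with back-off can be sketched as follows. This is a minimal in-memory illustration, not the Elasticsearch implementation: `search_page`, `TransientSearchError`, and `resilient_scan` are hypothetical names, and the sketch assumes seq_no values are unique in the source (as within a single shard), so the one re-fetched document can be dropped by comparing seq_no.

```python
import time


class TransientSearchError(Exception):
    """Hypothetical stand-in for a retryable search failure."""


def resilient_scan(search_page, batch_size=2, max_retries=5,
                   base_delay=0.01, sleep=time.sleep):
    """Yield documents in seq_no order, resuming after search failures.

    search_page(min_seq_no, size) is a hypothetical callable that returns up
    to `size` docs with seq_no >= min_seq_no, sorted ascending by seq_no.
    Assumes seq_no values are unique across the source.
    """
    last_seq_no = None  # highest seq_no already handed to the caller
    failures = 0        # consecutive failures, drives the back-off
    while True:
        # Inclusive restart: query again from the last seq_no seen.
        start = 0 if last_seq_no is None else last_seq_no
        try:
            # Ask for one extra doc to make up for the re-fetched one.
            page = search_page(start, batch_size + 1)
        except TransientSearchError:
            failures += 1
            if failures > max_retries:
                raise
            # Exponential back-off on repeated failures.
            sleep(base_delay * 2 ** (failures - 1))
            continue
        failures = 0
        # Drop the document that was already processed before the retry.
        fresh = [d for d in page
                 if last_seq_no is None or d["seq_no"] > last_seq_no]
        if not fresh:
            return
        for doc in fresh:
            yield doc
            last_seq_no = doc["seq_no"]
```

Because no scroll context is kept, a failed query loses nothing: the next attempt only needs the last seq_no. If several documents can share a seq_no (for example across shards), the filter above would have to compare document ids as well.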

Coordinator node resiliency:

  • POC to clarify this subject more (Make reindexing managed by a persistent task #43382)
  • Decide on start reindex job action name
    • indices:data/write/start_reindex
    • indices:admin/reindex/start_reindex
    • cluster:admin/reindex/start_reindex
    • indices:data/reindex/start_reindex
  • Decide on persistent reindex task name
  • Evaluate how we want to do timeouts for waiting on initial task creation or reindex task completion
  • Refactor common parts from data frames and roll-up
  • Add reindex persistent task and remove it when done (Make reindexing managed by a persistent task #43382)
  • Allocation of reindex persistent task (Make reindexing managed by a persistent task #43382)
  • Store progress information periodically into .tasks index
  • Resume from existing progress information when allocated to new node
  • Make updates to persistent tasks resilient against master failovers
  • Support async durability on destination, ensuring data in checkpoint is fsync'ed into destination
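
The checkpoint-and-resume items above could look roughly like this. It is a sketch under stated assumptions: `CheckpointStore` is an in-memory stand-in for whatever durable store is chosen (the list mentions the .tasks index), `run_reindex` and `crash_after` are hypothetical, and the destination is a plain dict keyed by document id so that rewriting a document after a resume is harmless.

```python
class CheckpointStore:
    """In-memory stand-in for a durable progress store (hypothetical)."""

    def __init__(self):
        self._data = {}

    def save(self, task_id, seq_no):
        self._data[task_id] = seq_no

    def load(self, task_id):
        return self._data.get(task_id)


def run_reindex(task_id, source, dest, store,
                checkpoint_every=2, crash_after=None):
    """Copy docs from source to dest in seq_no order, checkpointing progress.

    source: list of dicts sorted ascending by seq_no.
    dest: dict keyed by doc id, so repeated writes are idempotent.
    crash_after: simulate a node failure after this many writes in this run.
    Returns the number of documents written in this run.
    """
    resume_from = store.load(task_id)  # None on the first allocation
    written = 0
    for doc in source:
        # Skip everything covered by the last stored checkpoint.
        if resume_from is not None and doc["seq_no"] <= resume_from:
            continue
        if crash_after is not None and written >= crash_after:
            raise RuntimeError("simulated node failure")
        dest[doc["id"]] = doc
        written += 1
        # Store progress periodically rather than on every document.
        if written % checkpoint_every == 0:
            store.save(task_id, doc["seq_no"])
    if source:
        store.save(task_id, source[-1]["seq_no"])
    return written
```

Work done after the last checkpoint is repeated on resume, which is why the writes need to be idempotent. In a real system the checkpoint should only be stored once the corresponding writes are durable on the destination, as the last item in the list notes.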

Slicing:

  • Investigate having multiple in flight search and bulk requests as an alternative

Benchmarking:

  • Compare rally original indexing to reindex
  • Overhead of scripting and ingest pipelines

Misc:

  • Handle write failures by retrying when appropriate
  • Refined error handling, filter out known/retryable errors
  • HLRC support for new persistent task id.
  • Examine if transport client in 7.x can call resilient reindex (workaround).
  • Add serialization tests for get reindex request
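
For the write-failure items, one possible shape is to split bulk item failures into retryable and permanent ones by status code. Everything here is illustrative: `send_bulk` and `bulk_with_retry` are hypothetical, and treating 429 and 503 as the retryable set is an assumption, not a decision recorded in this issue.

```python
# Assumption: statuses that signal temporary overload or unavailability.
RETRYABLE_STATUSES = frozenset({429, 503})


def bulk_with_retry(send_bulk, items, max_attempts=3):
    """Send items, retrying only those that failed with a retryable status.

    send_bulk(items) is a hypothetical callable returning a list of
    (item, http_status) pairs, one per item sent.
    Returns (permanent_failures, unresolved), where unresolved holds items
    that were still failing with a retryable status after max_attempts.
    """
    pending = list(items)
    permanent_failures = []
    for _ in range(max_attempts):
        if not pending:
            break
        retry = []
        for item, status in send_bulk(pending):
            if status < 300:
                continue  # written successfully
            if status in RETRYABLE_STATUSES:
                retry.append(item)
            else:
                # Non-retryable (e.g. a conflict): report it, do not resend.
                permanent_failures.append((item, status))
        pending = retry
    return permanent_failures, pending
```

A delay between attempts is left out to keep the sketch short; in practice the retries would be spaced out rather than sent back to back.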

Docs

  • Clarify how to use resilient reindex in reference docs (conflict handling, parameters)

Metadata

Labels

  • :Distributed Indexing/Reindex (Issues relating to reindex that are not caused by issues further down)
  • Meta
  • Team:Distributed (Obsolete) (Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.)
