Description
Today we split the on-disk cluster metadata across many files: one file for the metadata of each index, plus one file for the global metadata and another for the manifest. Most metadata updates only touch a few of these files, but some must write them all. If a node holds a large number of indices then it's possible its disks are not fast enough to process a complete metadata update before timing out. In severe cases affecting master-eligible nodes this can prevent an election from succeeding.
We plan to change the format of on-disk metadata to reduce the number of writes needed during metadata updates. One option is a monolithic file containing the complete metadata, but this is inefficient in the common case where the metadata is mostly unchanged. Another option is to keep an append-only log of changes, but such a log must be compacted, which adds considerable complexity. However, we already have access to a very good storage mechanism with the right properties: Lucene! We will use a dedicated Lucene index on each master-eligible node and replace each individual file with a document in this index. Most metadata updates will need only a few writes, and Lucene's background merging will take care of compaction.
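As a rough illustration of why a document-per-index layout needs only a few writes for a typical update, here is a minimal sketch in Python. It does not use Lucene: `ToyMetadataStore` and its fields are hypothetical names, and an in-memory dict stands in for the dedicated index, with one document for the global metadata and one per index keyed by index UUID. Deciding what changed by comparing content is also an assumption made for the sketch, not a statement about how the real implementation detects changes.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ToyMetadataStore:
    # Hypothetical stand-in for the dedicated Lucene index: one document for
    # the global metadata and one document per index, keyed by index UUID.
    global_doc: Optional[dict] = None
    index_docs: Dict[str, dict] = field(default_factory=dict)
    writes: int = 0  # total number of document writes and deletes so far

    def apply(self, global_meta: dict, indices: Dict[str, dict]) -> int:
        """Persist a new cluster state; return how many documents were touched."""
        before = self.writes
        # Rewrite the global document only if it changed.
        if self.global_doc != global_meta:
            self.global_doc = dict(global_meta)
            self.writes += 1
        # Rewrite only the per-index documents that are new or changed.
        for uuid, meta in indices.items():
            if self.index_docs.get(uuid) != meta:
                self.index_docs[uuid] = dict(meta)
                self.writes += 1
        # Delete documents for indices that no longer exist.
        for uuid in set(self.index_docs) - set(indices):
            del self.index_docs[uuid]
            self.writes += 1
        return self.writes - before


store = ToyMetadataStore()
state = {f"index-{i}": {"version": 1} for i in range(1000)}
initial_writes = store.apply({"cluster_uuid": "example"}, state)

# A typical update touches a single index.
state["index-7"] = {"version": 2}
incremental_writes = store.apply({"cluster_uuid": "example"}, state)
```

Under these assumptions the first call writes all 1001 documents, while the second writes only the document for `index-7`. In the real design the compaction of superseded documents would be left to Lucene's background merging rather than handled by the caller.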
On master-ineligible nodes we can keep the existing format and still reduce the writes required, because we can make better use of the fact that master-ineligible nodes only write committed metadata and therefore the version numbers are trustworthy. It may also be possible to avoid writing index metadata during cluster state application entirely and defer it until later.
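The idea of trusting version numbers on master-ineligible nodes can be sketched as follows. This is only an illustration under stated assumptions: `plan_index_metadata_writes` is a hypothetical helper, and both arguments are assumed to map index UUIDs to metadata version numbers. Because only committed metadata is written on these nodes, the sketch treats an equal version as proof that the file on disk is already up to date and skips it without comparing content.

```python
from typing import Dict, List, Tuple


def plan_index_metadata_writes(
    on_disk: Dict[str, int], committed: Dict[str, int]
) -> Tuple[List[str], List[str]]:
    """Decide which per-index metadata files to rewrite or delete.

    on_disk:   index UUID -> version of the metadata file currently on disk
    committed: index UUID -> version in the newly committed cluster state
    (Both shapes are assumptions made for this sketch.)
    """
    # Rewrite a file only when its version differs or the file is missing.
    to_write = sorted(
        uuid for uuid, version in committed.items() if on_disk.get(uuid) != version
    )
    # Remove files for indices that are no longer in the cluster state.
    to_delete = sorted(uuid for uuid in on_disk if uuid not in committed)
    return to_write, to_delete


to_write, to_delete = plan_index_metadata_writes(
    on_disk={"a": 1, "b": 2, "d": 5},
    committed={"a": 1, "b": 3, "c": 1},
)
```

Here `a` is skipped because its version is unchanged, `b` and `c` are written, and `d` is deleted.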
- Introduce Lucene-based storage mechanism for metadata on master-eligible nodes (Introduce Lucene-based metadata persistence #48733)
- Stop writing file-based metadata for indices that aren't assigned to the node (Remove per-index metadata without assigned shards #49234)
- Deal with node repurposing (Add command-line tool support for Lucene-based metadata storage #50179)
- BWC: upgrading from older storage mechanisms, dealing with cleanup and intermediate states left by failures
- BWC: upgrading from 7.x to 8.x and dealing with the change in the Lucene version
- (nice-to-have) Move metadata persistence on master-ineligible nodes off the critical path for cluster state application (Write CS asynchronously on data-only nodes #50782)
- Reimplement unsafe bootstrapping & cluster detachment (including suppressed CoordinatorTests#testCannotJoinClusterWithDifferentUUID test case) (Add command-line tool support for Lucene-based metadata storage #50179)
- Implement rescue tool to deal with broken settings/customs (Remove persistent cluster settings tool #50694, Remove custom metadata tool #50813)
- Implement rescue tool for when the node ID file is lost (ref. Introduce Lucene-based metadata persistence #48733 (comment)). Replaced by folding node metadata into new storage and using that as authoritative source: Fold node metadata into new node storage #50741
- Expose information on performance (warn on slow persistence, track time spent in background merges, number of segments, index size etc.) (Warn on slow metadata performance #50956)
Later:
- Investigate alternative merging strategies -- is there a benefit in merging in the background after a commit rather than doing it inline while flushing?
- Investigate the performance of duplicated indexing across multiple data paths and contemplate alternatives (ref. Introduce Lucene-based metadata persistence #48733 (comment))
- Optimize file-based metadata storage to trust metadata versions
- Implement rescue tool for when global metadata document is missing or when there are duplicated docs (ref. Introduce Lucene-based metadata persistence #48733 (comment))