Reduce number of writes needed for metadata updates

Today we split the on-disk cluster metadata across many files: one file for the metadata of each index, plus one file for the global metadata and another for the manifest. Most metadata updates only touch a few of these files, but some must write them all. If a node holds a large number of indices, its disks may not be fast enough to complete a full metadata update before it times out. In severe cases affecting master-eligible nodes this can prevent an election from succeeding.

We plan to change the format of on-disk metadata to reduce the number of writes needed during metadata updates. One option is a monolithic file containing the complete metadata, but this is inefficient in the common case that the metadata is mostly unchanged. Another option is to keep an append-only log of changes, but such a log must be compacted, which adds considerable complexity. However, we already have access to a very good storage mechanism with the right properties: Lucene! We will use a dedicated Lucene index on each master-eligible node and replace each individual file with a document in this index. Most metadata updates will need only a few writes, and Lucene's background merging will take care of compaction.
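To illustrate why a document-per-index layout needs only a few writes for a typical update, here is a minimal in-memory sketch. It is not the Lucene-based implementation and does not use any Elasticsearch or Lucene API: the class `MetadataDocStore`, its `update` method, and the string-valued "serialized metadata" are all hypothetical names chosen for this example. It only models the write-counting behaviour, assuming one document per index plus one for the global metadata.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/**
 * Hypothetical in-memory model of a document-per-index metadata store.
 * This is an illustration only, not Elasticsearch or Lucene code.
 */
class MetadataDocStore {
    // Key: index UUID (or "global" for the global metadata).
    // Value: the serialized metadata for that key.
    private final Map<String, String> docs = new HashMap<>();

    /**
     * Brings the store in line with newMetadata, touching only documents
     * that were added, changed, or removed.
     *
     * @return the number of document writes (upserts plus deletes) performed
     */
    long update(Map<String, String> newMetadata) {
        long writes = 0;

        // Upsert documents whose content differs from what is stored.
        for (Map.Entry<String, String> entry : newMetadata.entrySet()) {
            if (!Objects.equals(docs.get(entry.getKey()), entry.getValue())) {
                docs.put(entry.getKey(), entry.getValue());
                writes++;
            }
        }

        // Delete documents for indices that no longer exist.
        int sizeBefore = docs.size();
        docs.keySet().retainAll(newMetadata.keySet());
        writes += sizeBefore - docs.size();

        return writes;
    }
}
```

In this model, the first `update` for 1000 indices plus the global metadata performs 1001 writes, while a later update that changes a single index performs one. A monolithic file would rewrite everything both times.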

On master-ineligible nodes we can keep the existing format and still reduce the writes required, because we can make better use of the fact that master-ineligible nodes only write committed metadata and therefore the version numbers are trustworthy. It may also be possible to avoid writing index metadata during cluster state application entirely and defer it until later.
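The idea of trusting version numbers can be sketched as follows. This is a hedged illustration rather than the actual file-based writer: `VersionTrustingWriter` and `writeIfVersionChanged` are hypothetical names, and the sketch assumes that every metadata version it sees is committed, so an equal version implies equal content and the write can be skipped without comparing the content itself.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of a writer that skips persistence when the
 * metadata version is unchanged. Only valid if all versions seen are
 * committed; not Elasticsearch code.
 */
class VersionTrustingWriter {
    // Last persisted metadata version, keyed by index UUID.
    private final Map<String, Long> persistedVersions = new HashMap<>();
    private int writes = 0;

    /**
     * Records a write for the given index unless the same version was
     * already persisted.
     *
     * @return true if a write was performed, false if it was skipped
     */
    boolean writeIfVersionChanged(String indexUuid, long version) {
        Long previous = persistedVersions.get(indexUuid);
        if (previous != null && previous == version) {
            return false; // same committed version: content is assumed identical
        }
        persistedVersions.put(indexUuid, version);
        writes++; // stands in for writing the per-index metadata file
        return true;
    }

    int writeCount() {
        return writes;
    }
}
```

A master-eligible node cannot take this shortcut in the same way, because it may also persist metadata that has been accepted but not yet committed.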

- [x] Introduce Lucene-based storage mechanism for metadata on master-eligible nodes (#48733)
- [x] Stop writing file-based metadata for indices that aren't assigned to the node (#49234)
- [x] Deal with node repurposing (#50179)
- [x] BWC: upgrading from older storage mechanisms, dealing with cleanup and intermediate states left by failures
- [x] BWC: upgrading from 7.x to 8.x and dealing with the change in the Lucene version
- [x] (nice-to-have) Move metadata persistence on master-ineligible nodes off the critical path for cluster state application (#50782)
- [x] Reimplement unsafe bootstrapping & cluster detachment (including suppressed `CoordinatorTests#testCannotJoinClusterWithDifferentUUID` test case) (#50179)
- [x] Implement rescue tool to deal with broken settings/customs (#50694, #50813)
- [x] ~~Implement rescue tool for when the node ID file is lost (ref. https://github.com/elastic/elasticsearch/pull/48733#discussion_r345240833)~~ Replaced by folding node metadata into new storage and using that as authoritative source: #50741
- [x] Expose information on performance (warn on slow persistence, track time spent in background merges, number of segments, index size etc.) (#50956)

Later:

- [ ] Investigate alternative merging strategies -- is there a benefit in merging in the background after a commit rather than doing it inline while flushing?
- [ ] Investigate the performance of duplicated indexing across multiple data paths and contemplate alternatives (ref. https://github.com/elastic/elasticsearch/pull/48733#discussion_r343598526)
- [ ] Optimize file-based metadata storage to trust metadata versions
- [ ] Implement rescue tool for when global metadata document is missing or when there are duplicated docs (ref. https://github.com/elastic/elasticsearch/pull/48733#discussion_r345247959)

Reduce number of writes needed for metadata updates #48701
