Description
Dangling indices are indices that exist on disk on one or more nodes but which do not currently exist in the cluster state. They arise in a number of situations, such as:
- A user overflows the index graveyard by deleting more than 500 indices while a node is offline and then the node rejoins the cluster
- A node (unsafely) moves from one cluster to another, perhaps because the original cluster lost all its master nodes
- A user (unsafely) meddles with the contents of the data path, maybe restoring an old index folder from a backup
- A disk partially fails and the user has no replicas and no snapshots and wants to (unsafely) recover whatever they can
- A cluster loses all master nodes and those are (unsafely) restored from backup, but the backup does not contain the index.
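The first case above is easier to see with a toy model. The following is a minimal sketch, not the actual implementation: it assumes the graveyard behaves like a bounded list of tombstones that drops its oldest entry when full, and the names `ClusterState` and `classify_on_rejoin` are made up for illustration.

```python
from collections import deque

GRAVEYARD_SIZE = 500  # tombstone limit taken from the description above


class ClusterState:
    """Toy cluster state: live indices plus a bounded list of tombstones."""

    def __init__(self, graveyard_size=GRAVEYARD_SIZE):
        self.indices = set()
        # A deque with maxlen silently drops its oldest entry when full,
        # which is the "overflow" assumed in this sketch.
        self.graveyard = deque(maxlen=graveyard_size)

    def create(self, uuid):
        self.indices.add(uuid)

    def delete(self, uuid):
        self.indices.discard(uuid)
        self.graveyard.append(uuid)


def classify_on_rejoin(state, on_disk_uuid):
    """Decide what a rejoining node should make of an index found on disk."""
    if on_disk_uuid in state.indices:
        return "known"
    if on_disk_uuid in state.graveyard:
        return "deleted"  # tombstone found: the local copy can be removed
    return "dangling"  # no record either way


state = ClusterState()
state.create("old-index")
state.delete("old-index")  # the offline node still holds this index on disk
before = classify_on_rejoin(state, "old-index")

# Delete 500 more indices while the node is offline: the tombstone for
# "old-index" is pushed out of the graveyard.
for i in range(GRAVEYARD_SIZE):
    state.create(f"tmp-{i}")
    state.delete(f"tmp-{i}")
after = classify_on_rejoin(state, "old-index")
```

In this model `before` is "deleted" and `after` is "dangling": once the tombstone is gone, the cluster can no longer tell a deleted index from one it has never seen.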
Today we greedily and automatically import any dangling indices found on disk if possible, with surprising results:
- A deleted index may suddenly reappear when a node joins the cluster.
- A user may delete an index and see the immediate creation of another index with the same name, containing stale mappings and old data. They may start to index into this ancient index before realising. Data loss abounds.
- We may not be able to find copies of all of the shards of the index, resulting in a red cluster state.
- We do not attempt to import the freshest metadata for the index, and use a possibly-stale copy of the in-sync set to pick primaries. Data loss abounds.
What can we do about this?
In the long run we would prefer to avoid auto-importing dangling indices, but we must recognise that there are some desperate situations where a dangling index import is the best option and must therefore continue to support it. Rather than automatically importing a dangling index as soon as it is discovered, we could offer an API to help users manage their dangling indices. Something like this:
GET /_dangling
Gets a list of the dangling indices across the cluster. The response could include the index metadata (the one with the highest version in case of conflict) and mappings and some information about the underlying shards to help the user decide whether it should be deleted without needing to import it first.
DELETE /_dangling/$INDEX_UUID
Marks the dangling index for deletion.
POST /_dangling/$INDEX_UUID
Imports the given index into the cluster. This would require a body with accept_data_loss: true. It may be necessary to allow dangling indices to be recovered under a different name too. Maybe we should allow specifying a particular node in case of conflicting metadata versions.
It should also be possible to use a wildcard, i.e. POST /_dangling/*, so that if a user is in a desperate situation, they can still quickly import any dangling indices without having to iterate over the whole list.
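To make the intended semantics concrete, here is a rough in-memory sketch of the three operations. It is only an illustration of the proposal, not an implementation: the class and method names (`DanglingCopy`, `DanglingIndices`, `list_dangling`, `mark_deleted`, `import_index`) are hypothetical, and the choice to reject a node override combined with a wildcard is an assumption made here to keep the example small.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DanglingCopy:
    """One node's on-disk copy of an index missing from the cluster state."""
    index_uuid: str
    index_name: str
    node_id: str
    metadata_version: int


class DanglingIndices:
    """Hypothetical stand-in for the proposed /_dangling endpoints."""

    def __init__(self, copies):
        self._copies = list(copies)
        self.imported = {}    # index_uuid -> copy whose metadata was used
        self.deleted = set()  # index_uuids marked for deletion

    def list_dangling(self):
        """GET /_dangling: one entry per UUID, highest metadata version wins."""
        best = {}
        for copy in self._copies:
            uuid = copy.index_uuid
            if uuid in self.imported or uuid in self.deleted:
                continue
            current = best.get(uuid)
            if current is None or copy.metadata_version > current.metadata_version:
                best[uuid] = copy
        return best

    def mark_deleted(self, index_uuid):
        """DELETE /_dangling/$INDEX_UUID"""
        if index_uuid not in self.list_dangling():
            raise KeyError(index_uuid)
        self.deleted.add(index_uuid)

    def import_index(self, index_uuid, body, node_id=None):
        """POST /_dangling/$INDEX_UUID, or POST /_dangling/* for everything."""
        if body.get("accept_data_loss") is not True:
            raise ValueError("accept_data_loss must be true")
        if index_uuid == "*" and node_id is not None:
            # Assumption of this sketch: a node override needs a single index.
            raise ValueError("node_id cannot be combined with a wildcard")
        dangling = self.list_dangling()
        targets = list(dangling) if index_uuid == "*" else [index_uuid]
        for uuid in targets:
            if uuid not in dangling:
                raise KeyError(uuid)
            chosen = dangling[uuid]
            if node_id is not None:
                # Explicit node choice overrides the highest-version default.
                matches = [c for c in self._copies
                           if c.index_uuid == uuid and c.node_id == node_id]
                if not matches:
                    raise KeyError(node_id)
                chosen = matches[0]
            self.imported[uuid] = chosen
        return targets
```

With two copies of the same index at metadata versions 3 and 7, `list_dangling()` would report the version 7 copy, and `import_index("*", {"accept_data_loss": True})` would import every index that has not been marked for deletion.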
With this API we would warn the user about the existence of dangling indices through some UI (e.g. periodic log messages, or something in Kibana) and it would then be up to them to resolve that warning at their convenience.
The API sketch above is predicated on being able to disable the automatic import of dangling indices. We propose to introduce a new setting, which will default to disabling automatic imports. At a later date we will remove the setting, along with the automatic import functionality, since it is inherently unsafe.
Steps
- Write test case to exercise the dangling indices case (New setting to prevent automatically importing dangling indices, #49174)
- Introduce new setting to disable automatic dangling index imports (New setting to prevent automatically importing dangling indices, #49174)
- Add API for:
  - Listing dangling indices
  - Importing a dangling index
  - Deleting a dangling index
- Add documentation
- Default the new setting to off
- Remove auto-import from v8.0.0
- Add list of dangling indices to the support diagnostics tool