Description
Today, if we detect shard corruption then we mark the store as corrupt and refuse to open it again. If there are no replicas then you might be able to use Lucene's CheckIndex to remove the corrupted segments, although this does not remove the corruption marker, requires knowledge of our filesystem layout, and might be tricky to do in a containerised or heavily automated environment. The only way forward via the API is to force the allocation of an empty primary, which drops all the data in the shard. We have an `index.shard.check_on_startup: fix` setting, but this is suboptimal for a couple of reasons:
- it’s index-wide and requires closing and verifying the whole index.
- it has no effect on shards that have a corruption marker, because the corruption marker is checked before this option takes effect.
(it also does nothing in versions 6.0 and above, but that's another story)
The Right Way™ to recover a corrupted shard is certainly to fail it and recover another copy from one of its replicas, assuming such a thing exists. However, we've seen a couple of cases recently where a user was running without replicas, e.g. to do a bulk load of data (which we sort of suggest might be a good idea sometimes), and hit some corruption that they'd have preferred to recover from with a bit of data loss rather than by restarting the load or allocating an empty primary.
I propose removing the `fix` option of the `index.shard.check_on_startup` setting and instead adding another dangerous forced-allocation command that can attempt to allocate a primary on top of a corrupt store by fixing the store and removing its corruption marker.
/cc @tsouza @ywelsch re. this forum thread
Concrete points and open questions:
- Tool name: `elasticsearch-shard`, with subcommand `remove-corrupted-segments`
  - the main goal is to fix a corrupted index, but the action is destructive, so no `fix` or `repair` in the name; avoid `truncate` too, as it is far from Lucene terminology
- Available options for `remove-corrupted-segments`:
  - `--index-name index_name` and `--shard-id shard_id` (mandatory)
  - alternative: `-d path_to_index_folder` or `--dir path_to_index_folder`
  - `--dry-run`: do a fast check without actually dropping corrupted segments
  - no options means exorcise
  - interactive keyboard confirmation is required
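Under this proposal, invocations might look like the following. This is only a sketch of the flags listed above; the exact names are still open questions, and the `--dir` path is an illustrative placeholder:

```sh
# Dry run: report corrupted segments without dropping anything
elasticsearch-shard remove-corrupted-segments --index-name my_index --shard-id 0 --dry-run

# Alternative: point at the shard's index folder directly (placeholder path)
elasticsearch-shard remove-corrupted-segments --dir /var/data/nodes/0/indices/my_index/0/index --dry-run

# No --dry-run means exorcise: actually drop the corrupted segments,
# after an interactive keyboard confirmation
elasticsearch-shard remove-corrupted-segments --index-name my_index --shard-id 0
```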
- Merge `elasticsearch-translog` into `elasticsearch-shard`:
  - `elasticsearch-translog` becomes `elasticsearch-shard truncate-translog`
  - `elasticsearch-translog` has only the `-d` option to specify a folder; it would be nice to have `--index-name index_name` and `--shard-id shard_id` here too
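The merge would roughly amount to a rename, sketched below with an illustrative translog path; whether the subcommand also grows `--index-name`/`--shard-id` is one of the open questions:

```sh
# Today: standalone tool, folder only (placeholder path)
elasticsearch-translog truncate -d /var/data/nodes/0/indices/my_index/0/translog

# Proposed: the same functionality as a subcommand of elasticsearch-shard
elasticsearch-shard truncate-translog -d /var/data/nodes/0/indices/my_index/0/translog
```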
- Exit immediately if there is no corruption marker file (for both subcommands)
- Segments that are actually missing are an unrecoverable case for `CheckIndex`; we leave this as unrecoverable, referring the user to how to allocate an empty shard. There is room for improvement here: LUCENE-6762.
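For reference, the fallback in the unrecoverable case remains today's forced allocation of an empty primary via the cluster reroute API, which makes the data loss explicit; index and node names below are placeholders:

```sh
# Drops all data in the shard; requires explicitly accepting data loss
curl -X POST "localhost:9200/_cluster/reroute" -H 'Content-Type: application/json' -d'
{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "my_index",
        "shard": 0,
        "node": "node-1",
        "accept_data_loss": true
      }
    }
  ]
}'
```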