
[RFC] Clean up unreferenced files in case segment merge fails #8024

Closed
@RS146BIJAY

Description

Overview

OpenSearch periodically merges multiple smaller segments into larger segments to keep the segment count at bay and to expunge deleted documents. A merge operation can require up to 3x the space of the segments being merged: the original segments, the newly written merged segment, and a transient copy if a compound file is created (up to 3x the entire shard size in case of a force merge with max_num_segments = 1). If enough space is not available, the merge can fail at any intermediate step. Unreferenced files created by a failed segment merge can take up a lot of space, leading to a full data volume and potential node drops.

Unreferenced Files

In Lucene, unreferenced files are shard files that are no longer live or used by the shard: no Lucene commit (segments_N file) references them, nor are they actively updated by the IndexWriter. When a segment merge fails because the disk is full, multiple unreferenced files can be left behind, and they continue to occupy space on the data volume. Lucene intentionally does not delete these files: it marks a disk-full failure as a tragedy. By doing so, Lucene avoids wasting IO/CPU cycles (if it cleaned up these files, the same segment merge would be retried, filling up the space and clearing unreferenced files again, in a loop that wastes CPU/IO cycles).

Proposed Solutions

This document analyses multiple proposals to clean up these unreferenced files when a segment merge fails.

Approach 1 (Inside Lucene)

One approach is for Lucene itself to clean up unreferenced files when a segment merge fails because the disk is full. As discussed above, Lucene treats such a failure as a tragedy: it skips deleting the unreferenced files and closes the IndexWriter, after which OpenSearch closes the corresponding shard.

Since a disk becoming 100% full does not necessarily mean the system cannot recover (unlike, say, a VirtualMachineError), an IOException caused by a full disk should not be treated as tragic. Instead, in such a scenario, Lucene should delete the unreferenced files generated during the failed segment merge.
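A failure handler following this reasoning needs to distinguish a disk-full IOException from other failures. Below is a minimal stdlib-only sketch of such a check; the class name and the message heuristics are assumptions for illustration, not Lucene's actual detection logic:

```java
import java.io.IOException;

public class DiskFullClassifier {
    // Heuristic: walk the cause chain and look for messages that
    // typically indicate a full disk. The message strings below are
    // platform-dependent assumptions, not an exhaustive list.
    public static boolean isDiskFull(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            String msg = cur.getMessage();
            if (msg != null
                    && (msg.contains("No space left on device")
                        || msg.contains("Not enough space"))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        IOException diskFull = new IOException("No space left on device");
        IOException other = new IOException("connection reset");
        System.out.println(isDiskFull(diskFull)); // true
        System.out.println(isDiskFull(other));    // false
    }
}
```

A real implementation would more robustly inspect the exception type or query the filesystem directly rather than rely on message strings.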

Once cleanup completes, Lucene can close the IndexWriter instance as usual to prevent any parallel operation (such as concurrent writes or relocation) from changing the shard state during the cleanup. After this, OpenSearch will close the InternalEngine and mark the shard as unassigned, which triggers a local shard recovery that reopens the InternalEngine and IndexWriter and brings the shard back to the STARTED state.

To prevent the above operations from getting into a loop (segment merge → cleanup → segment merge), we can further change the Lucene merge policy to disallow segment merges when enough space is not available. This avoids wasting IO/CPU cycles.
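As an illustration of the space check such a merge policy could perform, here is a stdlib-only sketch. The FreeSpaceGate name is an assumption, and the 3x multiplier follows the estimate given earlier; in Lucene this logic would live in a custom MergePolicy (for example a FilterMergePolicy wrapper) rather than a standalone class:

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

public class FreeSpaceGate {
    // A merge can need up to roughly 3x the size of the segments being
    // merged (originals + merged segment + transient compound copy).
    private static final long SPACE_MULTIPLIER = 3;

    // Returns true if the volume holding `indexPath` has enough usable
    // space to merge segments totalling `mergeBytes`.
    public static boolean canMerge(Path indexPath, long mergeBytes)
            throws IOException {
        FileStore store = Files.getFileStore(indexPath);
        return store.getUsableSpace() >= SPACE_MULTIPLIER * mergeBytes;
    }

    public static void main(String[] args) throws IOException {
        // Merging 1 MB of segments needs ~3 MB free on this volume.
        System.out.println(canMerge(Path.of("."), 1L << 20));
    }
}
```

A policy wired to this check would simply return no candidate merges from its merge-selection method whenever `canMerge` is false, so writes proceed while merges pause until space frees up.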

Discussion Thread

apache/lucene#12228

Pros

  1. The biggest benefit of this approach is that Lucene continues to manage segment files; OpenSearch does not need to handle the unreferenced files Lucene generates.
  2. It is a cleaner approach because OpenSearch does not need to wait for Lucene operations to complete while the unreferenced files are cleaned up.

Cons

  1. Lucene does not have the concept of write blocks. If we make this change inside Lucene, there can be a scenario where writes are still allowed even though segment merges are not happening. Segment counts would then grow, and we could start hitting open file descriptor limits, causing performance issues.

Approach 2 (Inside OpenSearch)

Another approach is to handle unreferenced file cleanup within OpenSearch itself. In this approach, OpenSearch performs the cleanup once the shard is closed and marked as unassigned, at the end of the failEngine function inside InternalEngine. Before cleaning the unreferenced files, we validate that the engine failed because the data volume filled up during a segment merge; only then do we perform the cleanup.

Since only Lucene knows which files are unreferenced (it maintains this information inside IndexFileDeleter), OpenSearch removes unreferenced files by creating an IndexWriter instance inside a try-with-resources block (an approach already used elsewhere in OpenSearch). Opening the IndexWriter internally creates a new IndexFileDeleter, which removes the files not referenced by any of the commits.
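Conceptually, the cleanup performed by IndexFileDeleter is a set difference: delete every file in the shard directory that no commit point references. A simplified stdlib-only model of that computation (the file names and the helper class are illustrative assumptions; the real logic lives inside Lucene):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UnreferencedFiles {
    // Returns the files present in the directory that no commit
    // (segments_N) references; these are the candidates for deletion.
    public static Set<String> findUnreferenced(
            Set<String> filesInDirectory,
            List<Set<String>> filesPerCommit) {
        Set<String> referenced = new HashSet<>();
        for (Set<String> commit : filesPerCommit) {
            referenced.addAll(commit);
        }
        Set<String> unreferenced = new HashSet<>(filesInDirectory);
        unreferenced.removeAll(referenced);
        return unreferenced;
    }

    public static void main(String[] args) {
        Set<String> onDisk = Set.of("segments_2", "_0.cfs", "_1.cfs", "_2.tmp");
        // The only live commit references _0.cfs and _1.cfs.
        List<Set<String>> commits =
            List.of(Set.of("segments_2", "_0.cfs", "_1.cfs"));
        System.out.println(findUnreferenced(onDisk, commits));
        // prints [_2.tmp] (a leftover from the failed merge)
    }
}
```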

Once the shard is closed and cleanup is completed, OpenSearch marks the shard as unassigned. Since this is a cluster state change (the shard state changes to unassigned) and a valid copy of the shard still exists on the node, OpenSearch triggers a local recovery for this shard. Once recovery completes and the translog is replayed, the shard state is marked as STARTED.

To prevent the cleanup operation and segment merges from getting into a loop (segment merge → cleanup → segment merge), we can further change the OpenSearch merge policy to disallow segment merges when enough space is not available. This avoids wasting IO/CPU cycles.

Pros

  1. It is a cleaner approach because we do not need to explicitly handle multithreading scenarios while cleaning up unreferenced files.
  2. Since cleanup is performed after the shard has failed, no parallel operations run on that shard while unreferenced files are being cleaned up.
  3. We also do not need to handle operations like reopening the IndexWriter and IndexReader; local shard recovery handles these after cleanup completes.

Cons

  1. We need to fail the shard in this approach.

How Can You Help?

  1. Provide early feedback on any issues you see with the above approaches.

Next Steps

We will incorporate feedback and continue with more concrete prototypes.


Labels: RFC, Storage, enhancement
