[BUG][Segment Replication] ReplicationFailedException and ALLOCATION_FAILED #9966
Description
Describe the bug
Shard failure, reason [replication failure], failure [ReplicationFailedException].
Failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException.
Logs:
```
[2023-09-08T09:28:10,201][WARN ][o.o.c.r.a.AllocationService] [master-2] failing shard [failed shard, shard [ppe-000298][0], node[ylgTCi8VSk-iytumnaGxlg], [R], s[STARTED], a[id=1OR-X-XDTrCMLvXWr8k0sw], message [shard failure, reason [replication failure]], failure [ReplicationFailedException[[ppe-000298][0]: Replication failed on (failed to clean after replication)]; nested: CorruptIndexException[Problem reading index. (resource=/usr/share/opensearch/data/nodes/0/indices/rIJ86tpXTIG4h-Cn_MoPRg/0/index/_7tvv.cfe)]; nested: NoSuchFileException[/usr/share/opensearch/data/nodes/0/indices/rIJ86tpXTIG4h-Cn_MoPRg/0/index/_7tvv.cfe]; ], markAsStale [true]]

[2023-09-08T09:28:20,522][WARN ][o.o.c.r.a.AllocationService] [master-2] failing shard [failed shard, shard [ppe-000298][0], node[EDhutdeXT5W5luFLpIF7sw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=5cUw3rGZSbuWLOSrQygkvA], unassigned_info[[reason=ALLOCATION_FAILED], at[2023-09-08T09:28:10.201Z], failed_attempts[1], delayed=false, details[failed shard on node [ylgTCi8VSk-iytumnaGxlg]: shard failure, reason [replication failure], failure ReplicationFailedException[[ppe-000298][0]: Replication failed on (failed to clean after replication)]; nested: CorruptIndexException[Problem reading index. (resource=/usr/share/opensearch/data/nodes/0/indices/rIJ86tpXTIG4h-Cn_MoPRg/0/index/_7tvv.cfe)]; nested: NoSuchFileException[/usr/share/opensearch/data/nodes/0/indices/rIJ86tpXTIG4h-Cn_MoPRg/0/index/_7tvv.cfe]; ], allocation_status[no_attempt]], expected_shard_size[13863289464], message [failed to create shard], failure [IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[ppe-000298][0]: obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [19241322ms]]; ], markAsStale [true]]

[2023-09-08T09:28:30,607][WARN ][o.o.c.r.a.AllocationService] [master-2] failing shard [failed shard, shard [ppe-000298][0], node[ylgTCi8VSk-iytumnaGxlg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=A7J2Jm_7SAOY2RLIXH-qZA], unassigned_info[[reason=ALLOCATION_FAILED], at[2023-09-08T09:28:20.522Z], failed_attempts[2], failed_nodes[[EDhutdeXT5W5luFLpIF7sw]], delayed=false, details[failed shard on node [EDhutdeXT5W5luFLpIF7sw]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[ppe-000298][0]: obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [19241322ms]]; ], allocation_status[no_attempt]], message [failed to create shard], failure [IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[ppe-000298][0]: obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [20397ms]]; ], markAsStale [true]]
```
It appears the segment replication event failed with a CorruptIndexException caused by a missing segment file:
NoSuchFileException: "/usr/share/opensearch/data/nodes/0/indices/rIJ86tpXTIG4h-Cn_MoPRg/0/index/_7tvv.cfe" does not exist.
After that, every attempt to re-allocate the replica of shard 0 fails with a ShardLockObtainFailedException.
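A quick way to confirm the missing segment file is to list the shard's index directory directly on the affected data node (a minimal sketch, assuming shell access and the data path taken from the logs above):

```sh
# Check whether the segment file reported by NoSuchFileException still exists
# (path copied from the log output above)
ls -l /usr/share/opensearch/data/nodes/0/indices/rIJ86tpXTIG4h-Cn_MoPRg/0/index/ | grep _7tvv
```

The cluster allocation explanation for the unassigned replica shows that allocation is now blocked by the max_retry decider: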
{ "index": "ppe-000298", "shard": 0, "primary": false, "current_state": "unassigned", "unassigned_info": { "reason": "ALLOCATION_FAILED", "at": "2023-09-08T14:12:35.637Z", "failed_allocation_attempts": 5, "details": "failed shard on node [ylgTCi8VSk-iytumnaGxlg]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[ppe-000298][0]: obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [17065425ms]]; ", "last_allocation_status": "no_attempt" }, "can_allocate": "no", "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes", "node_allocation_decisions": [ { "node_id": "17zuTMYtQ9KvKKNa7gm0Ig", "node_name": “\data-az1-1", "transport_address": “*.*.*.*:9300", "node_attributes": { "zone": "az1", "shard_indexing_pressure_enabled": "true" }, "node_decision": "no", "deciders": [ { "decider": "max_retry", "decision": "NO", "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2023-09-08T14:12:35.637Z], failed_attempts[5], failed_nodes[[EDhutdeXT5W5luFLpIF7sw, ylgTCi8VSk-iytumnaGxlg]], delayed=false, details[failed shard on node [ylgTCi8VSk-iytumnaGxlg]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[ppe-000298][0]: obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [17065425ms]]; ], allocation_status[no_attempt]]]" } ] }
Eventually the replica shards fall too far behind the primary, producing a large replication lag.
Screenshots
```
index      shard prirep state      docs     store  ip     node
ppe-000298 1     p      STARTED    33768999 31.4gb 10...* data-az2-3
ppe-000298 1     r      STARTED    1412441  1.3gb  10...* data-az1-4
ppe-000298 2     p      STARTED    33763101 35.3gb 10...* data-az1-6
ppe-000298 2     r      STARTED    5928658  5.1gb  10...* data-az2-2
ppe-000298 0     p      STARTED    33758088 30.1gb 10...* data-az2-1
ppe-000298 0     r      UNASSIGNED
```
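The listing above matches the default `_cat/shards` columns; it can be reproduced with something like the following (host/port again an assumption):

```sh
# Show per-shard state, doc counts, and store size for the affected index
curl -s 'http://localhost:9200/_cat/shards/ppe-000298?v'
```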
Host/Environment (please complete the following information):
- OS: Linux
- Version: 2.8.0
We have tried to manually reroute the shard allocation, but that did not help.
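A retry after the max_retry decider trips would look like the following (a sketch based on the command suggested in the allocation explanation above; host/port is an assumption):

```sh
# Clear the failed-allocation counter and retry assigning the replica,
# as suggested by the max_retry decider in the allocation explanation
curl -s -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true&pretty'
```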