[BUG][Segment Replication] ReplicationFailedException and ALLOCATION_FAILED #9966
Description
Describe the bug
Shard failure, reason [replication failure], failure [ReplicationFailedException].
Failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException.
Logs:
```
[2023-09-08T09:28:10,201][WARN ][o.o.c.r.a.AllocationService] [master-2] failing shard [failed shard, shard [ppe-000298][0], node[ylgTCi8VSk-iytumnaGxlg], [R], s[STARTED], a[id=1OR-X-XDTrCMLvXWr8k0sw], message [shard failure, reason [replication failure]], failure [ReplicationFailedException[[ppe-000298][0]: Replication failed on (failed to clean after replication)]; nested: CorruptIndexException[Problem reading index. (resource=/usr/share/opensearch/data/nodes/0/indices/rIJ86tpXTIG4h-Cn_MoPRg/0/index/_7tvv.cfe)]; nested: NoSuchFileException[/usr/share/opensearch/data/nodes/0/indices/rIJ86tpXTIG4h-Cn_MoPRg/0/index/_7tvv.cfe]; ], markAsStale [true]]

[2023-09-08T09:28:20,522][WARN ][o.o.c.r.a.AllocationService] [master-2] failing shard [failed shard, shard [ppe-000298][0], node[EDhutdeXT5W5luFLpIF7sw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=5cUw3rGZSbuWLOSrQygkvA], unassigned_info[[reason=ALLOCATION_FAILED], at[2023-09-08T09:28:10.201Z], failed_attempts[1], delayed=false, details[failed shard on node [ylgTCi8VSk-iytumnaGxlg]: shard failure, reason [replication failure], failure ReplicationFailedException[[ppe-000298][0]: Replication failed on (failed to clean after replication)]; nested: CorruptIndexException[Problem reading index. (resource=/usr/share/opensearch/data/nodes/0/indices/rIJ86tpXTIG4h-Cn_MoPRg/0/index/_7tvv.cfe)]; nested: NoSuchFileException[/usr/share/opensearch/data/nodes/0/indices/rIJ86tpXTIG4h-Cn_MoPRg/0/index/_7tvv.cfe]; ], allocation_status[no_attempt]], expected_shard_size[13863289464], message [failed to create shard], failure [IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[ppe-000298][0]: obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [19241322ms]]; ], markAsStale [true]]

[2023-09-08T09:28:30,607][WARN ][o.o.c.r.a.AllocationService] [master-2] failing shard [failed shard, shard [ppe-000298][0], node[ylgTCi8VSk-iytumnaGxlg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=A7J2Jm_7SAOY2RLIXH-qZA], unassigned_info[[reason=ALLOCATION_FAILED], at[2023-09-08T09:28:20.522Z], failed_attempts[2], failed_nodes[[EDhutdeXT5W5luFLpIF7sw]], delayed=false, details[failed shard on node [EDhutdeXT5W5luFLpIF7sw]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[ppe-000298][0]: obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [19241322ms]]; ], allocation_status[no_attempt]], message [failed to create shard], failure [IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[ppe-000298][0]: obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [20397ms]]; ], markAsStale [true]]
```
It appears the segment replication event failed with a CorruptIndexException caused by a missing segment file:
NoSuchFileException: "/usr/share/opensearch/data/nodes/0/indices/rIJ86tpXTIG4h-Cn_MoPRg/0/index/_7tvv.cfe" does not exist.
After that, every attempt to re-allocate the replica of shard 0 fails with a ShardLockObtainFailedException.
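A quick way to confirm the missing segment file is to list the shard's index directory directly on the affected data node (a minimal sketch, assuming shell access and the data path taken from the logs above):

```sh
# Check whether the segment file reported by NoSuchFileException still exists
# (path copied from the log output above)
ls -l /usr/share/opensearch/data/nodes/0/indices/rIJ86tpXTIG4h-Cn_MoPRg/0/index/ | grep _7tvv
```

The cluster allocation explanation for the unassigned replica shows that allocation is now blocked by the max_retry decider: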
{ "index": "ppe-000298", "shard": 0, "primary": false, "current_state": "unassigned", "unassigned_info": { "reason": "ALLOCATION_FAILED", "at": "2023-09-08T14:12:35.637Z", "failed_allocation_attempts": 5, "details": "failed shard on node [ylgTCi8VSk-iytumnaGxlg]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[ppe-000298][0]: obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [17065425ms]]; ", "last_allocation_status": "no_attempt" }, "can_allocate": "no", "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes", "node_allocation_decisions": [ { "node_id": "17zuTMYtQ9KvKKNa7gm0Ig", "node_name": “\data-az1-1", "transport_address": “*.*.*.*:9300", "node_attributes": { "zone": "az1", "shard_indexing_pressure_enabled": "true" }, "node_decision": "no", "deciders": [ { "decider": "max_retry", "decision": "NO", "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2023-09-08T14:12:35.637Z], failed_attempts[5], failed_nodes[[EDhutdeXT5W5luFLpIF7sw, ylgTCi8VSk-iytumnaGxlg]], delayed=false, details[failed shard on node [ylgTCi8VSk-iytumnaGxlg]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[ppe-000298][0]: obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [17065425ms]]; ], allocation_status[no_attempt]]]" } ] }
Eventually the replica shards fall too far behind the primary, producing a large replication lag.
Screenshots
```
index      shard prirep state      docs     store  ip     node
ppe-000298 1     p      STARTED    33768999 31.4gb 10...* data-az2-3
ppe-000298 1     r      STARTED    1412441  1.3gb  10...* data-az1-4
ppe-000298 2     p      STARTED    33763101 35.3gb 10...* data-az1-6
ppe-000298 2     r      STARTED    5928658  5.1gb  10...* data-az2-2
ppe-000298 0     p      STARTED    33758088 30.1gb 10...* data-az2-1
ppe-000298 0     r      UNASSIGNED
```
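The listing above matches the default `_cat/shards` columns; it can be reproduced with something like the following (host/port again an assumption):

```sh
# Show per-shard state, doc counts, and store size for the affected index
curl -s 'http://localhost:9200/_cat/shards/ppe-000298?v'
```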
Host/Environment (please complete the following information):
- OS: Linux
- Version: 2.8.0
We have tried to manually reroute the shard allocation, but that did not help.
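A retry after the max_retry decider trips would look like the following (a sketch based on the command suggested in the allocation explanation above; host/port is an assumption):

```sh
# Clear the failed-allocation counter and retry assigning the replica,
# as suggested by the max_retry decider in the allocation explanation
curl -s -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true&pretty'
```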