[BUG][Segment Replication] Stuck Initializing Shard w/ Segment Replication enabled #6084

Closed
mattweber opened this issue Jan 30, 2023 · 10 comments
Labels: bug, distributed framework

@mattweber
Contributor

Describe the bug
I recently enabled segment replication on an OpenSearch 2.5.0 cluster and somehow ran into a stuck initializing shard. The only things different on this cluster are the 2.5.0 upgrade and enabling segment replication, so I am leaning toward this being related to segment replication since it is experimental.

To Reproduce
I have not been able to reproduce this, and I am not sure how it even happened in the first place.

Expected behavior
Shards initialize and finish

Additional context
I didn't see anything suspicious in the logs, and the issue was easy to resolve using the cluster reroute API.
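
(For reference, a stuck copy like this shows up in the shard and recovery listings; a minimal check, assuming $host points at any node's HTTP endpoint:)

# List shard copies that have not reached STARTED.
curl -s "$host/_cat/shards?v" | grep -E "INITIALIZING|RELOCATING|UNASSIGNED"

# Show only recoveries that are still in progress.
curl -s "$host/_cat/recovery?active_only=true&v=true"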

@mattweber added the bug and untriaged labels on Jan 30, 2023
@reta
Collaborator

reta commented Jan 30, 2023

@mattweber could it be related to this bug [1]? Do you have any exceptions in the logs to pinpoint the issue?

[1] #5701

@dreamer-89
Member

dreamer-89 commented Jan 30, 2023

Thank you @mattweber for reporting this. I have also seen this error once while working on #5898, where a replica shard remained INITIALIZING and never caught up with the primary. Performing _refresh and _flush also did not change the state. I did not cut an issue at the time because I couldn't reproduce it. One thing to note: this happened on an actual cluster with 3 data nodes and 1 master node, with 1 shard and 1 replica as index settings, for the only index I created.
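
(For reference, the refresh and flush calls mentioned above look roughly like this; "test-index" is a placeholder for the actual index name:)

# Neither of these changed the replica's state in the run described above.
curl -X POST "$host/test-index/_refresh"
curl -X POST "$host/test-index/_flush"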

@mattweber
Contributor Author

@reta maybe; I will go dig through the logs again and see what I find. It's a little rough as I have 48 nodes and am not shipping logs in this environment.

@dreamer-89 yes, I debated whether to even open this and decided to do so just in case someone else had hit it and had more info. Looks like that was a good call. You can resolve it using the cluster reroute API with the cancel + allocate_replica commands.

My particular index had 24 shards and 1 replica, and only a single replica copy ran into this issue.
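
For anyone else who hits this, the workaround looks roughly like the following; the index name, shard number, and node names are placeholders for your stuck replica copy:

# Cancel the stuck replica copy, then allocate a fresh one.
curl -X POST "$host/_cluster/reroute" -H 'Content-Type: application/json' -d'
{
  "commands": [
    { "cancel": { "index": "my-index", "shard": 14, "node": "node-with-stuck-replica", "allow_primary": false } },
    { "allocate_replica": { "index": "my-index", "shard": 14, "node": "other-data-node" } }
  ]
}'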

@anasalkouz changed the title from "[BUG] Stuck Initializing Shard w/ Segment Replication enabled" to "[BUG][Segment Replication] Stuck Initializing Shard w/ Segment Replication enabled" on Jan 31, 2023
@dreamer-89 self-assigned this on Mar 6, 2023
@dreamer-89
Member

Looking into it.

@dreamer-89
Member

dreamer-89 commented Mar 7, 2023

Not able to reproduce this on main or the latest 2.x using the following:

  1. Integ test. Created an integration test to mimic the setup with a large number of primaries and node restarts. Code link
  2. Manual experiment. Performed node stops during data ingestion. Used the nyc_taxis workload from OpenSearch Benchmark to drive ingestion, on the same 3 cluster-manager node, 3 data node setup used previously.

Index settings

{
    "index_settings":{
        "number_of_replicas": 1,
        "number_of_shards": 24,
        "index":{
           "replication": {"type": "SEGMENT"}
        }
    }
}
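
(For reference, applying the same settings directly at index creation would look roughly like this; the index name is a placeholder, and at the time segment replication was experimental and required the corresponding feature flag to be enabled on the cluster:)

curl -X PUT "$host/nyc_taxis" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "number_of_shards": 24,
      "number_of_replicas": 1,
      "replication": { "type": "SEGMENT" }
    }
  }
}'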

@dreamer-89
Member

dreamer-89 commented Mar 7, 2023

Able to reproduce the issue on an older 2.x branch (containing commits up to Jan 30, around the time this issue was reported).

Failure

I see shards getting stuck in initialization for a couple of reasons:

Ongoing replication

On the cluster-manager node, I see shard failure exceptions (code link) for the stuck shard [nyc_taxis][14] because of ongoing replication (log trace below). This happens when the source node receives a get_checkpoint_info request from the target but there is already an ongoing round of segment replication for that copy.

Log trace

[2023-03-07T18:40:05,057][ERROR][o.o.i.r.SegmentReplicationTargetService] [ip-10-0-3-212.us-west-2.compute.internal] replication failure
org.opensearch.OpenSearchException: Segment Replication failed
        at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:302) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:103) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [opensearch-2.6.0.jar:2.6.0]
        at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
        at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:82) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:62) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionListener$4.onFailure(ActionListener.java:190) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionListener$6.onFailure(ActionListener.java:309) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.support.RetryableAction$RetryingListener.onFinalFailure(RetryableAction.java:218) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.support.RetryableAction$RetryingListener.onFailure(RetryableAction.java:210) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:74) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1414) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:420) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [opensearch-2.6.0.jar:2.6.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.transport.RemoteTransportException: [ip-10-0-4-214.us-west-2.compute.internal][10.0.4.214:9300][internal:index/shard/replication/get_checkpoint_info]
Caused by: org.opensearch.OpenSearchException: Shard copy [nyc_taxis][14] on node {ip-10-0-3-212.us-west-2.compute.internal}{gG95UUzJTfiBg9AXRb-OAQ}{GFGDoa4bSCS1c1ccBx5zvg}{10.0.3.212}{10.0.3.212:9300}{di}{shard_indexing_pressure_enabled=true} already replicating
        at org.opensearch.indices.replication.OngoingSegmentReplications.prepareForReplication(OngoingSegmentReplications.java:147) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:134) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:115) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.6.0.jar:2.6.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:833) ~[?:?]
[2023-03-07T18:40:05,081][WARN ][o.o.i.e.Engine           ] [ip-10-0-3-212.us-west-2.compute.internal] [nyc_taxis][14] failed engine [replication failure]
org.opensearch.OpenSearchException: Segment Replication failed
        at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:302) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:103) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [opensearch-2.6.0.jar:2.6.0]
        at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
        at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:82) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:62) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionListener$4.onFailure(ActionListener.java:190) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionListener$6.onFailure(ActionListener.java:309) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.support.RetryableAction$RetryingListener.onFinalFailure(RetryableAction.java:218) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.support.RetryableAction$RetryingListener.onFailure(RetryableAction.java:210) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:74) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1414) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:420) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [opensearch-2.6.0.jar:2.6.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.transport.RemoteTransportException: [ip-10-0-4-214.us-west-2.compute.internal][10.0.4.214:9300][internal:index/shard/replication/get_checkpoint_info]
Caused by: org.opensearch.OpenSearchException: Shard copy [nyc_taxis][14] on node {ip-10-0-3-212.us-west-2.compute.internal}{gG95UUzJTfiBg9AXRb-OAQ}{GFGDoa4bSCS1c1ccBx5zvg}{10.0.3.212}{10.0.3.212:9300}{di}{shard_indexing_pressure_enabled=true} already replicating
        at org.opensearch.indices.replication.OngoingSegmentReplications.prepareForReplication(OngoingSegmentReplications.java:147) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:134) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:115) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.6.0.jar:2.6.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:833) ~[?:?]
[2023-03-07T18:40:05,087][WARN ][o.o.i.c.IndicesClusterStateService] [ip-10-0-3-212.us-west-2.compute.internal] [nyc_taxis][14] marking and sending shard failed due to [shard failure, reason [replication failure]]
org.opensearch.OpenSearchException: Segment Replication failed
        at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:302) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:103) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) ~[opensearch-2.6.0.jar:2.6.0]
        at java.util.ArrayList.forEach(ArrayList.java:1511) ~[?:?]
        at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:82) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:62) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionListener$4.onFailure(ActionListener.java:190) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionListener$6.onFailure(ActionListener.java:309) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.support.RetryableAction$RetryingListener.onFinalFailure(RetryableAction.java:218) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.support.RetryableAction$RetryingListener.onFailure(RetryableAction.java:210) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:74) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1414) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:420) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [opensearch-2.6.0.jar:2.6.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.transport.RemoteTransportException: [ip-10-0-4-214.us-west-2.compute.internal][10.0.4.214:9300][internal:index/shard/replication/get_checkpoint_info]
Caused by: org.opensearch.OpenSearchException: Shard copy [nyc_taxis][14] on node {ip-10-0-3-212.us-west-2.compute.internal}{gG95UUzJTfiBg9AXRb-OAQ}{GFGDoa4bSCS1c1ccBx5zvg}{10.0.3.212}{10.0.3.212:9300}{di}{shard_indexing_pressure_enabled=true} already replicating
        at org.opensearch.indices.replication.OngoingSegmentReplications.prepareForReplication(OngoingSegmentReplications.java:147) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:134) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:115) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.6.0.jar:2.6.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:833) ~[?:?]

nyc_taxis[14] then switches to the STARTED state but holds far less data than the primary.

➜  opensearch-cluster-cdk git:(inclusive_naming_changes) ✗ curl -s -X GET "$host/_cat/shards?v"                                                             
index     shard prirep state           docs store ip         node
...
nyc_taxis 14    r      INITIALIZING               10.0.3.212 ip-10-0-3-212.us-west-2.compute.internal
...
nyc_taxis 14    r      STARTED       980147 204.6mb 10.0.3.212 ip-10-0-3-212.us-west-2.compute.internal
nyc_taxis 14    p      STARTED      6890004   1.7gb 10.0.4.214 ip-10-0-4-214.us-west-2.compute.internal
➜  opensearch-cluster-cdk git:(inclusive_naming_changes) ✗ curl -s -X GET "$host/_cat/segments?v"  | grep " 14 "
nyc_taxis 14    r      10.0.3.212 _1s             64     246944            0   51.7mb           0 true      true       9.4.2   false
nyc_taxis 14    r      10.0.3.212 _43            147     384796            0     81mb           0 true      true       9.4.2   false
nyc_taxis 14    r      10.0.3.212 _46            150      10425            0    2.1mb           0 true      true       9.4.2   true
nyc_taxis 14    r      10.0.3.212 _4e            158      38342            0    7.8mb           0 true      true       9.4.2   true
nyc_taxis 14    r      10.0.3.212 _4q            170      60419            0   12.4mb           0 true      true       9.4.2   true
nyc_taxis 14    r      10.0.3.212 _50            180      42664            0    8.7mb           0 true      true       9.4.2   true
nyc_taxis 14    r      10.0.3.212 _59            189      36522            0    7.5mb           0 true      true       9.4.2   true
nyc_taxis 14    r      10.0.3.212 _5k            200      21388            0    4.4mb           0 true      true       9.4.2   true
nyc_taxis 14    r      10.0.3.212 _5u            210      29155            0      6mb           0 true      true       9.4.2   true
nyc_taxis 14    r      10.0.3.212 _64            220      35870            0    7.4mb           0 true      true       9.4.2   true
nyc_taxis 14    r      10.0.3.212 _6e            230      40824            0    8.3mb           0 true      true       9.4.2   true
nyc_taxis 14    r      10.0.3.212 _6p            241      32798            0    6.7mb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _ac            372     483388            0  101.3mb           0 true      true       9.4.2   false
nyc_taxis 14    p      10.0.4.214 _ol            885    3014538            0  627.3mb           0 true      true       9.4.2   false
nyc_taxis 14    p      10.0.4.214 _rn            995     446877            0   93.4mb           0 true      true       9.4.2   false
nyc_taxis 14    p      10.0.4.214 _uf           1095     253889            0   53.2mb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _wx           1185     220814            0   46.2mb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _zg           1276     195674            0   41.1mb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _11p          1357     475241            0   99.3mb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _14g          1456     443813            0   92.9mb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _17k          1568     462023            0   96.5mb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1ab          1667     412496            0   86.4mb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1cv          1759       4066            0  890.8kb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1cw          1760       3805            0  829.4kb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1cx          1761       4502            0  971.9kb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1cy          1762       5041            0      1mb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1cz          1763       3804            0  829.3kb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1d0          1764       5006            0      1mb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1d1          1765       4553            0  993.4kb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1d2          1766       4404            0    951kb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1d4          1768       3692            0  795.7kb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1d5          1769     415671            0   87.6mb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1d6          1770       2904            0  631.3kb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1d7          1771       4164            0  895.6kb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1d8          1772       3346            0  719.3kb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1d9          1773       4176            0  905.4kb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1da          1774       4519            0  971.9kb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1db          1775       4105            0  875.3kb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1dc          1776       3493            0  735.4kb           0 true      true       9.4.2   true

Stuck recovery

Peer recovery stuck in the translog stage even though all operations are 100% complete. This might be happening because we were previously recovering translog operations only up to the global checkpoint rather than all available operations. Fixed in #6366.

[2023-03-07T19:35:41,924][WARN ][o.o.i.c.IndicesClusterStateService] [ip-10-0-4-214.us-west-2.compute.internal] [nyc_taxis][0] marking and sending shard failed due to [failed recovery]
org.opensearch.indices.recovery.RecoveryFailedException: [nyc_taxis][0]: Recovery failed from {ip-10-0-3-212.us-west-2.compute.internal}{gG95UUzJTfiBg9AXRb-OAQ}{GFGDoa4bSCS1c1ccBx5zvg}{10.0.3.212}{10.0.3.212:9300}{di}{shard_indexing_pressure_enabled=true} into {ip-10-0-4-214.us-west-2.compute.internal}{uuGXp5y-QfSfiD9TMG1ELw}{dsvHBBZrSyiDU_mJJEGv4Q}{10.0.4.214}{10.0.4.214:9300}{di}{shard_indexing_pressure_enabled=true} (no activity after [30m])
        at org.opensearch.indices.recovery.RecoveryTarget.notifyListener(RecoveryTarget.java:139) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.common.ReplicationTarget.fail(ReplicationTarget.java:176) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.common.ReplicationCollection.fail(ReplicationCollection.java:194) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.common.ReplicationCollection$ReplicationMonitor.doRun(ReplicationCollection.java:300) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.6.0.jar:2.6.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.OpenSearchTimeoutException: no activity after [30m]
        ... 6 more
[2023-03-07T19:36:00,498][WARN ][o.o.i.c.IndicesClusterStateService] [ip-10-0-4-214.us-west-2.compute.internal] [nyc_taxis][12] marking and sending shard failed due to [failed recovery]
org.opensearch.indices.recovery.RecoveryFailedException: [nyc_taxis][12]: Recovery failed from {ip-10-0-3-212.us-west-2.compute.internal}{gG95UUzJTfiBg9AXRb-OAQ}{GFGDoa4bSCS1c1ccBx5zvg}{10.0.3.212}{10.0.3.212:9300}{di}{shard_indexing_pressure_enabled=true} into {ip-10-0-4-214.us-west-2.compute.internal}{uuGXp5y-QfSfiD9TMG1ELw}{dsvHBBZrSyiDU_mJJEGv4Q}{10.0.4.214}{10.0.4.214:9300}{di}{shard_indexing_pressure_enabled=true} (no activity after [30m])
        at org.opensearch.indices.recovery.RecoveryTarget.notifyListener(RecoveryTarget.java:139) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.common.ReplicationTarget.fail(ReplicationTarget.java:176) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.common.ReplicationCollection.fail(ReplicationCollection.java:194) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.common.ReplicationCollection$ReplicationMonitor.doRun(ReplicationCollection.java:300) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.6.0.jar:2.6.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.OpenSearchTimeoutException: no activity after [30m]
        ... 6 more
[2023-03-07T19:36:00,568][WARN ][o.o.i.c.IndicesClusterStateService] [ip-10-0-4-214.us-west-2.compute.internal] [nyc_taxis][6] marking and sending shard failed due to [failed recovery]
org.opensearch.indices.recovery.RecoveryFailedException: [nyc_taxis][6]: Recovery failed from {ip-10-0-3-212.us-west-2.compute.internal}{gG95UUzJTfiBg9AXRb-OAQ}{GFGDoa4bSCS1c1ccBx5zvg}{10.0.3.212}{10.0.3.212:9300}{di}{shard_indexing_pressure_enabled=true} into {ip-10-0-4-214.us-west-2.compute.internal}{uuGXp5y-QfSfiD9TMG1ELw}{dsvHBBZrSyiDU_mJJEGv4Q}{10.0.4.214}{10.0.4.214:9300}{di}{shard_indexing_pressure_enabled=true} (no activity after [30m])
        at org.opensearch.indices.recovery.RecoveryTarget.notifyListener(RecoveryTarget.java:139) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.common.ReplicationTarget.fail(ReplicationTarget.java:176) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.common.ReplicationCollection.fail(ReplicationCollection.java:194) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.common.ReplicationCollection$ReplicationMonitor.doRun(ReplicationCollection.java:300) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.6.0.jar:2.6.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.OpenSearchTimeoutException: no activity after [30m]
        ... 6 more
[2023-03-07T19:36:00,607][WARN ][o.o.i.c.IndicesClusterStateService] [ip-10-0-4-214.us-west-2.compute.internal] [nyc_taxis][18] marking and sending shard failed due to [failed recovery]
org.opensearch.indices.recovery.RecoveryFailedException: [nyc_taxis][18]: Recovery failed from {ip-10-0-3-212.us-west-2.compute.internal}{gG95UUzJTfiBg9AXRb-OAQ}{GFGDoa4bSCS1c1ccBx5zvg}{10.0.3.212}{10.0.3.212:9300}{di}{shard_indexing_pressure_enabled=true} into {ip-10-0-4-214.us-west-2.compute.internal}{uuGXp5y-QfSfiD9TMG1ELw}{dsvHBBZrSyiDU_mJJEGv4Q}{10.0.4.214}{10.0.4.214:9300}{di}{shard_indexing_pressure_enabled=true} (no activity after [30m])
        at org.opensearch.indices.recovery.RecoveryTarget.notifyListener(RecoveryTarget.java:139) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.common.ReplicationTarget.fail(ReplicationTarget.java:176) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.common.ReplicationCollection.fail(ReplicationCollection.java:194) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.common.ReplicationCollection$ReplicationMonitor.doRun(ReplicationCollection.java:300) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.6.0.jar:2.6.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.OpenSearchTimeoutException: no activity after [30m]
➜  opensearch-cluster-cdk git:(inclusive_naming_changes) ✗ curl -s -X GET "$host/_cat/shards?v"                                                             
index     shard prirep state           docs store ip         node
...
nyc_taxis 12    r      INITIALIZING               10.0.4.214 ip-10-0-4-214.us-west-2.compute.internal
nyc_taxis 6     r      INITIALIZING               10.0.4.214 ip-10-0-4-214.us-west-2.compute.internal
nyc_taxis 18    r      INITIALIZING               10.0.4.214 ip-10-0-4-214.us-west-2.compute.internal
nyc_taxis 0     r      INITIALIZING               10.0.4.214 ip-10-0-4-214.us-west-2.compute.internal
...
➜  opensearch-cluster-cdk git:(inclusive_naming_changes) ✗ curl -s -X GET "$host/_cat/recovery?active_only=true&v=true" 
index     shard time  type stage    source_host source_node                              target_host target_node                              repository snapshot files files_recovered files_percent files_total bytes     bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
nyc_taxis 0     24.4m peer translog 10.0.3.212  ip-10-0-3-212.us-west-2.compute.internal 10.0.4.214  ip-10-0-4-214.us-west-2.compute.internal n/a        n/a      92    92              100.0%        92          124655858 124655858       100.0%        124655858   192573       192573                 100.0%
nyc_taxis 6     24.1m peer translog 10.0.3.212  ip-10-0-3-212.us-west-2.compute.internal 10.0.4.214  ip-10-0-4-214.us-west-2.compute.internal n/a        n/a      104   104             100.0%        104         123827602 123827602       100.0%        123827602   246428       246428                 100.0%
nyc_taxis 12    24.1m peer translog 10.0.3.212  ip-10-0-3-212.us-west-2.compute.internal 10.0.4.214  ip-10-0-4-214.us-west-2.compute.internal n/a        n/a      108   108             100.0%        108         123746451 123746451       100.0%        123746451   245567       245567                 100.0%
nyc_taxis 18    24.1m peer translog 10.0.3.212  ip-10-0-3-212.us-west-2.compute.internal 10.0.4.214  ip-10-0-4-214.us-west-2.compute.internal n/a        n/a      58    58              100.0%        58          124230649 124230649       100.0%        124230649   243704       243704                 100.0%

@dreamer-89
Member

Resolving this issue as it is not reproducible on the latest 2.x and main.

@dreamer-89
Member

dreamer-89 commented Mar 13, 2023

Hit this bug today while benchmarking the changes in #6643. Reopening.

Setup: 25 primaries with 1 replica each, on 3 data nodes. The problem happened when stopping the OpenSearch process on the node containing the nyc_taxis[22] primary. This resulted in the replica's promotion to primary, followed by recovery of a new replica onto the third node 10.0.4.133 (here). The recovery stats show 100% completed, but somehow the shard state is not changed to STARTED.

➜  opensearch-cluster-cdk git:(inclusive_naming_changes) ✗ curl -s -X GET "$host/_cat/recovery?active_only=true&v=true" 
index     shard time type stage    source_host source_node                             target_host target_node                              repository snapshot files files_recovered files_percent files_total bytes     bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
nyc_taxis 22    5m   peer translog 10.0.5.39   ip-10-0-5-39.us-west-2.compute.internal 10.0.4.133  ip-10-0-4-133.us-west-2.compute.internal n/a        n/a      70    70              100.0%        70          105104077 105104077       100.0%        105104077   492100       492100                 100.0%
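
(A minimal way to ask the cluster-manager what it thinks of the stuck copy is the allocation explain API; a sketch, assuming the nyc_taxis shard 22 replica from the output above:)

curl -s -X GET "$host/_cluster/allocation/explain" -H 'Content-Type: application/json' -d'
{
  "index": "nyc_taxis",
  "shard": 22,
  "primary": false
}'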

@mch2
Member

mch2 commented May 19, 2023

@dreamer-89

From what you posted it looks like the recovery is still in progress in the "translog" stage. For segrep replica recoveries, the translog step will only write the ops to the xlog.

➜  opensearch-cluster-cdk git:(inclusive_naming_changes) ✗ curl -s -X GET "$host/_cat/recovery?active_only=true&v=true" 
index     shard time type stage    source_host source_node                             target_host target_node                              repository snapshot files files_recovered files_percent files_total bytes     bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
nyc_taxis 22    5m   peer translog 10.0.5.39   ip-10-0-5-39.us-west-2.compute.internal 10.0.4.133  ip-10-0-4-133.us-west-2.compute.internal n/a        n/a      70    70              100.0%        70          105104077 105104077       100.0%        105104077   492100       492100                 100.0%

I haven't hit this bug where a replica never finishes recovery, but I think we have a bug here in that we are attempting to copy more xlog ops than necessary, slowing recovery. The startingSeqNo is always sent as -2 (UNASSIGNED_SEQ_NO) when starting recovery with SR. This triggers a file-based recovery and uses the local checkpoint of the safe commit as the starting point of ops to recover. This means that even if the replica already has those ops in its xlog, they will get sent. I think we can optimize this and use the global checkpoint stored in the local xlog on the replica as the starting point. After the ops are sent over, we still perform a round of segrep to catch up from the safe commit and purge the xlog where necessary.

Edit: With SR we still attempt to recover from the safe commit, so my previous statement is not accurate.

@mch2
Member

mch2 commented Jun 6, 2023

Closing this as not reproducible; please re-open if this happens with OS 2.7+.

@mch2 closed this as completed on Jun 6, 2023