[BUG][Segment Replication] Stuck Initializing Shard w/ Segment Replication enabled #6084

Closed
mattweber opened this issue Jan 30, 2023 · 10 comments
Labels: bug, distributed framework

@mattweber
Contributor

Describe the bug
I recently enabled segment replication on an OpenSearch 2.5.0 cluster and somehow ran into a stuck initializing shard. The only things different on this cluster are the 2.5.0 upgrade and enabling segment replication, so I am leaning toward this being related to segment replication since it is experimental.

To Reproduce
I have not been able to reproduce this, and I am not sure how it even happened in the first place.

Expected behavior
Shards initialize and finish

Additional context
I didn't see anything suspicious in the logs, and the issue was easy to resolve using the cluster reroute API.
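
(For reference, a stuck copy like this shows up in the shard and recovery listings; a minimal check, assuming $host points at any node's HTTP endpoint:)

# List shard copies that have not reached STARTED.
curl -s "$host/_cat/shards?v" | grep -E "INITIALIZING|RELOCATING|UNASSIGNED"

# Show only recoveries that are still in progress.
curl -s "$host/_cat/recovery?active_only=true&v=true"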

@mattweber added the bug and untriaged labels on Jan 30, 2023
@reta
Collaborator

reta commented Jan 30, 2023

@mattweber could it be related to this bug [1]? Do you have any exceptions in the logs to pinpoint the issue?

[1] #5701

@dreamer-89
Member

dreamer-89 commented Jan 30, 2023

Thank you @mattweber for reporting this. I have also seen this error once while working on #5898, where a replica shard remained INITIALIZING and never caught up with the primary. Performing _refresh and _flush also did not change the state. I did not cut an issue at the time because I couldn't reproduce it. One thing to note: this happened on an actual cluster with 3 data nodes and 1 master node, with 1 shard and 1 replica as index settings, for the only index I created.
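
(For reference, the refresh and flush calls mentioned above look roughly like this; "test-index" is a placeholder for the actual index name:)

# Neither of these changed the replica's state in the run described above.
curl -X POST "$host/test-index/_refresh"
curl -X POST "$host/test-index/_flush"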

@mattweber
Contributor Author

@reta maybe; I will go dig through the logs again and see what I find. It's a little rough as I have 48 nodes and am not shipping logs in this environment.

@dreamer-89 yes, I debated whether to even open this and decided to do so just in case someone else had hit it and had more info. Looks like that was a good call. You can resolve it using the cluster reroute API with the cancel + allocate_replica commands.

My particular index had 24 shards and 1 replica, and only a single replica copy ran into this issue.
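
For anyone else who hits this, the workaround looks roughly like the following; the index name, shard number, and node names are placeholders for your stuck replica copy:

# Cancel the stuck replica copy, then allocate a fresh one.
curl -X POST "$host/_cluster/reroute" -H 'Content-Type: application/json' -d'
{
  "commands": [
    { "cancel": { "index": "my-index", "shard": 14, "node": "node-with-stuck-replica", "allow_primary": false } },
    { "allocate_replica": { "index": "my-index", "shard": 14, "node": "other-data-node" } }
  ]
}'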

@anasalkouz changed the title from "[BUG] Stuck Initializing Shard w/ Segment Replication enabled" to "[BUG][Segment Replication] Stuck Initializing Shard w/ Segment Replication enabled" on Jan 31, 2023
@dreamer-89 self-assigned this on Mar 6, 2023
@dreamer-89
Member

Looking into it.

@dreamer-89
Member

dreamer-89 commented Mar 7, 2023

Not able to reproduce this on main or the latest 2.x using the following:

  1. Integ test. Created an integration test to mimic the setup with a large number of primaries and node restarts. Code link
  2. Manual experiment. Performed node stops during data ingestion. Used the nyc_taxis workload from OpenSearch Benchmark to drive ingestion, on the same 3 cluster-manager node, 3 data node setup used previously.

Index settings

{
    "index_settings":{
        "number_of_replicas": 1,
        "number_of_shards": 24,
        "index":{
           "replication": {"type": "SEGMENT"}
        }
    }
}
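
(For reference, applying the same settings directly at index creation would look roughly like this; the index name is a placeholder, and at the time segment replication was experimental and required the corresponding feature flag to be enabled on the cluster:)

curl -X PUT "$host/nyc_taxis" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "number_of_shards": 24,
      "number_of_replicas": 1,
      "replication": { "type": "SEGMENT" }
    }
  }
}'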

@dreamer-89
Member

dreamer-89 commented Mar 7, 2023

Able to reproduce the issue on an older 2.x branch (containing commits up to Jan 30, around the time this issue was reported).

Failure

I see shards getting stuck in initialization for a couple of reasons:

Ongoing replication

On the cluster-manager node, I see shard failure exceptions (code link) for the stuck shard [nyc_taxis][14] because of ongoing replication (log trace below). This happens when the source node receives a get_checkpoint_info request from the target but there is already an ongoing round of segment replication for that copy.

Log trace

[2023-03-07T18:40:05,057][ERROR][o.o.i.r.SegmentReplicationTargetService] [ip-10-0-3-212.us-west-2.compute.internal] replication failure
org.opensearch.OpenSearchException: Segment Replication failed
        at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:302) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:103) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [opensearch-2.6.0.jar:2.6.0]
        at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
        at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:82) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:62) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionListener$4.onFailure(ActionListener.java:190) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionListener$6.onFailure(ActionListener.java:309) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.support.RetryableAction$RetryingListener.onFinalFailure(RetryableAction.java:218) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.support.RetryableAction$RetryingListener.onFailure(RetryableAction.java:210) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:74) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1414) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:420) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [opensearch-2.6.0.jar:2.6.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.transport.RemoteTransportException: [ip-10-0-4-214.us-west-2.compute.internal][10.0.4.214:9300][internal:index/shard/replication/get_checkpoint_info]
Caused by: org.opensearch.OpenSearchException: Shard copy [nyc_taxis][14] on node {ip-10-0-3-212.us-west-2.compute.internal}{gG95UUzJTfiBg9AXRb-OAQ}{GFGDoa4bSCS1c1ccBx5zvg}{10.0.3.212}{10.0.3.212:9300}{di}{shard_indexing_pressure_enabled=true} already replicating
        at org.opensearch.indices.replication.OngoingSegmentReplications.prepareForReplication(OngoingSegmentReplications.java:147) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:134) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:115) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.6.0.jar:2.6.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:833) ~[?:?]
[2023-03-07T18:40:05,081][WARN ][o.o.i.e.Engine           ] [ip-10-0-3-212.us-west-2.compute.internal] [nyc_taxis][14] failed engine [replication failure]
org.opensearch.OpenSearchException: Segment Replication failed
        at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:302) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:103) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [opensearch-2.6.0.jar:2.6.0]
        at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
        at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:82) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:62) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionListener$4.onFailure(ActionListener.java:190) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionListener$6.onFailure(ActionListener.java:309) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.support.RetryableAction$RetryingListener.onFinalFailure(RetryableAction.java:218) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.support.RetryableAction$RetryingListener.onFailure(RetryableAction.java:210) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:74) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1414) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:420) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [opensearch-2.6.0.jar:2.6.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.transport.RemoteTransportException: [ip-10-0-4-214.us-west-2.compute.internal][10.0.4.214:9300][internal:index/shard/replication/get_checkpoint_info]
Caused by: org.opensearch.OpenSearchException: Shard copy [nyc_taxis][14] on node {ip-10-0-3-212.us-west-2.compute.internal}{gG95UUzJTfiBg9AXRb-OAQ}{GFGDoa4bSCS1c1ccBx5zvg}{10.0.3.212}{10.0.3.212:9300}{di}{shard_indexing_pressure_enabled=true} already replicating
        at org.opensearch.indices.replication.OngoingSegmentReplications.prepareForReplication(OngoingSegmentReplications.java:147) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:134) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:115) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.6.0.jar:2.6.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:833) ~[?:?]
[2023-03-07T18:40:05,087][WARN ][o.o.i.c.IndicesClusterStateService] [ip-10-0-3-212.us-west-2.compute.internal] [nyc_taxis][14] marking and sending shard failed due to [shard failure, reason [replication failure]]
org.opensearch.OpenSearchException: Segment Replication failed
        at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:302) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:103) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) ~[opensearch-2.6.0.jar:2.6.0]
        at java.util.ArrayList.forEach(ArrayList.java:1511) ~[?:?]
        at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:82) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:62) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionListener$4.onFailure(ActionListener.java:190) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionListener$6.onFailure(ActionListener.java:309) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.support.RetryableAction$RetryingListener.onFinalFailure(RetryableAction.java:218) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.support.RetryableAction$RetryingListener.onFailure(RetryableAction.java:210) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:74) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1414) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:420) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [opensearch-2.6.0.jar:2.6.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.transport.RemoteTransportException: [ip-10-0-4-214.us-west-2.compute.internal][10.0.4.214:9300][internal:index/shard/replication/get_checkpoint_info]
Caused by: org.opensearch.OpenSearchException: Shard copy [nyc_taxis][14] on node {ip-10-0-3-212.us-west-2.compute.internal}{gG95UUzJTfiBg9AXRb-OAQ}{GFGDoa4bSCS1c1ccBx5zvg}{10.0.3.212}{10.0.3.212:9300}{di}{shard_indexing_pressure_enabled=true} already replicating
        at org.opensearch.indices.replication.OngoingSegmentReplications.prepareForReplication(OngoingSegmentReplications.java:147) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:134) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:115) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) ~[opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.6.0.jar:2.6.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:833) ~[?:?]

nyc_taxis[14] then switches to the STARTED state but holds far less data than the primary.

➜  opensearch-cluster-cdk git:(inclusive_naming_changes) ✗ curl -s -X GET "$host/_cat/shards?v"                                                             
index     shard prirep state           docs store ip         node
...
nyc_taxis 14    r      INITIALIZING               10.0.3.212 ip-10-0-3-212.us-west-2.compute.internal
...
nyc_taxis 14    r      STARTED       980147 204.6mb 10.0.3.212 ip-10-0-3-212.us-west-2.compute.internal
nyc_taxis 14    p      STARTED      6890004   1.7gb 10.0.4.214 ip-10-0-4-214.us-west-2.compute.internal
➜  opensearch-cluster-cdk git:(inclusive_naming_changes) ✗ curl -s -X GET "$host/_cat/segments?v"  | grep " 14 "
nyc_taxis 14    r      10.0.3.212 _1s             64     246944            0   51.7mb           0 true      true       9.4.2   false
nyc_taxis 14    r      10.0.3.212 _43            147     384796            0     81mb           0 true      true       9.4.2   false
nyc_taxis 14    r      10.0.3.212 _46            150      10425            0    2.1mb           0 true      true       9.4.2   true
nyc_taxis 14    r      10.0.3.212 _4e            158      38342            0    7.8mb           0 true      true       9.4.2   true
nyc_taxis 14    r      10.0.3.212 _4q            170      60419            0   12.4mb           0 true      true       9.4.2   true
nyc_taxis 14    r      10.0.3.212 _50            180      42664            0    8.7mb           0 true      true       9.4.2   true
nyc_taxis 14    r      10.0.3.212 _59            189      36522            0    7.5mb           0 true      true       9.4.2   true
nyc_taxis 14    r      10.0.3.212 _5k            200      21388            0    4.4mb           0 true      true       9.4.2   true
nyc_taxis 14    r      10.0.3.212 _5u            210      29155            0      6mb           0 true      true       9.4.2   true
nyc_taxis 14    r      10.0.3.212 _64            220      35870            0    7.4mb           0 true      true       9.4.2   true
nyc_taxis 14    r      10.0.3.212 _6e            230      40824            0    8.3mb           0 true      true       9.4.2   true
nyc_taxis 14    r      10.0.3.212 _6p            241      32798            0    6.7mb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _ac            372     483388            0  101.3mb           0 true      true       9.4.2   false
nyc_taxis 14    p      10.0.4.214 _ol            885    3014538            0  627.3mb           0 true      true       9.4.2   false
nyc_taxis 14    p      10.0.4.214 _rn            995     446877            0   93.4mb           0 true      true       9.4.2   false
nyc_taxis 14    p      10.0.4.214 _uf           1095     253889            0   53.2mb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _wx           1185     220814            0   46.2mb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _zg           1276     195674            0   41.1mb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _11p          1357     475241            0   99.3mb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _14g          1456     443813            0   92.9mb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _17k          1568     462023            0   96.5mb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1ab          1667     412496            0   86.4mb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1cv          1759       4066            0  890.8kb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1cw          1760       3805            0  829.4kb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1cx          1761       4502            0  971.9kb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1cy          1762       5041            0      1mb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1cz          1763       3804            0  829.3kb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1d0          1764       5006            0      1mb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1d1          1765       4553            0  993.4kb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1d2          1766       4404            0    951kb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1d4          1768       3692            0  795.7kb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1d5          1769     415671            0   87.6mb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1d6          1770       2904            0  631.3kb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1d7          1771       4164            0  895.6kb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1d8          1772       3346            0  719.3kb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1d9          1773       4176            0  905.4kb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1da          1774       4519            0  971.9kb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1db          1775       4105            0  875.3kb           0 true      true       9.4.2   true
nyc_taxis 14    p      10.0.4.214 _1dc          1776       3493            0  735.4kb           0 true      true       9.4.2   true

Stuck recovery

Peer recovery stuck in the translog stage even though all operations are 100% complete. This might be happening because we were previously recovering translog operations only up to the global checkpoint rather than all available operations. Fixed in #6366.

[2023-03-07T19:35:41,924][WARN ][o.o.i.c.IndicesClusterStateService] [ip-10-0-4-214.us-west-2.compute.internal] [nyc_taxis][0] marking and sending shard failed due to [failed recovery]
org.opensearch.indices.recovery.RecoveryFailedException: [nyc_taxis][0]: Recovery failed from {ip-10-0-3-212.us-west-2.compute.internal}{gG95UUzJTfiBg9AXRb-OAQ}{GFGDoa4bSCS1c1ccBx5zvg}{10.0.3.212}{10.0.3.212:9300}{di}{shard_indexing_pressure_enabled=true} into {ip-10-0-4-214.us-west-2.compute.internal}{uuGXp5y-QfSfiD9TMG1ELw}{dsvHBBZrSyiDU_mJJEGv4Q}{10.0.4.214}{10.0.4.214:9300}{di}{shard_indexing_pressure_enabled=true} (no activity after [30m])
        at org.opensearch.indices.recovery.RecoveryTarget.notifyListener(RecoveryTarget.java:139) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.common.ReplicationTarget.fail(ReplicationTarget.java:176) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.common.ReplicationCollection.fail(ReplicationCollection.java:194) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.common.ReplicationCollection$ReplicationMonitor.doRun(ReplicationCollection.java:300) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.6.0.jar:2.6.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.OpenSearchTimeoutException: no activity after [30m]
        ... 6 more
[2023-03-07T19:36:00,498][WARN ][o.o.i.c.IndicesClusterStateService] [ip-10-0-4-214.us-west-2.compute.internal] [nyc_taxis][12] marking and sending shard failed due to [failed recovery]
org.opensearch.indices.recovery.RecoveryFailedException: [nyc_taxis][12]: Recovery failed from {ip-10-0-3-212.us-west-2.compute.internal}{gG95UUzJTfiBg9AXRb-OAQ}{GFGDoa4bSCS1c1ccBx5zvg}{10.0.3.212}{10.0.3.212:9300}{di}{shard_indexing_pressure_enabled=true} into {ip-10-0-4-214.us-west-2.compute.internal}{uuGXp5y-QfSfiD9TMG1ELw}{dsvHBBZrSyiDU_mJJEGv4Q}{10.0.4.214}{10.0.4.214:9300}{di}{shard_indexing_pressure_enabled=true} (no activity after [30m])
        at org.opensearch.indices.recovery.RecoveryTarget.notifyListener(RecoveryTarget.java:139) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.common.ReplicationTarget.fail(ReplicationTarget.java:176) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.common.ReplicationCollection.fail(ReplicationCollection.java:194) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.common.ReplicationCollection$ReplicationMonitor.doRun(ReplicationCollection.java:300) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.6.0.jar:2.6.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.OpenSearchTimeoutException: no activity after [30m]
        ... 6 more
[2023-03-07T19:36:00,568][WARN ][o.o.i.c.IndicesClusterStateService] [ip-10-0-4-214.us-west-2.compute.internal] [nyc_taxis][6] marking and sending shard failed due to [failed recovery]
org.opensearch.indices.recovery.RecoveryFailedException: [nyc_taxis][6]: Recovery failed from {ip-10-0-3-212.us-west-2.compute.internal}{gG95UUzJTfiBg9AXRb-OAQ}{GFGDoa4bSCS1c1ccBx5zvg}{10.0.3.212}{10.0.3.212:9300}{di}{shard_indexing_pressure_enabled=true} into {ip-10-0-4-214.us-west-2.compute.internal}{uuGXp5y-QfSfiD9TMG1ELw}{dsvHBBZrSyiDU_mJJEGv4Q}{10.0.4.214}{10.0.4.214:9300}{di}{shard_indexing_pressure_enabled=true} (no activity after [30m])
        at org.opensearch.indices.recovery.RecoveryTarget.notifyListener(RecoveryTarget.java:139) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.common.ReplicationTarget.fail(ReplicationTarget.java:176) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.common.ReplicationCollection.fail(ReplicationCollection.java:194) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.common.ReplicationCollection$ReplicationMonitor.doRun(ReplicationCollection.java:300) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.6.0.jar:2.6.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.OpenSearchTimeoutException: no activity after [30m]
        ... 6 more
[2023-03-07T19:36:00,607][WARN ][o.o.i.c.IndicesClusterStateService] [ip-10-0-4-214.us-west-2.compute.internal] [nyc_taxis][18] marking and sending shard failed due to [failed recovery]
org.opensearch.indices.recovery.RecoveryFailedException: [nyc_taxis][18]: Recovery failed from {ip-10-0-3-212.us-west-2.compute.internal}{gG95UUzJTfiBg9AXRb-OAQ}{GFGDoa4bSCS1c1ccBx5zvg}{10.0.3.212}{10.0.3.212:9300}{di}{shard_indexing_pressure_enabled=true} into {ip-10-0-4-214.us-west-2.compute.internal}{uuGXp5y-QfSfiD9TMG1ELw}{dsvHBBZrSyiDU_mJJEGv4Q}{10.0.4.214}{10.0.4.214:9300}{di}{shard_indexing_pressure_enabled=true} (no activity after [30m])
        at org.opensearch.indices.recovery.RecoveryTarget.notifyListener(RecoveryTarget.java:139) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.common.ReplicationTarget.fail(ReplicationTarget.java:176) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.common.ReplicationCollection.fail(ReplicationCollection.java:194) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.indices.replication.common.ReplicationCollection$ReplicationMonitor.doRun(ReplicationCollection.java:300) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) [opensearch-2.6.0.jar:2.6.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.6.0.jar:2.6.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.OpenSearchTimeoutException: no activity after [30m]
➜  opensearch-cluster-cdk git:(inclusive_naming_changes) ✗ curl -s -X GET "$host/_cat/shards?v"                                                             
index     shard prirep state           docs store ip         node
...
nyc_taxis 12    r      INITIALIZING               10.0.4.214 ip-10-0-4-214.us-west-2.compute.internal
nyc_taxis 6     r      INITIALIZING               10.0.4.214 ip-10-0-4-214.us-west-2.compute.internal
nyc_taxis 18    r      INITIALIZING               10.0.4.214 ip-10-0-4-214.us-west-2.compute.internal
nyc_taxis 0     r      INITIALIZING               10.0.4.214 ip-10-0-4-214.us-west-2.compute.internal
...
➜  opensearch-cluster-cdk git:(inclusive_naming_changes) ✗ curl -s -X GET "$host/_cat/recovery?active_only=true&v=true" 
index     shard time  type stage    source_host source_node                              target_host target_node                              repository snapshot files files_recovered files_percent files_total bytes     bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
nyc_taxis 0     24.4m peer translog 10.0.3.212  ip-10-0-3-212.us-west-2.compute.internal 10.0.4.214  ip-10-0-4-214.us-west-2.compute.internal n/a        n/a      92    92              100.0%        92          124655858 124655858       100.0%        124655858   192573       192573                 100.0%
nyc_taxis 6     24.1m peer translog 10.0.3.212  ip-10-0-3-212.us-west-2.compute.internal 10.0.4.214  ip-10-0-4-214.us-west-2.compute.internal n/a        n/a      104   104             100.0%        104         123827602 123827602       100.0%        123827602   246428       246428                 100.0%
nyc_taxis 12    24.1m peer translog 10.0.3.212  ip-10-0-3-212.us-west-2.compute.internal 10.0.4.214  ip-10-0-4-214.us-west-2.compute.internal n/a        n/a      108   108             100.0%        108         123746451 123746451       100.0%        123746451   245567       245567                 100.0%
nyc_taxis 18    24.1m peer translog 10.0.3.212  ip-10-0-3-212.us-west-2.compute.internal 10.0.4.214  ip-10-0-4-214.us-west-2.compute.internal n/a        n/a      58    58              100.0%        58          124230649 124230649       100.0%        124230649   243704       243704                 100.0%

@dreamer-89
Member

Resolving this issue as it is not reproducible on the latest 2.x and main.

@dreamer-89
Member

dreamer-89 commented Mar 13, 2023

Hit this bug today while benchmarking the changes in #6643. Reopening.

Setup: 25 primaries with 1 replica each, on 3 data nodes. The problem happened when stopping the OpenSearch process on the node containing the nyc_taxis[22] primary. This resulted in the replica's promotion to primary, followed by recovery of a new replica onto the third node 10.0.4.133 (here). The recovery stats show 100% completed, but somehow the shard state is not changed to STARTED.

➜  opensearch-cluster-cdk git:(inclusive_naming_changes) ✗ curl -s -X GET "$host/_cat/recovery?active_only=true&v=true" 
index     shard time type stage    source_host source_node                             target_host target_node                              repository snapshot files files_recovered files_percent files_total bytes     bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
nyc_taxis 22    5m   peer translog 10.0.5.39   ip-10-0-5-39.us-west-2.compute.internal 10.0.4.133  ip-10-0-4-133.us-west-2.compute.internal n/a        n/a      70    70              100.0%        70          105104077 105104077       100.0%        105104077   492100       492100                 100.0%
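
(A minimal way to ask the cluster-manager what it thinks of the stuck copy is the allocation explain API; a sketch, assuming the nyc_taxis shard 22 replica from the output above:)

curl -s -X GET "$host/_cluster/allocation/explain" -H 'Content-Type: application/json' -d'
{
  "index": "nyc_taxis",
  "shard": 22,
  "primary": false
}'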

@mch2
Member

mch2 commented May 19, 2023

@dreamer-89

From what you posted it looks like the recovery is still in progress in the "translog" stage. For segrep replica recoveries, the translog step will only write the ops to the xlog.

➜  opensearch-cluster-cdk git:(inclusive_naming_changes) ✗ curl -s -X GET "$host/_cat/recovery?active_only=true&v=true" 
index     shard time type stage    source_host source_node                             target_host target_node                              repository snapshot files files_recovered files_percent files_total bytes     bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
nyc_taxis 22    5m   peer translog 10.0.5.39   ip-10-0-5-39.us-west-2.compute.internal 10.0.4.133  ip-10-0-4-133.us-west-2.compute.internal n/a        n/a      70    70              100.0%        70          105104077 105104077       100.0%        105104077   492100       492100                 100.0%

I haven't hit this bug where a replica never finishes recovery, but I think we have a bug here in that we are attempting to copy more xlog ops than necessary, slowing recovery. The startingSeqNo is always sent as -2 (UNASSIGNED_SEQ_NO) when starting recovery with SR. This triggers a file-based recovery and uses the local checkpoint of the safe commit as the starting point of ops to recover. This means that even if the replica already has those ops in its xlog, they will get sent. I think we can optimize this and use the global checkpoint stored in the local xlog on the replica as the starting point. After the ops are sent over, we still perform a round of segrep to catch up from the safe commit and purge the xlog where necessary.

Edit: With SR we still attempt to recover from the safe commit, so my previous statement is not accurate.

@mch2
Member

mch2 commented Jun 6, 2023

Closing this as not reproducible; please re-open if this happens with OS 2.7+.

@mch2 closed this as completed on Jun 6, 2023