[BUG] [Segment Replication] Shard failures on node stop/restart #6578

Closed
dreamer-89 opened this issue Mar 8, 2023 · 0 comments · Fixed by #6660
When a node (say Node A) hosting a primary shard with an ongoing replication leaves the cluster, the replica shard on the target (Node B) fails due to a NodeClosedException. With a single replica, this also turns the cluster red because both the primary (on Node A) and the replica (on Node B) become unassigned. This can be resolved by handling the exception gracefully on the target when the node leaves the cluster.

[2023-03-07T22:11:37,498][ERROR][o.o.i.r.SegmentReplicationTargetService] [ip-10-0-4-54.us-west-2.compute.internal] replication failure
org.opensearch.indices.replication.common.ReplicationFailedException: Segment Replication failed
        at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:365) [opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) [opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:103) [opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) [opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [opensearch-2.7.0.jar:2.7.0]
        at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
        at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) [opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) [opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:82) [opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:62) [opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.action.ActionListener$4.onFailure(ActionListener.java:190) [opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.action.ActionListener$6.onFailure(ActionListener.java:309) [opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.action.support.RetryableAction$RetryingListener.onFinalFailure(RetryableAction.java:218) [opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.action.support.RetryableAction$RetryingListener.onFailure(RetryableAction.java:210) [opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:74) [opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1414) [opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:420) [opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [opensearch-2.7.0.jar:2.7.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.transport.RemoteTransportException: [ip-10-0-5-14.us-west-2.compute.internal][10.0.5.14:9300][internal:index/shard/replication/get_segment_files]
Caused by: org.opensearch.transport.SendRequestTransportException: [ip-10-0-4-54.us-west-2.compute.internal][10.0.4.54:9300][internal:index/shard/replication/file_chunk]
        at org.opensearch.transport.TransportService.sendRequestInternal(TransportService.java:941) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.transport.TransportService.sendRequest(TransportService.java:815) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.transport.TransportService.sendRequest(TransportService.java:758) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.indices.recovery.RetryableTransportClient$1.tryAction(RetryableTransportClient.java:91) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.action.support.RetryableAction$1.doRun(RetryableAction.java:137) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.action.support.RetryableAction.run(RetryableAction.java:115) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.indices.recovery.RetryableTransportClient.executeRetryableAction(RetryableTransportClient.java:106) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.indices.replication.RemoteSegmentFileChunkWriter.writeFileChunk(RemoteSegmentFileChunkWriter.java:117) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.indices.replication.SegmentFileTransferHandler$1.executeChunkRequest(SegmentFileTransferHandler.java:148) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.indices.replication.SegmentFileTransferHandler$1.executeChunkRequest(SegmentFileTransferHandler.java:97) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.indices.recovery.MultiChunkTransfer.handleItems(MultiChunkTransfer.java:149) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.indices.recovery.MultiChunkTransfer$1.write(MultiChunkTransfer.java:98) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.common.util.concurrent.AsyncIOProcessor.processList(AsyncIOProcessor.java:129) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.common.util.concurrent.AsyncIOProcessor.drainAndProcessAndRelease(AsyncIOProcessor.java:117) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.common.util.concurrent.AsyncIOProcessor.put(AsyncIOProcessor.java:98) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.indices.recovery.MultiChunkTransfer.addItem(MultiChunkTransfer.java:109) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.indices.recovery.MultiChunkTransfer.lambda$handleItems$3(MultiChunkTransfer.java:151) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.action.ActionListener$6.onResponse(ActionListener.java:299) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.action.ActionListener$4.onResponse(ActionListener.java:180) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.action.ActionListener$6.onResponse(ActionListener.java:299) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:181) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:69) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1404) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:393) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:387) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) ~[opensearch-2.7.0.jar:2.7.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:833) ~[?:?]
Caused by: org.opensearch.node.NodeClosedException: node closed {ip-10-0-5-14.us-west-2.compute.internal}{TfIe0XASSY-qM9pY1gCmow}{mdJAYWGfSJmU13TV20PaMA}{10.0.5.14}{10.0.5.14:9300}{di}{shard_indexing_pressure_enabled=true}
        at org.opensearch.transport.TransportService.sendRequestInternal(TransportService.java:922) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.transport.TransportService.sendRequest(TransportService.java:815) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.transport.TransportService.sendRequest(TransportService.java:758) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.indices.recovery.RetryableTransportClient$1.tryAction(RetryableTransportClient.java:91) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.action.support.RetryableAction$1.doRun(RetryableAction.java:137) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.action.support.RetryableAction.run(RetryableAction.java:115) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.indices.recovery.RetryableTransportClient.executeRetryableAction(RetryableTransportClient.java:106) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.indices.replication.RemoteSegmentFileChunkWriter.writeFileChunk(RemoteSegmentFileChunkWriter.java:117) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.indices.replication.SegmentFileTransferHandler$1.executeChunkRequest(SegmentFileTransferHandler.java:148) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.indices.replication.SegmentFileTransferHandler$1.executeChunkRequest(SegmentFileTransferHandler.java:97) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.indices.recovery.MultiChunkTransfer.handleItems(MultiChunkTransfer.java:149) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.indices.recovery.MultiChunkTransfer$1.write(MultiChunkTransfer.java:98) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.common.util.concurrent.AsyncIOProcessor.processList(AsyncIOProcessor.java:129) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.common.util.concurrent.AsyncIOProcessor.drainAndProcessAndRelease(AsyncIOProcessor.java:117) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.common.util.concurrent.AsyncIOProcessor.put(AsyncIOProcessor.java:98) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.indices.recovery.MultiChunkTransfer.addItem(MultiChunkTransfer.java:109) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.indices.recovery.MultiChunkTransfer.lambda$handleItems$3(MultiChunkTransfer.java:151) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.action.ActionListener$6.onResponse(ActionListener.java:299) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.action.ActionListener$4.onResponse(ActionListener.java:180) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.action.ActionListener$6.onResponse(ActionListener.java:299) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:181) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:69) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1404) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:393) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:387) ~[opensearch-2.7.0.jar:2.7.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) ~[opensearch-2.7.0.jar:2.7.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:833) ~[?:?]

[2023-03-07T22:11:37,878][INFO ][o.o.c.r.a.AllocationService] [seed] Cluster health status changed from [YELLOW] to [RED] (reason: [shards failed [[nyc_taxis][20], [nyc_taxis][20]]]).

Repro steps

  1. Create a multi-node cluster with a large shard count (an illustrative index-creation sketch follows this list).
  2. Stop the OpenSearch process on one node while heavy indexing is ongoing (the nyc_taxis OpenSearch Benchmark workload works well). The more shards the stopped node holds, the higher the chance that a replication event is in flight; with a single-replica setup this turns the cluster red.
  3. The cluster manager brings the primary up on Node B, which contains the previously copied files.
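
For reference, a minimal sketch of how the index for step 1 might be created with segment replication enabled, assuming the documented 2.x index setting `index.replication.type: SEGMENT` and the Java high-level REST client. The shard count and client wiring are illustrative, not taken from the issue.

```java
// Illustrative repro setup (not from the issue): create the benchmark index with
// segment replication enabled, using the OpenSearch Java high-level REST client.
import java.io.IOException;

import org.opensearch.client.RequestOptions;
import org.opensearch.client.RestHighLevelClient;
import org.opensearch.client.indices.CreateIndexRequest;
import org.opensearch.common.settings.Settings;

public final class SegmentReplicationRepro {

    static void createNycTaxisIndex(RestHighLevelClient client) throws IOException {
        CreateIndexRequest request = new CreateIndexRequest("nyc_taxis")
            .settings(Settings.builder()
                // Large shard count (hypothetical value) raises the odds that at least
                // one replication is in flight when a node is stopped.
                .put("index.number_of_shards", 25)
                // One replica: losing either copy of a shard turns the cluster red.
                .put("index.number_of_replicas", 1)
                // Segment replication index setting as documented for OpenSearch 2.x.
                .put("index.replication.type", "SEGMENT"));
        client.indices().create(request, RequestOptions.DEFAULT);
    }
}
```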

Expected

The replica shard should not be marked as failed on the target when the node containing the primary goes down.
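
One way to meet this expectation, sketched below under stated assumptions: when the target's replication failure listener sees that the failure chain is rooted in the source node leaving the cluster (the NodeClosedException in the log above), it could cancel the ongoing replication instead of failing the shard. The exception types are real OpenSearch classes, but the handler, its callbacks, and the method names are hypothetical stand-ins for the SegmentReplicationTargetService plumbing; this is not necessarily the actual change made in #6660.

```java
// Hypothetical sketch only: classify replication failures on the target and cancel,
// rather than fail the shard, when the primary's node has left the cluster.
import org.opensearch.ExceptionsHelper;
import org.opensearch.node.NodeClosedException;
import org.opensearch.transport.ConnectTransportException;

final class ReplicationFailureClassifier {

    /** True when the failure chain is rooted in the source node leaving the cluster. */
    static boolean causedByNodeDeparture(Exception e) {
        return ExceptionsHelper.unwrap(e, NodeClosedException.class) != null
            || ExceptionsHelper.unwrap(e, ConnectTransportException.class) != null;
    }

    /**
     * Illustrative failure handler. `cancelReplication` and `failShard` are
     * hypothetical stand-ins for the target service's real callbacks.
     */
    static void onReplicationFailure(Exception e, Runnable cancelReplication, Runnable failShard) {
        if (causedByNodeDeparture(e)) {
            // The replica stays assigned; replication restarts once a new primary is elected.
            cancelReplication.run();
        } else {
            // Genuine replication failure: keep the existing fail-shard behavior.
            failShard.run();
        }
    }
}
```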
