[BUG] RemoteStoreIT.testRemoteSegmentStoreRestore test failure #6273

Closed

dreamer-89 opened this issue Feb 9, 2023 · 1 comment
Labels: bug (Something isn't working), untriaged

Comments

@dreamer-89 (Member)

RemoteStoreIT.testRemoteSegmentStoreRestore test failure identified in #6270 (comment). The test fails due to a missing segments file in the store:

IndexShardRecoveryException[Exception while recovering from remote store]; nested: FileNotFoundException[no segments* file found in store(ByteSizeCachingDirectory(HybridDirectory@/var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotestore.RemoteStoreIT_A212B9AEEBA86195-001/tempDir-002/node_t3/nodes/0/indices/LTdVnhQjQCGRCn11Wz6X-A/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@1ca06935)): files: []]; ], markAsStale [true]]
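For reference, the `no segments* file found` error is Lucene's standard failure mode when `SegmentInfos.readLatestCommit` is called on a directory that contains no `segments_N` commit file, which is exactly the state of the shard store here after zero segments were downloaded. A minimal, self-contained sketch against plain Lucene (the `ByteBuffersDirectory` is illustrative only; the failing test uses the `HybridDirectory` named in the message):

```java
import org.apache.lucene.index.IndexNotFoundException;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class EmptyStoreRepro {
    public static void main(String[] args) throws Exception {
        // An empty directory with no segments_N commit file, mirroring the
        // shard store after the restore downloaded zero segments.
        try (Directory dir = new ByteBuffersDirectory()) {
            SegmentInfos.readLatestCommit(dir);
        } catch (IndexNotFoundException e) {
            // Prints "no segments* file found in ...": the same error that
            // Store.readLastCommittedSegmentsInfo surfaces during recovery.
            System.out.println(e.getMessage());
        }
    }
}
```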

Gradle check link: https://build.ci.opensearch.org/job/gradle-check/10936/

Failure trace from the gradle check above:

  1> [2023-02-10T05:03:19,668][INFO ][o.o.c.c.Coordinator      ] [node_t1] cluster-manager node [{node_t0}{nEbHVfTHQGCwqdvtidsCKg}{Dvs5SuOvSvCIxWJDA7Nstg}{127.0.0.1}{127.0.0.1:41329}{m}{shard_indexing_pressure_enabled=true}] failed, restarting discovery
  1> org.opensearch.transport.NodeDisconnectedException: [node_t0][127.0.0.1:41329][disconnected] disconnected
  1> [2023-02-10T05:03:19,669][INFO ][o.o.c.c.FollowersChecker ] [node_t0] FollowerChecker{discoveryNode={node_t1}{m4dKRiBxQ2eh98vCqoRaoA}{lLj5IX3hS-2Tpsg_4pkkog}{127.0.0.1}{127.0.0.1:36295}{d}{shard_indexing_pressure_enabled=true}, failureCountSinceLastSuccess=0, [cluster.fault_detection.follower_check.retry_count]=3} disconnected
  1> [2023-02-10T05:03:19,669][INFO ][o.o.c.c.FollowersChecker ] [node_t0] FollowerChecker{discoveryNode={node_t1}{m4dKRiBxQ2eh98vCqoRaoA}{lLj5IX3hS-2Tpsg_4pkkog}{127.0.0.1}{127.0.0.1:36295}{d}{shard_indexing_pressure_enabled=true}, failureCountSinceLastSuccess=0, [cluster.fault_detection.follower_check.retry_count]=3} marking node as faulty
  1> [2023-02-10T05:03:19,676][INFO ][o.o.c.s.MasterService    ] [node_t0] node-left[{node_t1}{m4dKRiBxQ2eh98vCqoRaoA}{lLj5IX3hS-2Tpsg_4pkkog}{127.0.0.1}{127.0.0.1:36295}{d}{shard_indexing_pressure_enabled=true} reason: disconnected], term: 1, version: 54, delta: removed {{node_t1}{m4dKRiBxQ2eh98vCqoRaoA}{lLj5IX3hS-2Tpsg_4pkkog}{127.0.0.1}{127.0.0.1:36295}{d}{shard_indexing_pressure_enabled=true}}
  1> [2023-02-10T05:03:19,789][INFO ][o.o.c.s.ClusterApplierService] [node_t3] removed {{node_t1}{m4dKRiBxQ2eh98vCqoRaoA}{lLj5IX3hS-2Tpsg_4pkkog}{127.0.0.1}{127.0.0.1:36295}{d}{shard_indexing_pressure_enabled=true}}, term: 1, version: 54, reason: ApplyCommitRequest{term=1, version=54, sourceNode={node_t0}{nEbHVfTHQGCwqdvtidsCKg}{Dvs5SuOvSvCIxWJDA7Nstg}{127.0.0.1}{127.0.0.1:41329}{m}{shard_indexing_pressure_enabled=true}}
  1> [2023-02-10T05:03:19,790][INFO ][o.o.c.s.ClusterApplierService] [node_t2] removed {{node_t1}{m4dKRiBxQ2eh98vCqoRaoA}{lLj5IX3hS-2Tpsg_4pkkog}{127.0.0.1}{127.0.0.1:36295}{d}{shard_indexing_pressure_enabled=true}}, term: 1, version: 54, reason: ApplyCommitRequest{term=1, version=54, sourceNode={node_t0}{nEbHVfTHQGCwqdvtidsCKg}{Dvs5SuOvSvCIxWJDA7Nstg}{127.0.0.1}{127.0.0.1:41329}{m}{shard_indexing_pressure_enabled=true}}
  1> [2023-02-10T05:03:19,791][INFO ][o.o.c.s.ClusterApplierService] [node_t0] removed {{node_t1}{m4dKRiBxQ2eh98vCqoRaoA}{lLj5IX3hS-2Tpsg_4pkkog}{127.0.0.1}{127.0.0.1:36295}{d}{shard_indexing_pressure_enabled=true}}, term: 1, version: 54, reason: Publication{term=1, version=54}
  1> [2023-02-10T05:03:19,913][INFO ][o.o.i.s.RemoteSegmentStoreDirectory] [node_t1] Number of commits in remote segment store=0, lastNMetadataFilesToKeep=10
  1> [2023-02-10T05:03:19,913][ERROR][o.o.i.s.RemoteStoreRefreshListener] [node_t1] Exception in RemoteStoreRefreshListener.afterRefresh()
  1> org.apache.lucene.store.AlreadyClosedException: engine is closed
  1> 	at org.opensearch.index.shard.IndexShard.getEngine(IndexShard.java:3194) ~[main/:?]
  1> 	at org.opensearch.index.shard.IndexShard.getSegmentInfosSnapshot(IndexShard.java:4473) ~[main/:?]
  1> 	at org.opensearch.index.shard.RemoteStoreRefreshListener.afterRefresh(RemoteStoreRefreshListener.java:107) [main/:?]
  1> 	at org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed(ReferenceManager.java:275) [lucene-core-9.5.0.jar:9.5.0 13803aa6ea7fee91f798cfeded4296182ac43a21 - 2023-01-25 16:44:59]
  1> 	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:182) [lucene-core-9.5.0.jar:9.5.0 13803aa6ea7fee91f798cfeded4296182ac43a21 - 2023-01-25 16:44:59]
  1> 	at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240) [lucene-core-9.5.0.jar:9.5.0 13803aa6ea7fee91f798cfeded4296182ac43a21 - 2023-01-25 16:44:59]
  1> 	at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1843) [main/:?]
  1> 	at org.opensearch.index.engine.InternalEngine.flush(InternalEngine.java:1972) [main/:?]
  1> 	at org.opensearch.index.engine.Engine.flush(Engine.java:1222) [main/:?]
  1> 	at org.opensearch.index.engine.Engine.flushAndClose(Engine.java:1982) [main/:?]
  1> 	at org.opensearch.index.shard.IndexShard.close(IndexShard.java:1780) [main/:?]
  1> 	at org.opensearch.index.IndexService.closeShard(IndexService.java:607) [main/:?]
  1> 	at org.opensearch.index.IndexService.removeShard(IndexService.java:587) [main/:?]
  1> 	at org.opensearch.index.IndexService.close(IndexService.java:373) [main/:?]
  1> 	at org.opensearch.indices.IndicesService.removeIndex(IndicesService.java:1047) [main/:?]
  1> 	at org.opensearch.indices.IndicesService.lambda$doStop$4(IndicesService.java:550) [main/:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
  1> 	at java.lang.Thread.run(Thread.java:833) [?:?]
  1> [2023-02-10T05:03:19,971][INFO ][o.o.n.Node               ] [testRemoteSegmentStoreRestore] stopped
  1> [2023-02-10T05:03:19,971][INFO ][o.o.n.Node               ] [testRemoteSegmentStoreRestore] closing ...
  1> [2023-02-10T05:03:19,986][INFO ][o.o.n.Node               ] [testRemoteSegmentStoreRestore] closed
  1> [2023-02-10T05:03:19,991][INFO ][o.o.c.m.MetadataIndexStateService] [node_t0] closing indices [remote-store-test-idx-1/LTdVnhQjQCGRCn11Wz6X-A]
  1> [2023-02-10T05:03:20,029][INFO ][o.o.c.m.MetadataIndexStateService] [node_t0] completed closing of indices [remote-store-test-idx-1]
  1> [2023-02-10T05:03:20,357][INFO ][o.o.p.PluginsService     ] [node_t3] PluginService:onIndexModule index:[remote-store-test-idx-1/LTdVnhQjQCGRCn11Wz6X-A]
  1> [2023-02-10T05:03:20,405][INFO ][o.o.i.s.RemoteSegmentStoreDirectory] [node_t3] No metadata file found, this can happen for new index with no data uploaded to remote segment store
  1> [2023-02-10T05:03:20,486][INFO ][o.o.i.s.IndexShard       ] [node_t3] [remote-store-test-idx-1][0] Downloading segments from remote segment store
  1> [2023-02-10T05:03:20,487][INFO ][o.o.i.s.RemoteSegmentStoreDirectory] [node_t3] No metadata file found, this can happen for new index with no data uploaded to remote segment store
  1> [2023-02-10T05:03:20,488][INFO ][o.o.i.s.IndexShard       ] [node_t3] [remote-store-test-idx-1][0] Downloaded segments: []
  1> [2023-02-10T05:03:20,488][INFO ][o.o.i.s.IndexShard       ] [node_t3] [remote-store-test-idx-1][0] Skipped download for segments: []
  1> [2023-02-10T05:03:20,491][WARN ][o.o.i.c.IndicesClusterStateService] [node_t3] [remote-store-test-idx-1][0] marking and sending shard failed due to [failed recovery]
  1> org.opensearch.indices.recovery.RecoveryFailedException: [remote-store-test-idx-1][0]: Recovery failed on {node_t3}{9YXi7x8sR72ZMK-lIyM1rg}{LIBVr4ulTjCPF8r9ozB2-w}{127.0.0.1}{127.0.0.1:35541}{d}{shard_indexing_pressure_enabled=true}
  1> 	at org.opensearch.index.shard.IndexShard.lambda$executeRecovery$32(IndexShard.java:3341) [main/:?]
  1> 	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) [main/:?]
  1> 	at org.opensearch.index.shard.StoreRecovery.lambda$recoveryListener$7(StoreRecovery.java:428) [main/:?]
  1> 	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) [main/:?]
  1> 	at org.opensearch.action.ActionListener.completeWith(ActionListener.java:345) [main/:?]
  1> 	at org.opensearch.index.shard.StoreRecovery.recoverFromRemoteStore(StoreRecovery.java:125) [main/:?]
  1> 	at org.opensearch.index.shard.IndexShard.restoreFromRemoteStore(IndexShard.java:2448) [main/:?]
  1> 	at org.opensearch.index.shard.IndexShard.lambda$startRecovery$27(IndexShard.java:3245) [main/:?]
  1> 	at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:88) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [main/:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
  1> 	at java.lang.Thread.run(Thread.java:833) [?:?]
  1> Caused by: org.opensearch.index.shard.IndexShardRecoveryException: Exception while recovering from remote store
  1> 	at org.opensearch.index.shard.StoreRecovery.recoverFromRemoteStore(StoreRecovery.java:473) ~[main/:?]
  1> 	at org.opensearch.index.shard.StoreRecovery.lambda$recoverFromRemoteStore$1(StoreRecovery.java:127) ~[main/:?]
  1> 	at org.opensearch.action.ActionListener.completeWith(ActionListener.java:342) ~[main/:?]
  1> 	... 9 more
  1> Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in store(ByteSizeCachingDirectory(HybridDirectory@/var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotestore.RemoteStoreIT_A212B9AEEBA86195-001/tempDir-002/node_t3/nodes/0/indices/LTdVnhQjQCGRCn11Wz6X-A/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@1ca06935)): files: []
  1> 	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:804) ~[lucene-core-9.5.0.jar:9.5.0 13803aa6ea7fee91f798cfeded4296182ac43a21 - 2023-01-25 16:44:59]
  1> 	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:760) ~[lucene-core-9.5.0.jar:9.5.0 13803aa6ea7fee91f798cfeded4296182ac43a21 - 2023-01-25 16:44:59]
  1> 	at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:538) ~[lucene-core-9.5.0.jar:9.5.0 13803aa6ea7fee91f798cfeded4296182ac43a21 - 2023-01-25 16:44:59]
  1> 	at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:528) ~[lucene-core-9.5.0.jar:9.5.0 13803aa6ea7fee91f798cfeded4296182ac43a21 - 2023-01-25 16:44:59]
  1> 	at org.opensearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:166) ~[main/:?]
  1> 	at org.opensearch.index.store.Store.readSegmentsInfo(Store.java:245) ~[main/:?]
  1> 	at org.opensearch.index.store.Store.readLastCommittedSegmentsInfo(Store.java:227) ~[main/:?]
  1> 	at org.opensearch.index.store.Store.bootstrapNewHistory(Store.java:1746) ~[main/:?]
  1> 	at org.opensearch.index.shard.StoreRecovery.bootstrap(StoreRecovery.java:660) ~[main/:?]
  1> 	at org.opensearch.index.shard.StoreRecovery.recoverFromRemoteStore(StoreRecovery.java:463) ~[main/:?]
  1> 	at org.opensearch.index.shard.StoreRecovery.lambda$recoverFromRemoteStore$1(StoreRecovery.java:127) ~[main/:?]
  1> 	at org.opensearch.action.ActionListener.completeWith(ActionListener.java:342) ~[main/:?]
  1> 	... 9 more
  1> [2023-02-10T05:03:20,502][WARN ][o.o.c.r.a.AllocationService] [node_t0] failing shard [failed shard, shard [remote-store-test-idx-1][0], node[9YXi7x8sR72ZMK-lIyM1rg], [P], recovery_source[remote store recovery [aJM8g6RaSVeJkjxplDe2uQ]], s[INITIALIZING], a[id=ItwpK7MgSCCV_Bx9Nziylw], unassigned_info[[reason=EXISTING_INDEX_RESTORED], at[2023-02-09T17:03:20.177Z], delayed=false, details[restore_source[remote_store]], allocation_status[no_attempt]], message [failed recovery], failure [RecoveryFailedException[[remote-store-test-idx-1][0]: Recovery failed on {node_t3}{9YXi7x8sR72ZMK-lIyM1rg}{LIBVr4ulTjCPF8r9ozB2-w}{127.0.0.1}{127.0.0.1:35541}{d}{shard_indexing_pressure_enabled=true}]; nested: IndexShardRecoveryException[Exception while recovering from remote store]; nested: FileNotFoundException[no segments* file found in store(ByteSizeCachingDirectory(HybridDirectory@/var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.remotestore.RemoteStoreIT_A212B9AEEBA86195-001/tempDir-002/node_t3/nodes/0/indices/LTdVnhQjQCGRCn11Wz6X-A/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@1ca06935)): files: []]; ], markAsStale [true]]
  1> org.opensearch.indices.recovery.RecoveryFailedException: [remote-store-test-idx-1][0]: Recovery failed on {node_t3}{9YXi7x8sR72ZMK-lIyM1rg}{LIBVr4ulTjCPF8r9ozB2-w}{127.0.0.1}{127.0.0.1:35541}{d}{shard_indexing_pressure_enabled=true}
  1> 	at org.opensearch.index.shard.IndexShard.lambda$executeRecovery$32(IndexShard.java:3341) ~[main/:?]
  1> 	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) ~[main/:?]
  1> 	at org.opensearch.index.shard.StoreRecovery.lambda$recoveryListener$7(StoreRecovery.java:428) ~[main/:?]
  1> 	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) ~[main/:?]
  1> 	at org.opensearch.action.ActionListener.completeWith(ActionListener.java:345) ~[main/:?]
  1> 	at org.opensearch.index.shard.StoreRecovery.recoverFromRemoteStore(StoreRecovery.java:125) ~[main/:?]
  1> 	at org.opensearch.index.shard.IndexShard.restoreFromRemoteStore(IndexShard.java:2448) ~[main/:?]
  1> 	at org.opensearch.index.shard.IndexShard.lambda$startRecovery$27(IndexShard.java:3245) ~[main/:?]
  1> 	at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:88) ~[main/:?]
  1> 	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) ~[main/:?]
  1> 	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[main/:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
  1> 	at java.lang.Thread.run(Thread.java:833) [?:?]
  1> Caused by: org.opensearch.index.shard.IndexShardRecoveryException: Exception while recovering from remote store
  1> 	at org.opensearch.index.shard.StoreRecovery.recoverFromRemoteStore(StoreRecovery.java:473) ~[main/:?]
  1> 	at org.opensearch.index.shard.StoreRecovery.lambda$recoverFromRemoteStore$1(StoreRecovery.java:127) ~[main/:?]
  1> 	at org.opensearch.action.ActionListener.completeWith(ActionListener.java:342) ~[main/:?]
  1> 	... 9 more
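The `AlreadyClosedException` from `RemoteStoreRefreshListener.afterRefresh()` earlier in the trace shows why the remote store ended up empty: per the stack trace, the listener is registered on the engine's Lucene `ReferenceManager` and fires during the final refresh of `flushAndClose`, at which point the engine is already closed, so the segment upload never happens. A minimal sketch of that callback mechanism using plain Lucene's `SearcherManager` (the listener body is a hypothetical stand-in; the real listener uploads segments to the remote segment store):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.ReferenceManager;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class RefreshListenerSketch {
    public static void main(String[] args) throws Exception {
        try (Directory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
            SearcherManager manager = new SearcherManager(writer, null);
            manager.addListener(new ReferenceManager.RefreshListener() {
                @Override
                public void beforeRefresh() {}

                @Override
                public void afterRefresh(boolean didRefresh) {
                    // RemoteStoreRefreshListener does its segment upload here;
                    // if the shard's engine is already closed, it throws
                    // AlreadyClosedException, as in the trace above, and no
                    // segments reach the remote store.
                    System.out.println("afterRefresh, didRefresh=" + didRefresh);
                }
            });
            writer.addDocument(new Document());
            manager.maybeRefreshBlocking(); // triggers the afterRefresh callback
            manager.close();
        }
    }
}
```

That failed upload is consistent with the restore on node_t3 later finding no metadata file and an empty segment list, which then produces the `IndexNotFoundException` above.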
dreamer-89 added the bug (Something isn't working) and untriaged labels on Feb 9, 2023
@dreamer-89 (Member, Author)

Duplicate of #6248 and resolved in #6252
