Status: Closed
Labels: :Distributed Coordination/Snapshot/Restore (anything directly related to the `_snapshot/*` APIs), >test-failure (triaged test failures from CI), blocker, v7.2.0, v8.0.0-alpha1
Description
... for some value of the word "reproducibly". After 180 iterations of this command:
./gradlew :server:integTest -Dtests.seed=47534FB84D89BB2E -Dtests.class=org.elasticsearch.repositories.fs.FsBlobStoreRepositoryIT -Dtests.security.manager=true -Dtests.locale=et-EE -Dtests.timezone=Africa/Dar_es_Salaam -Dcompiler.java=11 -Druntime.java=8
I reproduced the failure that occurred here: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+internalClusterTest/989/console. However, I then ran over 1000 iterations of just that one test (the same command line plus -Dtests.method=testSnapshotAndRestore) without a single failure.
The presenting complaint is that the cluster failed to get to green health:
FAILURE 32.7s | FsBlobStoreRepositoryIT.testSnapshotAndRestore <<< FAILURES!
> Throwable #1: java.lang.AssertionError: timed out waiting for green state
> at __randomizedtesting.SeedInfo.seed([47534FB84D89BB2E:87342CCB90464619]:0)
> at org.elasticsearch.test.ESIntegTestCase.ensureColor(ESIntegTestCase.java:975)
> at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:931)
> at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:920)
> at org.elasticsearch.repositories.blobstore.ESBlobStoreRepositoryIntegTestCase.testSnapshotAndRestore(ESBlobStoreRepositoryIntegTestCase.java:136)
> at java.lang.Thread.run(Thread.java:748)
This in turn is because peer recoveries were persistently failing:
1> [2019-02-22T07:40:30,000][WARN ][o.e.i.c.IndicesClusterStateService] [node_s1] [[efr][0]] marking and sending shard failed due to [failed recovery]
1> org.elasticsearch.indices.recovery.RecoveryFailedException: [efr][0]: Recovery failed from {node_s0}{VofCxu6oSmuljC2VrCOgWw}{3PNCN7QaTiS_dJ3Ev-Qyaw}{127.0.0.1}{127.0.0.1:35411} into {node_s1}{aJuFo5EUSCynoCzNS60dnA}{E_NgEcNuR6O-skD7Aaar5Q}{127.0.0.1}{127.0.0.1:34247}
1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.lambda$doRecovery$2(PeerRecoveryTargetService.java:253) ~[main/:?]
1> at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$1.handleException(PeerRecoveryTargetService.java:298) ~[main/:?]
1> at org.elasticsearch.transport.PlainTransportFuture.handleException(PlainTransportFuture.java:97) ~[main/:?]
1> at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1118) ~[main/:?]
1> at org.elasticsearch.transport.TcpTransport.lambda$handleException$24(TcpTransport.java:1001) ~[main/:?]
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [main/:?]
1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
1> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
1> Caused by: org.elasticsearch.transport.RemoteTransportException: [node_s0][127.0.0.1:35411][internal:index/shard/recovery/start_recovery]
1> Caused by: java.lang.IllegalStateException: translog replay failed to cover required sequence numbers (required range [67:67). first missing op is [67]
1> at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$phase2$25(RecoverySourceHandler.java:592) ~[main/:?]
1> at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) ~[main/:?]
1> at org.elasticsearch.indices.recovery.RecoverySourceHandler.sendBatch(RecoverySourceHandler.java:648) ~[main/:?]
1> at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$sendBatch$26(RecoverySourceHandler.java:634) ~[main/:?]
1> at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) ~[main/:?]
1> at org.elasticsearch.indices.recovery.RemoteRecoveryTargetHandler.lambda$indexTranslogOperations$4(RemoteRecoveryTargetHandler.java:133) ~[main/:?]
1> at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) ~[main/:?]
1> at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:54) ~[main/:?]
1> at org.elasticsearch.transport.PlainTransportFuture.handleResponse(PlainTransportFuture.java:87) ~[main/:?]
1> at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1108) ~[main/:?]
1> at org.elasticsearch.transport.TcpTransport$1.doRun(TcpTransport.java:975) ~[main/:?]
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[main/:?]
1> at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[main/:?]
1> ... 3 more
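For context, the exception comes from a sanity check in recovery phase 2: the source must replay every translog operation in the required sequence-number range before recovery can proceed. Note the oddity in the message: a half-open range `[67:67)` is empty, yet op 67 is reported missing, which suggests the range bounds and the coverage check are computed with inconsistent inclusivity somewhere. A minimal sketch of such a coverage check (hypothetical names, not the actual RecoverySourceHandler code):

```java
import java.util.BitSet;

// Illustrative tracker for the phase-2 check that produced the failure above:
// the recovery source records which sequence numbers it has replayed and
// verifies the required half-open range [startSeqNo, endSeqNo) is covered.
public class RequiredSeqNoTracker {
    private final long startSeqNo; // inclusive
    private final long endSeqNo;   // exclusive
    private final BitSet seen = new BitSet();

    public RequiredSeqNoTracker(long startSeqNo, long endSeqNo) {
        this.startSeqNo = startSeqNo;
        this.endSeqNo = endSeqNo;
    }

    public void markSeqNo(long seqNo) {
        if (seqNo >= startSeqNo && seqNo < endSeqNo) {
            seen.set((int) (seqNo - startSeqNo));
        }
    }

    /** Returns the first missing seqNo in the range, or -1 if fully covered. */
    public long firstMissing() {
        long candidate = startSeqNo + seen.nextClearBit(0);
        return candidate < endSeqNo ? candidate : -1;
    }

    public void assertCovered() {
        long missing = firstMissing();
        if (missing != -1) {
            throw new IllegalStateException(
                "translog replay failed to cover required sequence numbers (required range ["
                    + startSeqNo + ":" + endSeqNo + "). first missing op is [" + missing + "]");
        }
    }
}
```

Under this half-open reading, a required range of `[67:67)` would trivially pass, so the logged failure points at the source side computing the range boundaries inconsistently rather than at a genuinely missing operation.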
Here are the full logs from the failure, including some TRACE-level ones: fail.log.gz
/cc @benwtrent - I couldn't find another issue about this but please correct me if I missed it.