FsBlobStoreRepositoryIT#testSnapshotAndRestore fails reproducibly #39299

@DaveCTurner

... for some value of the word "reproducibly". After 180 iterations of this command:

./gradlew :server:integTest -Dtests.seed=47534FB84D89BB2E -Dtests.class=org.elasticsearch.repositories.fs.FsBlobStoreRepositoryIT -Dtests.security.manager=true -Dtests.locale=et-EE -Dtests.timezone=Africa/Dar_es_Salaam -Dcompiler.java=11 -Druntime.java=8

I reproduced the failure that occurred here: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+internalClusterTest/989/console. However, I ran over 1000 iterations of just that one test (same command line plus -Dtests.method=testSnapshotAndRestore) without a single failure.
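
For reference, a loop along these lines will repeat the command until it fails (a sketch, not the exact invocation I used; the iteration count and per-iteration log paths are just placeholders):

    #!/usr/bin/env bash
    # Re-run the failing suite until it fails or the iteration budget runs out.
    # Iteration count and log file names are arbitrary placeholders.
    set -u
    for i in $(seq 1 200); do
      echo "=== iteration $i ==="
      if ! ./gradlew :server:integTest \
          -Dtests.seed=47534FB84D89BB2E \
          -Dtests.class=org.elasticsearch.repositories.fs.FsBlobStoreRepositoryIT \
          -Dtests.security.manager=true \
          -Dtests.locale=et-EE \
          -Dtests.timezone=Africa/Dar_es_Salaam \
          -Dcompiler.java=11 -Druntime.java=8 > "iter-$i.log" 2>&1; then
        echo "failed on iteration $i, see iter-$i.log"
        break
      fi
    done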

The presenting complaint is that the cluster failed to reach green health:

FAILURE 32.7s | FsBlobStoreRepositoryIT.testSnapshotAndRestore <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: timed out waiting for green state
   >    at __randomizedtesting.SeedInfo.seed([47534FB84D89BB2E:87342CCB90464619]:0)
   >    at org.elasticsearch.test.ESIntegTestCase.ensureColor(ESIntegTestCase.java:975)
   >    at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:931)
   >    at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:920)
   >    at org.elasticsearch.repositories.blobstore.ESBlobStoreRepositoryIntegTestCase.testSnapshotAndRestore(ESBlobStoreRepositoryIntegTestCase.java:136)
   >    at java.lang.Thread.run(Thread.java:748)

This, in turn, is because peer recoveries were persistently failing:

  1> [2019-02-22T07:40:30,000][WARN ][o.e.i.c.IndicesClusterStateService] [node_s1] [[efr][0]] marking and sending shard failed due to [failed recovery]
  1> org.elasticsearch.indices.recovery.RecoveryFailedException: [efr][0]: Recovery failed from {node_s0}{VofCxu6oSmuljC2VrCOgWw}{3PNCN7QaTiS_dJ3Ev-Qyaw}{127.0.0.1}{127.0.0.1:35411} into {node_s1}{aJuFo5EUSCynoCzNS60dnA}{E_NgEcNuR6O-skD7Aaar5Q}{127.0.0.1}{127.0.0.1:34247}
  1>    at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.lambda$doRecovery$2(PeerRecoveryTargetService.java:253) ~[main/:?]
  1>    at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$1.handleException(PeerRecoveryTargetService.java:298) ~[main/:?]
  1>    at org.elasticsearch.transport.PlainTransportFuture.handleException(PlainTransportFuture.java:97) ~[main/:?]
  1>    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1118) ~[main/:?]
  1>    at org.elasticsearch.transport.TcpTransport.lambda$handleException$24(TcpTransport.java:1001) ~[main/:?]
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [main/:?]
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
  1>    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
  1> Caused by: org.elasticsearch.transport.RemoteTransportException: [node_s0][127.0.0.1:35411][internal:index/shard/recovery/start_recovery]
  1> Caused by: java.lang.IllegalStateException: translog replay failed to cover required sequence numbers (required range [67:67). first missing op is [67]
  1>    at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$phase2$25(RecoverySourceHandler.java:592) ~[main/:?]
  1>    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) ~[main/:?]
  1>    at org.elasticsearch.indices.recovery.RecoverySourceHandler.sendBatch(RecoverySourceHandler.java:648) ~[main/:?]
  1>    at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$sendBatch$26(RecoverySourceHandler.java:634) ~[main/:?]
  1>    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) ~[main/:?]
  1>    at org.elasticsearch.indices.recovery.RemoteRecoveryTargetHandler.lambda$indexTranslogOperations$4(RemoteRecoveryTargetHandler.java:133) ~[main/:?]
  1>    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) ~[main/:?]
  1>    at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:54) ~[main/:?]
  1>    at org.elasticsearch.transport.PlainTransportFuture.handleResponse(PlainTransportFuture.java:87) ~[main/:?]
  1>    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1108) ~[main/:?]
  1>    at org.elasticsearch.transport.TcpTransport$1.doRun(TcpTransport.java:975) ~[main/:?]
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[main/:?]
  1>    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[main/:?]
  1>    ... 3 more

Here are the full logs from the failure, including some TRACE-level ones: fail.log.gz
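
For anyone digging into the attached log, something like the following pulls out just the relevant failures without wading through the TRACE output (a sketch; it assumes zgrep is available, and the patterns are simply the strings from the traces above):

    # Extract the peer-recovery failures and the translog-replay error from the attached log.
    zgrep -n -e 'RecoveryFailedException' \
             -e 'translog replay failed to cover required sequence numbers' \
             fail.log.gz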

/cc @benwtrent - I couldn't find another issue about this, but please correct me if I missed it.
