Description
Elasticsearch version (bin/elasticsearch --version
): Version: 7.7.0, Build: default/deb/81a1e9eda8e6183f5237786246f6dced26a10eaf/2020-05-12T02:01:37.602180Z, JVM: 14
Plugins installed: []
JVM version (java -version
): (Elasticsearch bundled) openjdk 14 2020-03-17
OS version (uname -a
if on a Unix-like system): Debian 9.12, Linux elastic-n1 4.9.0-12-amd64 1 SMP Debian 4.9.210-1 (2020-01-20) x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
At some point a few indexes just didn't want to recover anymore resulting in data loss. I wouldn't want that to happen.
Total of 24 shards in 4 indices failed. All indices were system indices (.ml-state, .ml-notifications, .ml-anomalies-shared, .logstash) and most really small (~1MB, ml-anomalies-shared 250MB)
Steps to reproduce:
Not entirely sure how it happened and when exactly. Might have been caused by earlier corruption or cluster problems.
I had to delete the affected indices to get reporting working. I did take copies of the files before deletion if anybody wants to take a look at them.
(Restored from snapshot) .logstash index reports version.created: 6010199.
Timeline
19:10:01 elastic-n4 automated updates for bind9 cause Debian to toggle interface down & up. This results in DHCP giving new address [which is really a facepalm]
19:10:34 elastic-n4 (master) lost communications with rest of the cluster
19:10:36 elastic-n1 elected as new master
19:14:47 - 19:16:38 elastic-n1 busyloops (30/sec) monitoring exporter failures due to elastic-n2 queue full
19:15:22 - 19:16:39 elastic-n3 reports master lost and elastic-n1 election twice
19:25:45 elastic-n4 starts spamming elastic-n1 with join requests to wrong IP
20:14:41 restarted elastic-n4; communications finally recover
20:25:11 .ml-state recovery from n1 to n4 fails (Commit list must not empty); index yellow
21:36:56 restarted elastic-n1
21:54:39 .ml-state recovery fails on n1 (Commit list must not empty); index red
I did /_flush/synced at some point but don't remember when. Likely before elastic-n4 restart
Provide logs (if relevant):
[2020-05-22T20:25:11,075][WARN ][o.e.c.r.a.AllocationService] [elastic-n1.lan.vilant.com] failing shard [failed shard, shard [.ml-state][2], node[p8qQrkI8Q7ag7Ddt7o6myQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=xE6ysNXISBust7s1fw7ZMg], unassigned_info[[reason=NODE_LEFT], at[2020-05-22T16:11:36.297Z], delayed=false, details[node_left [p8qQrkI8Q7ag7Ddt7o6myQ]], allocation_status[no_attempt]], message [failed recovery], failure [RecoveryFailedException[[.ml-state][2]: Recovery failed from {elastic-n1.lan.vilant.com}{mIA4_vOvRPSHiJMWkpU-jw}{F8yJhCiXSz6w2YmK_wFQEg}{10.6.1.175}{10.6.1.175:9300}{dilmrt}{rack_id=Espoo_1, ml.machine_memory=14690267136, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} into {elastic-n4.lan.vilant.com}{p8qQrkI8Q7ag7Ddt7o6myQ}{ymEcasqPSOqfaozi3W70SA}{10.6.0.162}{10.6.0.162:9300}{dilmrt}{rack_id=Espoo_1, ml.machine_memory=21017317376, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elastic-n1.lan.vilant.com][10.6.1.175:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: RemoteTransportException[[elastic-n4.lan.vilant.com][10.6.0.162:9300][internal:index/shard/recovery/prepare_translog]]; nested: IllegalArgumentException[Commit list must not empty]; ], markAsStale [true]]
org.elasticsearch.indices.recovery.RecoveryFailedException: [.ml-state][2]: Recovery failed from {elastic-n1.lan.vilant.com}{mIA4_vOvRPSHiJMWkpU-jw}{F8yJhCiXSz6w2YmK_wFQEg}{10.6.1.175}{10.6.1.175:9300}{dilmrt}{rack_id=Espoo_1, ml.machine_memory=14690267136, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} into {elastic-n4.lan.vilant.com}{p8qQrkI8Q7ag7Ddt7o6myQ}{ymEcasqPSOqfaozi3W70SA}{10.6.0.162}{10.6.0.162:9300}{dilmrt}{rack_id=Espoo_1, ml.machine_memory=21017317376, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.lambda$doRecovery$2(PeerRecoveryTargetService.java:247) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$1.handleException(PeerRecoveryTargetService.java:292) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.transport.PlainTransportFuture.handleException(PlainTransportFuture.java:97) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1139) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1139) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:244) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) ~[elasticsearch-7.7.0.jar:7.7.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [elastic-n1.lan.vilant.com][10.6.1.175:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] prepare target for translog failed
at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$prepareTargetForTranslog$32(RecoverySourceHandler.java:626) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.action.ActionListener$4.onFailure(ActionListener.java:173) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.transport.PlainTransportFuture.handleException(PlainTransportFuture.java:97) ~[elasticsearch-7.7.0.jar:7.7.0]
... 6 more
Caused by: org.elasticsearch.transport.RemoteTransportException: [elastic-n4.lan.vilant.com][10.6.0.162:9300][internal:index/shard/recovery/prepare_translog]
Caused by: java.lang.IllegalArgumentException: Commit list must not empty
at org.elasticsearch.index.engine.CombinedDeletionPolicy.findSafeCommitPoint(CombinedDeletionPolicy.java:182) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.index.store.Store.trimUnsafeCommits(Store.java:1524) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.index.engine.InternalEngine.trimUnsafeCommits(InternalEngine.java:2865) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:227) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:200) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1625) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.index.shard.IndexShard.openEngineAndSkipTranslogRecovery(IndexShard.java:1603) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.indices.recovery.RecoveryTarget.lambda$prepareForTranslogOperations$0(RecoveryTarget.java:287) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:325) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.indices.recovery.RecoveryTarget.prepareForTranslogOperations(RecoveryTarget.java:285) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:383) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:377) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:257) ~[?:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:225) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.lambda$messageReceived$0(SecurityServerTransportInterceptor.java:306) ~[?:?]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.xpack.security.authz.AuthorizationService.authorizeSystemUser(AuthorizationService.java:385) ~[?:?]
at org.elasticsearch.xpack.security.authz.AuthorizationService.authorize(AuthorizationService.java:190) ~[?:?]
at org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile.lambda$inbound$1(ServerTransportFilter.java:129) ~[?:?]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$authenticateAsync$2(AuthenticationService.java:324) ~[?:?]
at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$lookForExistingAuthentication$6(AuthenticationService.java:386) ~[?:?]
at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lookForExistingAuthentication(AuthenticationService.java:397) ~[?:?]
at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.authenticateAsync(AuthenticationService.java:321) ~[?:?]
at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.access$000(AuthenticationService.java:263) ~[?:?]
at org.elasticsearch.xpack.security.authc.AuthenticationService.authenticate(AuthenticationService.java:174) ~[?:?]
at org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile.inbound(ServerTransportFilter.java:120) ~[?:?]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:313) ~[?:?]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:264) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.7.0.jar:7.7.0]
... 3 more