
Search failing when cluster busy #99

Closed
@clintongormley

Description


Hiya Shay

It turns out that the issue I was having earlier with NFS was a red herring. What seems to be happening is:

My process:

  • I'm reindexing old_index into new_index
  • I read 5,000 docs from the old index, then create each one in the new index
  • If there is an error, I delete new_index
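A minimal sketch of the reindex loop described above, in Python (the original script's language isn't stated). `fetch_batch`, `create_doc` and `delete_new_index` are hypothetical callables standing in for whatever client calls the script actually makes; no Elasticsearch client API is assumed.

```python
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]


def reindex(
    fetch_batch: Callable[[int, int], List[Doc]],  # hypothetical: read `size` docs from old_index at `offset`
    create_doc: Callable[[Doc], None],             # hypothetical: create one doc in new_index
    delete_new_index: Callable[[], None],          # hypothetical: drop new_index
    batch_size: int = 5000,
) -> int:
    """Copy docs batch by batch; on any error, delete new_index and re-raise."""
    copied = 0
    offset = 0
    try:
        while True:
            batch = fetch_batch(offset, batch_size)
            if not batch:
                break  # old_index exhausted
            for doc in batch:
                create_doc(doc)
                copied += 1
            offset += len(batch)
    except Exception:
        # The cleanup described in the post: any error, including a transient
        # one from a busy cluster, removes new_index.
        delete_new_index()
        raise
    return copied
```

Note that the `except` branch makes no distinction between a real failure and a momentary search error, which is exactly how a busy cluster ends up deleting the index.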

So:

  • The cluster gets busy, and the search for the next 5,000 docs fails with this error: select failed: No child processes

  • This triggered the cleanup in my script, which deleted new_index.

  • It appears that one node deletes the index while another node is still trying to write snapshot info for the (now deleted) index, which results in these errors:

    [14:48:09,948][WARN ][index.gateway ] [Nameless One][ia_object_1270046679][0] Failed to snapshot on close
    org.elasticsearch.index.gateway.IndexShardGatewaySnapshotFailedException: [ia_object_1270046679][0] Failed to append snapshot translog into [/opt/elasticsearch/data/iAnnounce/ia_object_1270046679/0/translog/translog-3]
    at org.elasticsearch.index.gateway.fs.FsIndexShardGateway.snapshot(FsIndexShardGateway.java:199)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.snapshot(IndexShardGatewayService.java:154)
    at org.elasticsearch.index.engine.robin.RobinEngine.snapshot(RobinEngine.java:350)
    at org.elasticsearch.index.shard.service.InternalIndexShard.snapshot(InternalIndexShard.java:369)
    at org.elasticsearch.index.gateway.IndexShardGatewayService.snapshot(IndexShardGatewayService.java:150)
    at org.elasticsearch.index.gateway.IndexShardGatewayService.close(IndexShardGatewayService.java:176)
    at org.elasticsearch.index.service.InternalIndexService.deleteShard(InternalIndexService.java:244)
    at org.elasticsearch.index.service.InternalIndexService.close(InternalIndexService.java:159)
    at org.elasticsearch.indices.InternalIndicesService.deleteIndex(InternalIndicesService.java:208)
    at org.elasticsearch.indices.InternalIndicesService.deleteIndex(InternalIndicesService.java:185)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at com.google.inject.internal.ConstructionContext$DelegatingInvocationHandler.invoke(ConstructionContext.java:108)
    at $Proxy19.deleteIndex(Unknown Source)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:178)
    at org.elasticsearch.cluster.service.InternalClusterService$2.run(InternalClusterService.java:193)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)
    Caused by: java.io.FileNotFoundException: /opt/elasticsearch/data/iAnnounce/ia_object_1270046679/0/translog/translog-3 (Stale NFS file handle)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
    at org.elasticsearch.index.gateway.fs.FsIndexShardGateway.snapshot(FsIndexShardGateway.java:184)
    ... 20 more

Now I catch the "select failed: No child processes" errors, sleep for a few seconds, then retry, and everything is working well.
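A minimal sketch of that retry workaround, in Python and hypothetical: `with_retry` and `is_busy_cluster_error` are illustrative names, the attempt cap and delay are assumptions (the post only says "a few seconds"), and matching on the error text is just one way to tell the transient failure apart from a real one.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Error text reported in this issue when the cluster is busy.
BUSY_MARKER = "select failed: No child processes"


def is_busy_cluster_error(exc: Exception) -> bool:
    """Hypothetical check: treat only the 'select failed' error as transient."""
    return BUSY_MARKER in str(exc)


def with_retry(
    op: Callable[[], T],
    is_transient: Callable[[Exception], bool],
    attempts: int = 5,          # assumption: cap on tries
    delay: float = 3.0,         # assumption: "a few seconds"
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run op(); on a transient error, sleep and try again, up to `attempts` tries."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except Exception as exc:
            # Non-transient errors, or a failure on the final try, propagate to
            # the caller, where the delete-new_index cleanup would then run.
            if not is_transient(exc) or attempt == attempts:
                raise
            sleep(delay)
    raise RuntimeError("unreachable")
```

Wrapping each search for the next batch of 5,000 docs this way means a busy cluster causes a pause rather than triggering the cleanup that deletes new_index.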

ta

clint
