
Add allocate_all_primaries to cluster reroute #4285


Closed

Conversation

@nik9000 (Member) commented Nov 27, 2013

From the docs:
allocate_all_primaries::
Allocate all unallocated primaries to any node that can take them.
Accepts no parameters. Each allocation is similar to running allocate
with allow_primary so this can cause data loss. This is useful in the
same cases as allocate with allow_primary but doesn't require looking
up the index or shard or guessing an appropriate node.

Closes #4206
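For comparison, this is roughly what the manual workaround looks like with the existing `allocate` command; the `index`, `shard`, and `node` values below are placeholders, not values from this PR:

```shell
# Manual workaround using the existing `allocate` command with `allow_primary`.
# "test", 0, and "node1" are placeholder values - substitute your own.
REROUTE_BODY='{
  "commands" : [
    {
      "allocate" : {
        "index" : "test",
        "shard" : 0,
        "node" : "node1",
        "allow_primary" : true
      }
    }
  ]
}'
# Then: curl -XPOST 'localhost:9200/_cluster/reroute' -d "$REROUTE_BODY"
```

The new command removes the need to look up each index, shard, and node by hand before issuing one of these per unassigned primary.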

@nik9000 (Member Author) commented Nov 27, 2013

I've confirmed this works using the local gateway:

1. Start two nodes.
2. Execute:

```shell
curl -XDELETE "http://localhost:9200/test?pretty" -s
curl -XPOST "http://localhost:9200/test?pretty" -s -d '{
  "settings": {
    "index": {
      "number_of_shards": 5,
      "number_of_replicas": 0
    }
  }
}'
for i in {1..100}; do
  curl -XPOST "http://localhost:9200/test/test?pretty" -d '{"foo": "1"}' -s
done
```

3. Shut down the node at localhost:9201. Wait for a few seconds.
4. Execute the below and notice the timeouts. Ctrl-C it when you are bored.

```shell
for i in {1..100}; do
  curl -XPOST "http://localhost:9200/test/test?pretty" -d '{"foo": "1"}' -s
done
```

5. Execute this:

```shell
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands" : [
    {
      "allocate_all_primaries" : {}
    }
  ]
}'
```

6. Now this will work without timeouts:

```shell
for i in {1..100}; do
  curl -XPOST "http://localhost:9200/test/test?pretty" -d '{"foo": "1"}' -s
done
```
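A quick way to check which shards a reroute like this would actually need to touch is to filter `_cat/shards` output for unassigned entries; a small sketch, assuming Elasticsearch is listening on localhost:9200:

```shell
# Print shards still in the UNASSIGNED state.
# The state is the fourth column of _cat/shards output.
list_unassigned() {
  curl -s "localhost:9200/_cat/shards" | awk '$4 == "UNASSIGNED"'
}
# Usage: list_unassigned
```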

The data is lost but at least you don't have timeouts.

Github's markup is making a mess of this....

@kimchy (Member) commented Dec 7, 2013

It would be interesting to check somehow whether the primary allocation is just being throttled from being allocated to a node, and in that case, not force the allocation. This will require sharing shard knowledge somehow with LocalGatewayAllocator (in the case of the local gateway, we need to take the gateway abstraction into account; maybe have a method that will give the node for a primary shard, and then check the deciders on it).

@nik9000 (Member Author) commented Dec 9, 2013

@kimchy, I understand what you are saying but I'm not sure how I'd go about it. It does make me think of something else: will this force allocation and ignore throttling? Is that OK if we're allocating thousands of shards?

I can have a look at implementing what you mention sometime in the next few days.

@kimchy (Member) commented Dec 9, 2013

@nik9000 this force allocation will not end up ignoring throttling, it will just come back to being allocated and respect throttling.

@nik9000 (Member Author) commented Dec 9, 2013

That, at least, is great news. I can imagine folks in a disaster trying this over and over again, which won't help. I'll make sure that it refuses to do anything if all the unallocated primaries are throttled. I'll see about spitting out a different error message in that case so people know that all shards are in the process of being allocated.

@nik9000 (Member Author) commented Dec 10, 2013

(in case of local gateway, somehow, we need to take into account the gateway abstraction, maybe have a method that will give the node for a primary shard, and then check the decider on it).

So I had a look at this and I'm not really sure how to do this because the decision about which node to assign the shard comes after allocation commands are run. I wonder if it'd be simpler to store the list of throttled shards in the cluster state and dig it back out again during the allocation command....

@kimchy (Member) commented Dec 10, 2013

To be honest, I don't have a good idea about how to do it yet either :). I will try and spend some time thinking about it and provide feedback soonish (sorry!).

@nik9000 (Member Author) commented Dec 10, 2013

I thought I could get this from the AllocationExplanation on ClusterState but that always seems to be empty. I actually can't find any code that sets it.

From the docs:
`allocate_all_primaries`::
    Allocate all unallocated primaries to any node that can take them.
    Accepts no parameters.  Each allocation is similar to running `allocate`
    with `allow_primary` so this can cause data loss.  This is useful in the
    same cases as `allocate` with `allow_primary` but doesn't require looking
    up the `index` or `shard` or guessing an appropriate `node`.

Closes elastic#4206

@nik9000 (Member Author) commented Dec 10, 2013

Pushed a revised version - it doesn't do what @kimchy wanted yet but is a bit nicer anyway.

@nik9000 (Member Author) commented Mar 5, 2014

I haven't looked at this in a long while. I imagine this would still be useful but don't have much time to think about it recently. Any interest in me resurrecting this?

@manologarciagarcia commented

I have exactly this problem. I have just one shard, and sometimes when I restart and look at the health of my cluster, I get this for one of my indexes:

http://pastebin.com/Tq08vep1

I know that if I delete the index, the problem will go away, but that's not the optimal solution.

Is there a solution for this problem? Are the changes here a solution to my problem?

Thanks

@d1nsh commented Jun 5, 2014

Any plans of merging this? We run into issues with "unassigned shards" occasionally and it would be great to have a feature like this.

@martijnvg (Member) commented

@nik9000 Is this still on your radar? I think this new allocation command is useful.

Just thinking out loud here about how to detect whether a node is throttling the primary shard allocation:

  1. The LocalGatewayAllocator#buildShardStates() logic can be moved to a public helper class; on top of this there can be an additional method that just returns the DiscoveryNode that has the highest shard version.
  2. Then in AllocateAllPrimariesAllocationCommand#execute() there can be something like the following logic (the helper API here is a sketch):

```java
boolean found = false;
for (MutableShardRouting routing : allocation.routingNodes().unassigned()) {
    if (!routing.primary()) {
        continue;
    }
    // Ask the (sketched) helper which node holds the newest copy of this shard.
    DiscoveryNode nodeHoldingHighestShardVersion = helper.findNodeWithHighestShardVersion(routing.shardId());
    Decision decision = Decision.YES;
    if (nodeHoldingHighestShardVersion != null) {
        RoutingNode routingNode = allocation.routingNodes().node(nodeHoldingHighestShardVersion.id());
        decision = allocation.deciders().canAllocate(routing, routingNode, allocation);
    }
    if (decision.type() != Decision.Type.THROTTLE) {
        found = true;
        // Just clear the post allocation flag on the shard so it'll assign itself.
        allocation.routingNodes().addClearPostAllocationFlag(routing.shardId());
    }
}

if (!found) {
    throw new ElasticsearchIllegalArgumentException("[allocate_all_primaries] no unassigned primaries");
}
```

This way throttled primary allocations will not be affected by the new command.

@nik9000 (Member Author) commented Sep 4, 2014

This has sunk pretty low on my radar. So low I haven't actually been checking the status and the ping must have slipped by me. I can pick it up at some point but if you want it quickly maybe you can grab it? If my code is a good starting point you can have it. Or start over - I won't be offended - the pull request is really stale.

@s1monw (Contributor) commented Sep 5, 2014

@nik9000 I labeled it accordingly so that it won't get forgotten and will be picked up at some point. Thanks for pinging again.

@clintongormley (Contributor) commented

The pain in allocating many primary shards is finding a place to put them, so a suggestion:

  • remove the allow_primary flag from allocation
  • add an allocate_primary action where:
      • node is optional - if not specified then the node is chosen automatically
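If the proposal above were adopted, the reroute request might look something like this; the command name and body shape are hypothetical, since the proposal was never merged in this form:

```shell
# Hypothetical body for the proposed allocate_primary command; "node" is
# omitted here so the cluster would pick a target node automatically.
# "test" and 0 are placeholder index/shard values.
PROPOSED_BODY='{
  "commands" : [
    { "allocate_primary" : { "index" : "test", "shard" : 0 } }
  ]
}'
```

With no `node` required per shard, forcing many primaries becomes a loop over unassigned shards rather than a manual lookup for each one.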

@martijnvg (Member) commented

+1 This plan looks good.


@joestump commented

@nik9000 please allow me to buy you a 🍺 or ☕ next time you're in Portland, OR. Great little improvement to ES right here. 👍

@soundofjw commented

+1 this would still be great ;)

@damm commented Jul 21, 2015

+1 really needed.

@clintongormley (Contributor) commented

@soundofjw @damm what version of Elasticsearch are you using? I asked our support team just a few days ago if they still think that this functionality would be useful. Their response was that, with recent versions, the need for this has pretty much disappeared.

@damm commented Jul 23, 2015

@clintongormley I'm using 1.7.0; I still hit issues where I break out the bash scripts from this pull request - on a single node recently, but on a cluster a month ago.

Not common but it happens enough that I don't forget it.

@soundofjw commented

@clintongormley Pretty much the same - 1.7.0 as well. There are a few times when we need to do this, but it usually happens when setting up a cluster for the first time, or when making big changes.

@damm commented Jul 23, 2015

+1 to making big changes; I had to break this out when I had a cluster that was not allocating based on available space and it was making one node run out of space.

Had to re-route a bunch of data quickly while waiting for Elasticsearch to balance itself out once there was enough free space.

@clintongormley (Contributor) commented

@soundofjw why would you need this when setting up a cluster for the first time, or making big changes? The only time you should need this is when you lose ALL copies of many shards (primaries and replicas) - and you want to force allocation of new empty shard copies.
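For anyone reading this thread later: Elasticsearch 5.0 split the old `allocate` command into explicit variants, and the "force allocation of new empty shard copies" case described above is covered by `allocate_empty_primary`, which makes the data loss opt-in. A sketch against the 5.x API, with placeholder index/shard/node values:

```shell
# allocate_empty_primary forces a fresh, empty primary onto a node;
# accept_data_loss must be true to acknowledge that any previous copy
# of the shard's data is discarded. "test", 0, and "node1" are placeholders.
EMPTY_PRIMARY_BODY='{
  "commands" : [
    {
      "allocate_empty_primary" : {
        "index" : "test",
        "shard" : 0,
        "node" : "node1",
        "accept_data_loss" : true
      }
    }
  ]
}'
# Then: curl -XPOST 'localhost:9200/_cluster/reroute' -d "$EMPTY_PRIMARY_BODY"
```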

@ofir-petrushka commented

@clintongormley It's like a bricked phone with no factory reset button (no new index creation, no inserts, no fix button...).
When you put up a new cluster you might not have all the settings right yet, and might reset all nodes at once and/or have no replicas set (also, data copying is delayed a lot by default and moves slowly).

For example, when installing the nodes with a deployment system (e.g. Chef, Puppet, Ansible), you might deploy to all nodes at once since you don't care yet about downtime etc., and somehow it reaches such a state.

I hit that multiple times while doing a new cluster setup (redeploying nodes again and again), plus once after a few hours of work, not sure why.

It should just be a loop of existing commands...

@soundofjw commented

@clintongormley +1 to what @ofir-petrushka and @damm are saying.

One issue I've seen more than once is when the cluster resets state due to all masters resetting - and then data nodes come and recover shards which are no longer recognized.

You'll see a lot of "# of documents mismatch" in this case.

@clintongormley (Contributor) commented

@soundofjw

One issue I've seen more than once is when the cluster resets state due to all masters resetting - and then data nodes come and recover shards which are no longer recognized.

This issue should be fixed in 2.0 with #9952

@soundofjw commented

@clintongormley Awesome! That's great news 👍

@damm commented Nov 28, 2015

@clintongormley just hit this with 2.1 :/

@clintongormley (Contributor) commented

@damm do you want to be more specific?

@damm commented Nov 28, 2015

@clintongormley had to reroute all my primary shards after a failed 2.1 upgrade from 2.0

Had to modify the scripts to make it happy.

@clintongormley (Contributor) commented

@damm I'm much more interested in why the 2.1 upgrade failed for you. Was it something wrong with 2.1 or something that you did? If the former, please open a separate issue explaining the problem.

@martijnvg removed their assignment Jan 21, 2016
@clintongormley (Contributor) commented

I'm going to close this PR as it is way out of date, and I think that the use for it is now infrequent.

@lcawl lcawl added :Distributed Indexing/Distributed A catch all label for anything in the Distributed Indexing Area. Please avoid if you can. and removed :Allocation labels Feb 13, 2018

Successfully merging this pull request may close these issues.

Cluster reroute api should have a way to assign all unassigned shards