-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle shard over allocation during partial zone/rack or independent node failures #1149
Conversation
Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
✅ Gradle Wrapper Validation success fdd86f6 |
✅ DCO Check Passed fdd86f6 |
Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
✅ Gradle Wrapper Validation success c8a2066 |
✅ DCO Check Passed c8a2066 |
✅ Gradle Precommit success c8a2066 |
✅ DCO Check Passed 4561a95547df43c28a8a04b3477e8e421b5730b3 |
✅ Gradle Wrapper Validation success 4561a95547df43c28a8a04b3477e8e421b5730b3 |
✅ Gradle Precommit success 4561a95547df43c28a8a04b3477e8e421b5730b3 |
✅ Gradle Wrapper Validation success 1a2a6a4ed37c5be73f75fbfc9dca050b3be145d2 |
✅ DCO Check Passed 1a2a6a4ed37c5be73f75fbfc9dca050b3be145d2 |
❌ Gradle Precommit failure 1a2a6a4ed37c5be73f75fbfc9dca050b3be145d2 |
✅ DCO Check Passed cd20052d7d1b7dcf4089a96159602ab65c38d067 |
✅ Gradle Wrapper Validation success cd20052d7d1b7dcf4089a96159602ab65c38d067 |
✅ Gradle Precommit success cd20052d7d1b7dcf4089a96159602ab65c38d067 |
✅ DCO Check Passed 4eaf4b5c9773a73472a0f64a689b526f4688d7a2 |
✅ Gradle Wrapper Validation success 4eaf4b5c9773a73472a0f64a689b526f4688d7a2 |
✅ DCO Check Passed 9d29e79aa55b0832dfeef505e7b38208a13dabad |
✅ Gradle Wrapper Validation success 9d29e79aa55b0832dfeef505e7b38208a13dabad |
✅ Gradle Precommit success 4eaf4b5c9773a73472a0f64a689b526f4688d7a2 |
✅ Gradle Precommit success 9d29e79aa55b0832dfeef505e7b38208a13dabad |
✅ Gradle Wrapper Validation success 00c7f10e1962c668f1aa007240fbba0da53d0f8e |
✅ DCO Check Passed 00c7f10e1962c668f1aa007240fbba0da53d0f8e |
✅ Gradle Precommit success 00c7f10e1962c668f1aa007240fbba0da53d0f8e |
✅ DCO Check Passed 6ce9f9b477afe599889c2e7ac9d38df9557d2c30 |
✅ Gradle Wrapper Validation success 6ce9f9b477afe599889c2e7ac9d38df9557d2c30 |
✅ Gradle Precommit success 7bfb392 |
...n/java/org/opensearch/cluster/routing/allocation/decider/NodeLoadAwareAllocationDecider.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes. One overall comment is to refactor the tests. Those are really lengthy and painful to review and will be hard to maintain.
* due to node failures or otherwise on the surviving nodes. The allocation limits | ||
* are decided by the user provisioned capacity, to determine if there were lost nodes | ||
* <pre> | ||
* cluster.routing.allocation.overload_awareness.provisioned_capacity: N |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please document the expectation from admin that this setting is supposed to be updated whenever the cluster is scaled up or down?
...n/java/org/opensearch/cluster/routing/allocation/decider/NodeLoadAwareAllocationDecider.java
Outdated
Show resolved
Hide resolved
...er/src/test/java/org/opensearch/cluster/routing/allocation/NodeLoadAwareAllocationTests.java
Outdated
Show resolved
Hide resolved
...er/src/test/java/org/opensearch/cluster/routing/allocation/NodeLoadAwareAllocationTests.java
Outdated
Show resolved
Hide resolved
...er/src/test/java/org/opensearch/cluster/routing/allocation/NodeLoadAwareAllocationTests.java
Outdated
Show resolved
Hide resolved
...er/src/test/java/org/opensearch/cluster/routing/allocation/NodeLoadAwareAllocationTests.java
Outdated
Show resolved
Hide resolved
...er/src/test/java/org/opensearch/cluster/routing/allocation/NodeLoadAwareAllocationTests.java
Outdated
Show resolved
Hide resolved
...n/java/org/opensearch/cluster/routing/allocation/decider/NodeLoadAwareAllocationDecider.java
Show resolved
Hide resolved
Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
✅ Gradle Wrapper Validation success 24f29ee |
✅ DCO Check Passed 24f29ee |
✅ Gradle Precommit success 24f29ee |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice refactor, thanks.
@@ -25,7 +25,9 @@ | |||
/** | |||
* This {@link NodeLoadAwareAllocationDecider} controls shard over-allocation | |||
* due to node failures or otherwise on the surviving nodes. The allocation limits | |||
* are decided by the user provisioned capacity, to determine if there were lost nodes | |||
* are decided by the user provisioned capacity, to determine if there were lost nodes. | |||
* The provisioned capacity as defined by the below settings needs to updated one every |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo
@@ -323,20 +278,14 @@ public void testExistingPrimariesAllocationOnOverload() { | |||
assertThat(newState.getRoutingNodes().node("node4").size(), equalTo(12)); | |||
|
|||
logger.info("--> Remove node4 from zone holding primaries"); | |||
newState = removeNode(newState, "node4", strategy); | |||
newState = removeNodes(newState, strategy,"node4"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: space after ,
✅ DCO Check Passed 38cfe08 |
✅ Gradle Wrapper Validation success 38cfe08 |
✅ Gradle Precommit success 38cfe08 |
start gradle check |
1 similar comment
start gradle check |
…node failures (opensearch-project#1149) The changes ensure that in the event of a partial zone failure, the surviving nodes in the minority zone don't get overloaded with shards, this is governed by a skewness limit. Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
…ndependent … (#1268) * Handle shard over allocation during partial zone/rack or independent node failures (#1149) The changes ensure that in the event of a partial zone failure, the surviving nodes in the minority zone don't get overloaded with shards, this is governed by a skewness limit. Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com> * Fix up imports Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com> * Fix up imports Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com> * Fix up imports Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com> * Fix up check style Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
Signed-off-by: Bukhtawar Khan bukhtawa@amazon.com
Description
The changes ensure that in the event of a partial zone failure, the surviving nodes in the minority zone don't get overloaded with shards, this is governed by a skewness limit.
Issues Resolved
#938
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.