
Make StickyRebalanceStrategy topology aware #2944

Open · wants to merge 4 commits into base: helix-gateway-service
Conversation

frankmu (Contributor) commented Oct 9, 2024

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

resolves #2822

Description

  • Here are some details about my PR, including screenshots of any UI changes:

The previous StickyRebalanceStrategy leveraged a greedy approach for assignment. This PR makes it topology aware: for each partition, we try to assign its replicas to different fault zones.
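As a rough illustration of the idea (hypothetical class and method names, not the actual PR code), topology-aware assignment can be sketched as preferring candidate nodes whose fault zone is not yet used by the partition's other replicas, falling back to any node with remaining capacity:

```java
import java.util.*;

public class ZoneAwareAssigner {
    // Pick nodes for a partition's replicas, preferring unused fault zones.
    // nodeToZone maps node id -> fault zone id; returns the chosen node ids.
    public static List<String> assignReplicas(int replicaCount,
                                              Map<String, String> nodeToZone) {
        List<String> chosen = new ArrayList<>();
        Set<String> usedZones = new HashSet<>();
        Set<String> usedNodes = new HashSet<>();
        for (int r = 0; r < replicaCount; r++) {
            String pick = null;
            // First pass: a node in a fault zone this partition has not used yet.
            for (Map.Entry<String, String> e : nodeToZone.entrySet()) {
                if (!usedNodes.contains(e.getKey())
                        && !usedZones.contains(e.getValue())) {
                    pick = e.getKey();
                    break;
                }
            }
            // Fallback: any unused node, even if its zone is already taken.
            if (pick == null) {
                for (String node : nodeToZone.keySet()) {
                    if (!usedNodes.contains(node)) {
                        pick = node;
                        break;
                    }
                }
            }
            if (pick == null) {
                break; // not enough nodes for all replicas
            }
            usedNodes.add(pick);
            usedZones.add(nodeToZone.get(pick));
            chosen.add(pick);
        }
        return chosen;
    }
}
```

The fallback pass keeps the strategy best-effort: when there are more replicas than fault zones, replicas still get placed rather than left unassigned.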

Tests

  • The following tests are written for this issue:

TestStickyRebalanceStrategy.testNoSameZoneAssignment()

mvn test -o -Dtest=TestStickyRebalanceStrategy -pl=helix-core
...
[INFO] Tests run: 11, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 75.78 s - in TestSuite
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 11, Failures: 0, Errors: 0, Skipped: 0
...
mvn test -o -Dtest=TestStickyRebalanceWithGlobalPerInstancePartitionLimit -pl=helix-core
...
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 19.209 s - in org.apache.helix.integration.rebalancer.TestStickyRebalanceWithGlobalPerInstancePartitionLimit
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
  • The following is the result of the "mvn test" command on the appropriate module:

(If CI test fails due to known issue, please specify the issue and test PR locally. Then copy & paste the result of "mvn test" to here.)

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

(Consider including all behavior changes for public methods or API. Also include these changes in merge description so that other developers are aware of these changes. This allows them to make relevant code changes in feature branches accounting for the new method/API behavior.)

Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:

(Link the GitHub wiki you added)

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)


/**
* Get the fault zone of this node
* @return The ID of this node
Contributor: return the fault zone?

* For example, when
* the domain is "zone=2, instance=testInstance" and the fault zone type is "zone", this function
* returns "2".
* If cannot find the fault zone type, this function leaves the fault zone id as the instance name.
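The parsing behavior the javadoc above describes could be sketched as follows (a hypothetical standalone helper for illustration, not the actual Helix implementation):

```java
public class FaultZoneParser {
    // Parse a topology domain string such as "zone=2, instance=testInstance"
    // and return the value for the given fault zone type (e.g. "zone" -> "2").
    // Falls back to the instance name when the fault zone type is absent,
    // matching the behavior described in the javadoc above.
    public static String computeFaultZone(String domain, String faultZoneType,
                                          String instanceName) {
        for (String part : domain.split(",")) {
            String[] kv = part.trim().split("=", 2);
            if (kv.length == 2 && kv[0].trim().equals(faultZoneType)) {
                return kv[1].trim();
            }
        }
        return instanceName;
    }
}
```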
Contributor: Should we use the logical ID as the default fault zone if we can't find the fault zone type?

Contributor (author):
This function is copied from the waged rebalancer: https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/controller/rebalancer/waged/model/AssignableNode.java#L363-L383

Any pointers on why logical ID is a better candidate here?

Contributor: In terms of swap, using the logical ID will put the swap-in/out instances in the same fault zone. But I also agree that this should align with the other rebalancers' behavior. What do you think about adding a TODO and considering changing both places later?

}
for (int j = 0; j < remainingReplica; j++) {
  while (index - startIndex < assignableNodeList.size()) {
    CapacityNode node = assignableNodeList.get(index++ % assignableNodeList.size());
    if (node.canAdd(_resourceName, _partitions.get(i))) {
      stateMap.computeIfAbsent(_partitions.get(i), m -> new HashSet<>()).add(node.getId());
      // Valid assignment when following conditions match:
Contributor: May I suggest moving this logic to a separate function for extensibility?
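One way to read the suggestion (a hypothetical shape, not the actual PR code): pull the per-node validity conditions out of the assignment loop into a single predicate, so future constraints can be added in one place:

```java
import java.util.Set;

public class AssignmentValidator {
    // Conditions a candidate node must satisfy for a given partition.
    // Extracted into one method so future constraints (capacity, tags,
    // more topology rules, ...) can be added without touching the loop.
    public static boolean isValidAssignment(Set<String> usedFaultZones,
                                            String nodeFaultZone,
                                            boolean nodeHasCapacity) {
        return nodeHasCapacity && !usedFaultZones.contains(nodeFaultZone);
    }
}
```

The assignment loop would then call this predicate instead of inlining the checks, which also makes the conditions unit-testable in isolation.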

* TODO: change the return value to logical id when no fault zone type found. Also do the same for
* waged rebalancer in helix-core/src/main/java/org/apache/helix/controller/rebalancer/waged/model/AssignableNode.java
*/
private String computeFaultZone(ClusterConfig clusterConfig, InstanceConfig instanceConfig) {
Contributor: Can we try to create a util class with this logic instead of copying it in two places and having to maintain both methods?

Contributor: It may be worthwhile to create a Node base class (different than the one under the topology package) that extracts the common logic from AssignableNode and CapacityNode.

Contributor (author): This is something I prefer to do in a later iteration: consolidate CapacityNode and AssignableNode and apply the changes @xyuanlu mentioned above.

@@ -573,11 +575,16 @@ public WagedInstanceCapacity getWagedInstanceCapacity() {
return _wagedInstanceCapacity;
}

private void buildSimpleCapacityMap(int globalMaxPartitionAllowedPerInstance) {
private void buildSimpleCapacityMap() {
Contributor: This is just a general comment that we don't need to address right away, but we may want to address it before merging this to master.

Should we be building the capacity nodes in the resource controller data provider? It seems like an implementation detail for one specific rebalance strategy. If no resources in the cluster use StickyRebalanceStrategy, then this is not needed. Why don't we move the creation of simpleCapacitySet to the rebalance strategy?

If this is here because StickyRebalanceStrategy relies on global node capacity, maybe we should implement the StatefulRebalancer interface instead of using single-resource rebalance. That interface has the method computeNewIdealStates for all resources and is intended to be implemented for a global rebalancer.

Contributor (author):

"If this is here because StickyRebalanceStrategy relies on global node capacity"

Yeah, this is exactly the reason: we need to populate the node usage globally before the rebalance stage. StatefulRebalancer seems intended for a rebalancer, but here we need a rebalance strategy. To me, the data provider is the glue between the rebalancer and the rebalance strategy.

for (String instance : getEnabledLiveInstances()) {
CapacityNode capacityNode = new CapacityNode(instance);
capacityNode.setCapacity(globalMaxPartitionAllowedPerInstance);
for (String instanceName : getEnabledLiveInstances()) {
Contributor:
Why don't disabled and down instances get included in the capacitySet? This will cause all the replicas on the down or disabled nodes to be reassigned somewhere else in the cluster if delayed rebalance is being used. We should have all assignable instances in the map and leave it to the rebalancer to pass down liveNodes to computePartitionAssignment.

Contributor (author):
This will cause all the replicas on the down or disabled nodes to be reassigned somewhere else in the cluster if delayed rebalance is being used

If the node is disabled or down, do we expect to re-assign the replica to other nodes?

Contributor: This behavior is determined by the rebalancer, not the rebalance strategy. If you use DelayedAutoRebalancer, it should not try to reassign the replicas to other nodes until after the delay window. If you use AutoRebalancer, it should.

If we only have enabled live instances in the capacitySet, we get the AutoRebalancer behavior I described even when DelayedAutoRebalancer is used, which is inconsistent with the assumptions of how each rebalancer should work.

Contributor (author): Got it, thanks for the detailed context; will make the changes.

@zpinto (Contributor) left a comment:
Overall LGTM for now. I think that in a later PR we need to revisit the following comments and address them more cleanly:
#2944 (comment)
#2944 (comment)

Also, have a concern on the test case. Please take a look.

@Test
public void testNoSameZoneAssignment() throws Exception {
setTopologyAwareAndGlobalMaxPartitionAllowedPerInstanceInCluster(CLUSTER_NAME, 1);
Map<String, ExternalView> externalViews = createTestDBs();
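The assertion the test name implies — no partition has two replicas in the same fault zone — could be checked with a helper along these lines (hypothetical names and a plain-map view of the external view, for illustration only):

```java
import java.util.*;

public class ZoneAssertionExample {
    // Given partition -> (instance -> state) and instance -> zone maps,
    // verify that no partition has two replicas in the same fault zone.
    public static boolean noSameZone(Map<String, Map<String, String>> stateMap,
                                     Map<String, String> instanceToZone) {
        for (Map<String, String> replicas : stateMap.values()) {
            Set<String> zones = new HashSet<>();
            for (String instance : replicas.keySet()) {
                // Set.add returns false on a duplicate: two replicas share a zone.
                if (!zones.add(instanceToZone.get(instance))) {
                    return false;
                }
            }
        }
        return true;
    }
}
```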
Contributor:
Does createTestDBs use the verifier to make sure the controller thread is done computing the assignment before starting the assertions? If not, this should be added.

Contributor (author): Do you mean the following?

Assert.assertTrue(_clusterVerifier.verifyByPolling());

If yes, then yes and here is the code pointer: https://github.com/apache/helix/blob/helix-gateway-service/helix-core/src/test/java/org/apache/helix/integration/rebalancer/TestStickyRebalanceStrategy.java#L332

}

/**
* Constructor used for non-topology-aware use case
Contributor: I think this is for the topology-aware case? Also, can we have one constructor call the other?

Contributor (author): Yes, will make this change.

4 participants