-
Notifications
You must be signed in to change notification settings - Fork 566
HDDS-13694. Container Balancer Stop Command Fails with Error as Already Stopped #9047
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@siddhantsangwan Please review this PR. |
lock.lock(); | ||
try { | ||
validateState(true); | ||
validateEligibility(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we write this logic something like below
public void stopBalancer()
throws IOException, IllegalContainerBalancerStateException {
Thread balancingThread = null;
lock.lock();
try {
if (!isBalancerRunning()) {
LOG.info("ContainerBalancer already stopped;");
return;
}
validateEligibility();
saveConfiguration(config, false, 0);
LOG.info("Trying to stop ContainerBalancer service.");
task.stop();
balancingThread = currentBalancingThread;
} finally {
lock.unlock();
}
if (balancingThread != null) {
blockTillTaskStop(balancingThread);
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@siddhantsangwan What's your take on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@priyeshkaratha Please check the JIRA description for the proposed change. The idea is to make API idempotent and also fix a race condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@priyeshkaratha I had created this jira to explain the race - https://issues.apache.org/jira/browse/HDDS-13698.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am ok with the changes but my question if its already stopped do we really need to check validateEligibility? Instead can we return.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because the in-memory state of being stopped may not be the correct state as I've described in the jira.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor comments on the tests.
.../integration-test/src/test/java/org/apache/hadoop/ozone/TestContainerBalancerOperations.java
Show resolved
Hide resolved
...r-scm/src/test/java/org/apache/hadoop/hdds/scm/container/balancer/TestContainerBalancer.java
Outdated
Show resolved
Hide resolved
...r-scm/src/test/java/org/apache/hadoop/hdds/scm/container/balancer/TestContainerBalancer.java
Outdated
Show resolved
Hide resolved
@sarvekshayr please check if the test failures are related |
No, the test failures are not related. Please retrigger the failed ones. |
What changes were proposed in this pull request?
stopBalancer
should be idempotent - meaning that if the balancer is already in a STOPPED state, invoking the stop command again should simply return successfully with exit code 0 and perform no action.Currently, it instead throws the following error:
This behavior is fixed so that repeated stop calls are handled gracefully.
What is the link to the Apache JIRA
HDDS-13694
How was this patch tested?
Tested in
TestContainerBalancer#testStartBalancerStop
.CI: https://github.com/sarvekshayr/ozone/actions/runs/17828304300