-
Notifications
You must be signed in to change notification settings - Fork 3.4k
HBASE-27563 ChaosMonkey sometimes generates invalid boundaries for random item selection #4954
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HBASE-27563 ChaosMonkey sometimes generates invalid boundaries for random item selection #4954
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
|
||
int startIndex = ThreadLocalRandom.current().nextInt(items.length - selectedNumber); | ||
return originalItems.subList(startIndex, startIndex + selectedNumber); | ||
final int startIndex = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the problem here is ratio could be greater than 1.0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it seems so... and it looks like this patch is not sufficient. Here's the stack trace from branch-2.5.
2023-01-10T19:33:35,651 WARN [ChaosMonkey-0] policies.Policy: Exception occurred during performing action: java.lang.IllegalArgumentException: bound must be positive
at java.base/java.util.Random.nextInt(Random.java:322)
at java.base/java.util.concurrent.ThreadLocalRandom.nextInt(ThreadLocalRandom.java:449)
at org.apache.hadoop.hbase.chaos.monkies.PolicyBasedChaosMonkey.selectRandomItems(PolicyBasedChaosMonkey.java:123)
at org.apache.hadoop.hbase.chaos.actions.RollingBatchRestartRsAction.selectServers(RollingBatchRestartRsAction.java:130)
at org.apache.hadoop.hbase.chaos.actions.RollingBatchRestartRsAction.perform(RollingBatchRestartRsAction.java:75)
at org.apache.hadoop.hbase.chaos.policies.DoActionsOncePolicy.runOneIteration(DoActionsOncePolicy.java:48)
at org.apache.hadoop.hbase.chaos.policies.PeriodicPolicy.run(PeriodicPolicy.java:41)
at org.apache.hadoop.hbase.chaos.policies.CompositeSequentialPolicy.run(CompositeSequentialPolicy.java:42)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's more likely that I have ratio == length
, we'll have this problem... Let me add some debugging and see what I can learn.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'd better add some comments here to say why we need these guards? Is it because of the floating point precision?
Checked the code, the ratio is percentage of regionservers we want to restart, normally, so in general it should be a value between 0 and 1. But there is no sanity check in our code... |
|
We have |
This code handles the |
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
|
||
List<T> originalItems = Arrays.asList(items); | ||
final int selectedNumber = (int) Math.ceil(items.length * ratio); | ||
final List<T> originalItems = Arrays.asList(items); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arrays.asList will create a ArrayList which is backed by the given array, so the later shuffle will change the order of the items array which is passed in as a parameter. If we think this is acceptable, we can just shuffle on the given array instead of wrapping it? If not, I think we should copy the array.
|
||
int startIndex = ThreadLocalRandom.current().nextInt(items.length - selectedNumber); | ||
return originalItems.subList(startIndex, startIndex + selectedNumber); | ||
if (selectedNumber == items.length) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use >= for safety?
|
||
int startIndex = ThreadLocalRandom.current().nextInt(items.length - selectedNumber); | ||
return originalItems.subList(startIndex, startIndex + selectedNumber); | ||
final int startIndex = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'd better add some comments here to say why we need these guards? Is it because of the floating point precision?
Good points all around, thanks @Apache9 |
@@ -151,7 +157,9 @@ public boolean isStopped() { | |||
|
|||
@Override | |||
public void waitForStop() throws InterruptedException { | |||
monkeyThreadPool.awaitTermination(1, TimeUnit.MINUTES); | |||
if (!monkeyThreadPool.awaitTermination(1, TimeUnit.MINUTES)) { | |||
LOG.warn("Some pool threads failed to terminate, {}", monkeyThreadPool); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strange. monkeyThreadPool
is not a daemon pool, so if this termination fails for some reason, the process can hang. Let me add a forced shutdown.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
final int startIndex = | ||
Math.max(0, ThreadLocalRandom.current().nextInt(items.length - selectedNumber)); | ||
final int endIndex = Math.min(items.length, startIndex + selectedNumber); | ||
return shuffledItems.subList(startIndex, endIndex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe here you could make use of org.apache.hadoop.hbase.util.ReservoirSample so we do not need to copy the whole array? Not a blocker issue anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, that's simpler.
5ea901a
to
00ca430
Compare
…ndom item selection Signed-off-by: Duo Zhang <zhangduo@apache.org>
00ca430
to
b02e312
Compare
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
Clamp the boundaries of the selected sublist according to the input list size.