-
Notifications
You must be signed in to change notification settings - Fork 3.4k
HBASE-28206 [JDK17] JVM crashes intermittently on aarch64 #5561
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
Apache9
approved these changes
Dec 6, 2023
🎊 +1 overall
This message was automatically generated. |
bbeaudreault
added a commit
that referenced
this pull request
Dec 6, 2023
Signed-off-by: Duo Zhang <zhangduo@apache.org>
bbeaudreault
added a commit
that referenced
this pull request
Dec 6, 2023
Signed-off-by: Duo Zhang <zhangduo@apache.org>
bbeaudreault
added a commit
that referenced
this pull request
Dec 6, 2023
Signed-off-by: Duo Zhang <zhangduo@apache.org>
bbeaudreault
added a commit
that referenced
this pull request
Dec 6, 2023
Signed-off-by: Duo Zhang <zhangduo@apache.org>
bbeaudreault
added a commit
to HubSpot/hbase
that referenced
this pull request
Dec 6, 2023
…arch64 (apache#5561) Signed-off-by: Duo Zhang <zhangduo@apache.org>
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
kadirozde
pushed a commit
to kadirozde/hbase
that referenced
this pull request
Jan 5, 2024
Signed-off-by: Duo Zhang <zhangduo@apache.org>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
We routinely see JVM crashes originating from our RegionServerMetricsWrapperRunnable in jdk17 on aarch64. Since that method is so large, it's hard to know exactly where or what the problem is. There is an issue submitted, but no indication of whether or when a fix will come.
The core dump indicates its a problem with On Stack Replacement (OSR). Reading up on that, it can kick in when long methods are suddenly deemed hot and attempt to be optimized in place. Taking a stab in the dark based on that, I saw two potential issues in RegionServerMetricsWrapperRunnable:
This PR attempts to solve both issues:
a. regionserver-level calculations are left in the run() method
b. the main
aggregate()
method of RegionMetricAggregate loops and collects region-level metricsc. for each region, an
aggregateStores()
method is called which loops and collects store-level metrics.This may not be perfect, but I think it's an improvement for three reasons:
In order to ensure this refactor did not break any metrics, I wrote an exhaustive unit test. It validates each of the aggregated getters, and succeeds against both the old and new implementation.