-
Notifications
You must be signed in to change notification settings - Fork 3.4k
HBASE-28396: Quota throttling can cause a leak of scanners #6055
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
I don't think the build failure is related to my change |
|
Just ignore the Hadoop 3.4.0 Hadoop check failure. The branch itself is broken. |
|
I'm working on adding a test at the moment that will sanity check the scanner counts |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
|
||
| /** This class tracks a single Lease. */ | ||
| static class Lease implements Delayed { | ||
| protected static class Lease implements Delayed { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if there's a better way to do this, but I needed to expose this to implement this method to track the leases created
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't package protected is a super-set of protected permissions? The LeaseManager$Lease class should already be accessible to anything in the o.a.h.h.regionserver package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, NoopOperationQuota is also protected in the same way LeaseManager is. Originally, this was the reason I placed the test under the quotas package. I considered creating a test OperationQuota class that is a 1:1 copy of NoopOperationQuota, but I didn't love that. Though I'm happy to go that route if it's preferred
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, I considered tracking the # of scanners created, but the code in RSRpcServices makes it tricky to do. All the relevant methods are private, and I was very reluctant to expose them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have an idea, iterating on this now.
| public void itIncreasesScannerCount() throws Exception { | ||
| long beforeScan = countLeases(); | ||
| ResultScanner scanner = TABLE.getScanner(new Scan()); | ||
| Thread.sleep(1000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to wait until the scan and lease are created server-side. Otherwise, our lease counting will not reflect this new scan that was created
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests with fixed timed waits like this are guaranteed to show up on our flakey dashboard. Instead of this, I would loop the beforeScan < afterScan in a retry-with-timeout structure. We have the Waiter class and HBaseCommonTestingUtil#waitFor methods for exactly this purpose. Whatever hacks you required, make sure you include a comment in code, not just on the PR.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
@Apache9 I'd appreciate any feedback or comments if you have some time! |
|
@bbeaudreault I think you are more familiar with this area? Do you have sometime to review this? Thanks. |
|
I'm sorry I don't have time to review in the near future. Maybe @rmdmattingly |
|
Ah, never mind. I'm also a bit busy recently, for getting a new job. I will try to find sometime review this PR in the next few days. Thanks. |
|
No rush, I think @ndimiduk is also available next week |
@Apache9 congrats on the new job! |
ndimiduk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please improve the test and check again the package-protected change. The comment about holder vs. context is more of a nice-to-have. Otherwise looks good.
| public void itIncreasesScannerCount() throws Exception { | ||
| long beforeScan = countLeases(); | ||
| ResultScanner scanner = TABLE.getScanner(new Scan()); | ||
| Thread.sleep(1000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests with fixed timed waits like this are guaranteed to show up on our flakey dashboard. Instead of this, I would loop the beforeScan < afterScan in a retry-with-timeout structure. We have the Waiter class and HBaseCommonTestingUtil#waitFor methods for exactly this purpose. Whatever hacks you required, make sure you include a comment in code, not just on the PR.
| } | ||
| } | ||
|
|
||
| private long countLeases() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really coarse-grained. Even limiting the check to a specific region server would prove a more reliable test. Extracting the scannerId and looking for it specifically will be the most reliable way to test here. I don't know if we have reasonable utilities to expose that information on the client through.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the test to be less coarse-grained
|
|
||
| /** This class tracks a single Lease. */ | ||
| static class Lease implements Delayed { | ||
| protected static class Lease implements Delayed { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't package protected is a super-set of protected permissions? The LeaseManager$Lease class should already be accessible to anything in the o.a.h.h.regionserver package.
| private final RegionScannerHolder rsh; | ||
| private final OperationQuota quota; | ||
|
|
||
| QuotaAndRegionScannerHolder(String scannerName, RegionScannerHolder rsh, OperationQuota quota) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of wrapping the existing holder class, does it make sense to push the Quota and name information down into the RegionScannerHolder? I imagine that there's other related places where things could be cleaner if the Quota was automagically along for the ride. Maybe rename it to a Context objects instead of "holder".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this makes sense, I've refactored the code a bit here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only oddity here is that now the context may hold a Quota that isn't up-to-date because updating the RegionScannerContext's quota isn't thread-safe. There are other variables of the RSX that are updated this way though, so maybe this isn't an actual issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That doesn't sound great. Maybe this is why the quota isn't already in that context object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think even if we make update the context to hold an AtomicReference<OperationQuota>, it's a bit odd for there to be two sources of truth for the quota. The first being the RpcQuotaManager and the second being the cached scanner context.
I'm going to roll this change back because I think it adds more complexities and isn't solving any pertinent issues at the moment
a8b3200 to
19329d1
Compare
This comment has been minimized.
This comment has been minimized.
19329d1 to
f626549
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
These test failures don't seem related to my changes. Should I push an empty commit to trigger a build? |
Nope, I can do it. If you have a moment and want to dig into that test failure though... |
Happy to do that! |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
Sadly seems like unique tests are failing on each run, mostly related to timeouts. One of the tests timed out when waiting on a call to HBaseTestingUtil#waitTableAvailable. Is it possible that the test environment is in some precarious state at the moment? |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
Yeah generally we have something that causes master initialization to take a long time. I'd love to be able to track it down, but i've not yet found a thread dump that showed where the issue is. |
Co-authored-by: Hernan Gelaf-Romer <hgelafromer@hubspot.com> Signed-off-by: Nick Dimiduk <ndimiduk@apache.org>
Co-authored-by: Hernan Gelaf-Romer <hgelafromer@hubspot.com> Signed-off-by: Nick Dimiduk <ndimiduk@apache.org>
Co-authored-by: Hernan Gelaf-Romer <hgelafromer@hubspot.com> Signed-off-by: Nick Dimiduk <ndimiduk@apache.org>
Co-authored-by: Hernan Gelaf-Romer <hgelafromer@hubspot.com> Signed-off-by: Nick Dimiduk <ndimiduk@apache.org>
Co-authored-by: Hernan Gelaf-Romer <hgelafromer@hubspot.com> Signed-off-by: Nick Dimiduk <ndimiduk@apache.org>
Co-authored-by: Hernan Gelaf-Romer <hgelafromer@hubspot.com> Signed-off-by: Nick Dimiduk <ndimiduk@apache.org>
Co-authored-by: Hernan Gelaf-Romer <hgelafromer@hubspot.com> Signed-off-by: Nick Dimiduk <ndimiduk@apache.org>
… (addendum for revert of HBASE-28346)
HubSpot Backport: HBASE-28396: Quota throttling can cause a leak of scanners (apache#6055)
…canners (apache#6055) (#111) * HBASE-28396 Quota throttling can cause a leak of scanners (apache#6055) Co-authored-by: Hernan Gelaf-Romer <hgelafromer@hubspot.com> Signed-off-by: Nick Dimiduk <ndimiduk@apache.org> * fix --------- Signed-off-by: Nick Dimiduk <ndimiduk@apache.org> Co-authored-by: Hernan Gelaf-Romer <hgelafromer@hubspot.com>
Co-authored-by: Hernan Gelaf-Romer <hgelafromer@hubspot.com> Signed-off-by: Nick Dimiduk <ndimiduk@apache.org>
We've noticed memory issues during intense periods of quota throttling. This happens because new scanners are created prior to the acquisition of the throttle, and aren't cleaned up correctly in the case acquiring the throttle throws an exception. This change creates a new region scanner only if the quota was acquired successfully to avoid scanner leaks.