Skip to content

Conversation

@hgromer
Copy link
Contributor

@hgromer hgromer commented Jul 8, 2024

We've noticed memory issues during intense periods of quota throttling. This happens because new scanners are created prior to the acquisition of the throttle, and aren't cleaned up correctly in the case acquiring the throttle throws an exception. This change creates a new region scanner only if the quota was acquired successfully to avoid scanner leaks.

@hgromer hgromer marked this pull request as ready for review July 8, 2024 14:59
@Apache-HBase

This comment has been minimized.

@Apache-HBase

This comment has been minimized.

@hgromer
Copy link
Contributor Author

hgromer commented Jul 8, 2024

I don't think the build failure is related to my change

@Apache9
Copy link
Contributor

Apache9 commented Jul 9, 2024

Just ignore the Hadoop 3.4.0 Hadoop check failure. The branch itself is broken.

@hgromer
Copy link
Contributor Author

hgromer commented Jul 9, 2024

I'm working on adding a test at the moment that will sanity check the scanner counts

@Apache-HBase

This comment has been minimized.

@Apache-HBase

This comment has been minimized.

@Apache-HBase

This comment has been minimized.

@Apache-HBase

This comment has been minimized.

@Apache-HBase

This comment has been minimized.

@Apache-HBase

This comment has been minimized.


/** This class tracks a single Lease. */
static class Lease implements Delayed {
protected static class Lease implements Delayed {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if there's a better way to do this, but I needed to expose this to implement this method to track the leases created

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't package protected is a super-set of protected permissions? The LeaseManager$Lease class should already be accessible to anything in the o.a.h.h.regionserver package.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, NoopOperationQuota is also protected in the same way LeaseManager is. Originally, this was the reason I placed the test under the quotas package. I considered creating a test OperationQuota class that is a 1:1 copy of NoopOperationQuota, but I didn't love that. Though I'm happy to go that route if it's preferred

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, I considered tracking the # of scanners created, but the code in RSRpcServices makes it tricky to do. All the relevant methods are private, and I was very reluctant to expose them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have an idea, iterating on this now.

public void itIncreasesScannerCount() throws Exception {
long beforeScan = countLeases();
ResultScanner scanner = TABLE.getScanner(new Scan());
Thread.sleep(1000);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to wait until the scan and lease are created server-side. Otherwise, our lease counting will not reflect this new scan that was created

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests with fixed timed waits like this are guaranteed to show up on our flakey dashboard. Instead of this, I would loop the beforeScan < afterScan in a retry-with-timeout structure. We have the Waiter class and HBaseCommonTestingUtil#waitFor methods for exactly this purpose. Whatever hacks you required, make sure you include a comment in code, not just on the PR.

@Apache-HBase

This comment has been minimized.

@Apache-HBase

This comment has been minimized.

@Apache-HBase

This comment has been minimized.

@Apache-HBase

This comment has been minimized.

@hgromer
Copy link
Contributor Author

hgromer commented Aug 30, 2024

@Apache9 I'd appreciate any feedback or comments if you have some time!

@Apache9
Copy link
Contributor

Apache9 commented Aug 30, 2024

@bbeaudreault I think you are more familiar with this area? Do you have sometime to review this?

Thanks.

@bbeaudreault
Copy link
Contributor

I'm sorry I don't have time to review in the near future. Maybe @rmdmattingly

@Apache9
Copy link
Contributor

Apache9 commented Aug 30, 2024

Ah, never mind. I'm also a bit busy recently, for getting a new job.

I will try to find sometime review this PR in the next few days.

Thanks.

@hgromer
Copy link
Contributor Author

hgromer commented Aug 30, 2024

No rush, I think @ndimiduk is also available next week

@ndimiduk
Copy link
Member

ndimiduk commented Sep 2, 2024

Ah, never mind. I'm also a bit busy recently, for getting a new job.

@Apache9 congrats on the new job!

Copy link
Member

@ndimiduk ndimiduk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please improve the test and check again the package-protected change. The comment about holder vs. context is more of a nice-to-have. Otherwise looks good.

public void itIncreasesScannerCount() throws Exception {
long beforeScan = countLeases();
ResultScanner scanner = TABLE.getScanner(new Scan());
Thread.sleep(1000);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests with fixed timed waits like this are guaranteed to show up on our flakey dashboard. Instead of this, I would loop the beforeScan < afterScan in a retry-with-timeout structure. We have the Waiter class and HBaseCommonTestingUtil#waitFor methods for exactly this purpose. Whatever hacks you required, make sure you include a comment in code, not just on the PR.

}
}

private long countLeases() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really coarse-grained. Even limiting the check to a specific region server would prove a more reliable test. Extracting the scannerId and looking for it specifically will be the most reliable way to test here. I don't know if we have reasonable utilities to expose that information on the client through.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the test to be less coarse-grained


/** This class tracks a single Lease. */
static class Lease implements Delayed {
protected static class Lease implements Delayed {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't package protected is a super-set of protected permissions? The LeaseManager$Lease class should already be accessible to anything in the o.a.h.h.regionserver package.

private final RegionScannerHolder rsh;
private final OperationQuota quota;

QuotaAndRegionScannerHolder(String scannerName, RegionScannerHolder rsh, OperationQuota quota) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of wrapping the existing holder class, does it make sense to push the Quota and name information down into the RegionScannerHolder? I imagine that there's other related places where things could be cleaner if the Quota was automagically along for the ride. Maybe rename it to a Context objects instead of "holder".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense, I've refactored the code a bit here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only oddity here is that now the context may hold a Quota that isn't up-to-date because updating the RegionScannerContext's quota isn't thread-safe. There are other variables of the RSX that are updated this way though, so maybe this isn't an actual issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't sound great. Maybe this is why the quota isn't already in that context object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think even if we make update the context to hold an AtomicReference<OperationQuota>, it's a bit odd for there to be two sources of truth for the quota. The first being the RpcQuotaManager and the second being the cached scanner context.

I'm going to roll this change back because I think it adds more complexities and isn't solving any pertinent issues at the moment

@Apache-HBase

This comment has been minimized.

@Apache-HBase

This comment has been minimized.

@Apache-HBase

This comment has been minimized.

@Apache-HBase

This comment has been minimized.

@hgromer
Copy link
Contributor Author

hgromer commented Sep 11, 2024

These test failures don't seem related to my changes. Should I push an empty commit to trigger a build?

@ndimiduk
Copy link
Member

These test failures don't seem related to my changes. Should I push an empty commit to trigger a build?

Nope, I can do it. If you have a moment and want to dig into that test failure though...

@hgromer
Copy link
Contributor Author

hgromer commented Sep 11, 2024

These test failures don't seem related to my changes. Should I push an empty commit to trigger a build?

Nope, I can do it. If you have a moment and want to dig into that test failure though...

Happy to do that!

@Apache-HBase

This comment has been minimized.

@Apache-HBase

This comment has been minimized.

@hgromer
Copy link
Contributor Author

hgromer commented Sep 11, 2024

Sadly seems like unique tests are failing on each run, mostly related to timeouts. One of the tests timed out when waiting on a call to HBaseTestingUtil#waitTableAvailable. Is it possible that the test environment is in some precarious state at the moment?

@Apache-HBase
Copy link

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 41s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
_ master Compile Tests _
+1 💚 mvninstall 3m 3s master passed
+1 💚 compile 2m 59s master passed
+1 💚 checkstyle 0m 38s master passed
+1 💚 spotbugs 1m 34s master passed
+1 💚 spotless 0m 43s branch has no errors when running spotless:check.
_ Patch Compile Tests _
+1 💚 mvninstall 2m 57s the patch passed
+1 💚 compile 2m 58s the patch passed
+1 💚 javac 2m 58s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 37s the patch passed
+1 💚 spotbugs 1m 38s the patch passed
+1 💚 hadoopcheck 10m 25s Patch does not cause any errors with Hadoop 3.3.6 3.4.0.
+1 💚 spotless 0m 44s patch has no errors when running spotless:check.
_ Other Tests _
+1 💚 asflicense 0m 11s The patch does not generate ASF License warnings.
35m 35s
Subsystem Report/Notes
Docker ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6055/15/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #6055
JIRA Issue HBASE-28396
Optional Tests dupname asflicense javac spotbugs checkstyle codespell detsecrets compile hadoopcheck hbaseanti spotless
uname Linux 2a5c88a88eba 5.4.0-186-generic #206-Ubuntu SMP Fri Apr 26 12:31:10 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / f626549
Default Java Eclipse Adoptium-17.0.11+9
Max. process+thread count 85 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6055/15/console
versions git=2.34.1 maven=3.9.8 spotbugs=4.7.3
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase
Copy link

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 26s Docker mode activated.
-0 ⚠️ yetus 0m 2s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+1 💚 mvninstall 3m 38s master passed
+1 💚 compile 0m 59s master passed
+1 💚 javadoc 0m 28s master passed
+1 💚 shadedjars 5m 55s branch has no errors when building our shaded downstream artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 3m 8s the patch passed
+1 💚 compile 1m 0s the patch passed
+1 💚 javac 1m 0s the patch passed
+1 💚 javadoc 0m 28s the patch passed
+1 💚 shadedjars 5m 51s patch has no errors when building our shaded downstream artifacts.
_ Other Tests _
+1 💚 unit 216m 19s hbase-server in the patch passed.
242m 46s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6055/15/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile
GITHUB PR #6055
JIRA Issue HBASE-28396
Optional Tests javac javadoc unit compile shadedjars
uname Linux ea401ec64b99 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / f626549
Default Java Eclipse Adoptium-17.0.11+9
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6055/15/testReport/
Max. process+thread count 4654 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6055/15/console
versions git=2.34.1 maven=3.9.8
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@ndimiduk ndimiduk merged commit f4a999f into apache:master Sep 12, 2024
@ndimiduk
Copy link
Member

Sadly seems like unique tests are failing on each run, mostly related to timeouts. One of the tests timed out when waiting on a call to HBaseTestingUtil#waitTableAvailable. Is it possible that the test environment is in some precarious state at the moment?

Yeah generally we have something that causes master initialization to take a long time. I'd love to be able to track it down, but i've not yet found a thread dump that showed where the issue is.

ndimiduk pushed a commit to ndimiduk/hbase that referenced this pull request Sep 12, 2024
Co-authored-by: Hernan Gelaf-Romer <hgelafromer@hubspot.com>
Signed-off-by: Nick Dimiduk <ndimiduk@apache.org>
ndimiduk pushed a commit that referenced this pull request Sep 13, 2024
Co-authored-by: Hernan Gelaf-Romer <hgelafromer@hubspot.com>
Signed-off-by: Nick Dimiduk <ndimiduk@apache.org>
hgromer added a commit to hgromer/hbase that referenced this pull request Sep 13, 2024
hgromer added a commit to hgromer/hbase that referenced this pull request Sep 13, 2024
ndimiduk pushed a commit that referenced this pull request Sep 16, 2024
Co-authored-by: Hernan Gelaf-Romer <hgelafromer@hubspot.com>
Signed-off-by: Nick Dimiduk <ndimiduk@apache.org>
ndimiduk pushed a commit to ndimiduk/hbase that referenced this pull request Sep 16, 2024
Co-authored-by: Hernan Gelaf-Romer <hgelafromer@hubspot.com>
Signed-off-by: Nick Dimiduk <ndimiduk@apache.org>
ndimiduk pushed a commit that referenced this pull request Sep 16, 2024
Co-authored-by: Hernan Gelaf-Romer <hgelafromer@hubspot.com>
Signed-off-by: Nick Dimiduk <ndimiduk@apache.org>
hgromer added a commit to HubSpot/hbase that referenced this pull request Sep 17, 2024
Co-authored-by: Hernan Gelaf-Romer <hgelafromer@hubspot.com>
Signed-off-by: Nick Dimiduk <ndimiduk@apache.org>
hgromer added a commit to HubSpot/hbase that referenced this pull request Sep 17, 2024
Co-authored-by: Hernan Gelaf-Romer <hgelafromer@hubspot.com>
Signed-off-by: Nick Dimiduk <ndimiduk@apache.org>
ndimiduk added a commit to ndimiduk/hbase that referenced this pull request Oct 4, 2024
hgromer added a commit to HubSpot/hbase that referenced this pull request Oct 10, 2024
HubSpot Backport: HBASE-28396: Quota throttling can cause a leak of scanners (apache#6055)
hgromer added a commit to HubSpot/hbase that referenced this pull request Oct 10, 2024
…canners (apache#6055) (#111)

* HBASE-28396 Quota throttling can cause a leak of scanners (apache#6055)

Co-authored-by: Hernan Gelaf-Romer <hgelafromer@hubspot.com>
Signed-off-by: Nick Dimiduk <ndimiduk@apache.org>

* fix

---------

Signed-off-by: Nick Dimiduk <ndimiduk@apache.org>
Co-authored-by: Hernan Gelaf-Romer <hgelafromer@hubspot.com>
szabowexler pushed a commit to HubSpot/hbase that referenced this pull request Feb 3, 2025
Co-authored-by: Hernan Gelaf-Romer <hgelafromer@hubspot.com>
Signed-off-by: Nick Dimiduk <ndimiduk@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants