Skip to content

Remove all calls to System.gc() in PerQueryCPUMemAccountantFactory#16374

Merged
gortiz merged 4 commits intoapache:masterfrom
vrajat:rv-tracing-remove-gc
Jul 18, 2025
Merged

Remove all calls to System.gc() in PerQueryCPUMemAccountantFactory#16374
gortiz merged 4 commits intoapache:masterfrom
vrajat:rv-tracing-remove-gc

Conversation

@vrajat
Copy link
Contributor

@vrajat vrajat commented Jul 17, 2025

Calls to System.gc() force a GC pause and is not recommended. From experience such a call can worsen the state of the system rather than help.

A small unrelated changed to log if the accountant used a callback to cancel is also in this PR.

Closes #16373

Calls to System.gc() force a GC pause and is not recommended. From experience such a call can worsen the state of the system rather than help.

Closes apache#16373
@vrajat vrajat requested review from Copilot, gortiz and yashmayya and removed request for yashmayya July 17, 2025 13:53
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR removes all calls to System.gc() from the PerQueryCPUMemAccountantFactory to improve system performance. Forced garbage collection calls can cause unnecessary GC pauses and may worsen system performance rather than help with memory management.

Key changes:

  • Removed GC-related configuration properties and logic from QueryMonitorConfig
  • Eliminated forced garbage collection calls in query monitoring and termination logic
  • Simplified the query killing mechanism by removing GC backoff counting

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
QueryMonitorConfig.java Removed GC-related configuration fields, constructors, and getter methods
PerQueryCPUMemAccountantFactory.java Removed System.gc() calls and related logic from query monitoring and killing mechanisms

@vrajat vrajat added the bugfix label Jul 17, 2025
Copy link
Contributor

@yashmayya yashmayya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new tests added in #16299 are failing.

Copy link
Contributor

@praveenc7 praveenc7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the intention of the change to improve performance. Have we validated through some integ-test or running in a prod like environment with actual queries that it doesn't cause any functionality regression.

Can we wait a till Monday before we merge? Let me check why we added forced GC. I think we did some analysis when adding this

@vrajat
Copy link
Contributor Author

vrajat commented Jul 17, 2025

I understand the intention of the change to improve performance. Have we validated through some integ-test or running in a prod like environment with actual queries that it doesn't cause any functionality regression.

Can we wait a till Monday before we merge? Let me check why we added forced GC. I think we did some analysis when adding this

@praveenc7 - Would definitely like to understand why this was added. This call is consistently causing a major collection and pausing the JVM both in production and a cluster test based on production-sized dataset and workloads.

Java Docs also mention this is should be generally avoided: https://www.oracle.com/java/technologies/javase/gc-tuning-6.html#other_considerations

You can check the JFR screenshot in the attached issue #16373

@vrajat
Copy link
Contributor Author

vrajat commented Jul 17, 2025

The new tests added in #16299 are failing.

Found a bug - fixed.

@vrajat
Copy link
Contributor Author

vrajat commented Jul 17, 2025

This SO answer expresses what I know from experience well: https://stackoverflow.com/a/2414120

The call was added 3 years ago: 69ad81b

Copy link
Contributor

@yashmayya yashmayya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, invoking GC explicitly is an anti-pattern. It generally reduces overall GC throughput while consuming more resources, and the only likely scenario it could be useful in is to tune latency vs throughput at a particular point in the application lifecycle (which isn't applicable to Pinot and definitely not the way we're using it here). Furthermore, for garbage collectors like G1GC, explicit System.gc() calls trigger a full, stop-the-world (STW) garbage collection by default.

@yashmayya
Copy link
Contributor

This SO answer expresses what I know from experience as well. https://stackoverflow.com/a/2414120

This answer in the same thread also explains how it can actually hurt performance and reduce overall GC throughput - https://stackoverflow.com/a/2414621.

@codecov-commenter
Copy link

codecov-commenter commented Jul 17, 2025

Codecov Report

Attention: Patch coverage is 57.14286% with 3 lines in your changes missing coverage. Please review.

Project coverage is 63.31%. Comparing base (1a476de) to head (b1bb489).
Report is 463 commits behind head on master.

Files with missing lines Patch % Lines
...re/accounting/PerQueryCPUMemAccountantFactory.java 57.14% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #16374      +/-   ##
============================================
+ Coverage     62.90%   63.31%   +0.41%     
+ Complexity     1386     1363      -23     
============================================
  Files          2867     2976     +109     
  Lines        163354   172766    +9412     
  Branches      24952    26459    +1507     
============================================
+ Hits         102755   109392    +6637     
- Misses        52847    55024    +2177     
- Partials       7752     8350     +598     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.27% <57.14%> (+0.40%) ⬆️
java-21 63.30% <57.14%> (+0.47%) ⬆️
skip-bytebuffers-false ?
skip-bytebuffers-true ?
temurin 63.31% <57.14%> (+0.41%) ⬆️
unittests 63.31% <57.14%> (+0.41%) ⬆️
unittests1 56.45% <57.14%> (+0.63%) ⬆️
unittests2 33.31% <0.00%> (-0.27%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@gortiz gortiz merged commit 12e6c88 into apache:master Jul 18, 2025
18 checks passed
@vrajat vrajat deleted the rv-tracing-remove-gc branch July 18, 2025 07:09
mqliang pushed a commit to mqliang/pinot that referenced this pull request Feb 10, 2026
* [Query Resource Isolation] Workload Configs (apache#15109)

* Workload Configs

* workload config

* Add API

* config

* Change config structure

* Propagation strategy

* Fix style check

* Cost spliting on update

* Table addition propagation

* perf

* Tests

* test

* test 2

* Review comments 1

* review comments 3

* review comments 3

* name change

* review comments 4

* Fix TableDoesNotExistError for hybrid tables in MSE queries in controller API (apache#16102)

* Make ThreadResourceUsageProvider a Helper/Utility Class. (apache#16051)

* ThreadResourceUsageProvider is a helper class. ThreadResourceContext tracks resource usage.

Fix updateConcurrently

* Rename to ThreadResourceSnapshot

* Clean up

* Add javadoc

* Done use auto closeable

* Checkstyle

* Fix compilation error

* Add back removed functions in SPI

* Remove private constructor because japicmp complains.

* Add setThreadResourceUsageProvider because of backward-incompatible checks

* Add setThreadResourceUsageProvider because of backward-incompatible checks

* Fix test

* Fix ThreadResourceSnapshot usage and tests

* Store cpu sample in nanoseconds.

* Reduce logs and improve logging when queries are terminated due to OOM. (apache#16172)

* Dynamic PerQueryCPUMemAccountant Config on Servers  (apache#16219)

* Checkpoint

* Register change handler

* Fix bugs. Manually tested

* Checkstyle

* Tests

* Add pre-check that values are default

* Undo typo fix

* Update QueryRunner to make use of window function overflow handling server configurations (apache#16108)

* Add multistage thread limiting configs at the broker and server level (apache#16080)

* Adding changes for supporting RLS (apache#16043)

* Use stats cache on error instead of the chained mechanism (apache#15992)

* Improve broker error messaging when broker is the one reporting the failure (apache#16076)

* Introduce MSE active and passive timeouts (apache#16075)

* Throttle SSE & MSE Tasks if Server heap usage is above a threshold (apache#16271)

* Fix QueryScheduler constructor using class name. (apache#16280)

* Fix QueryScheduler constructor using class name.

* Fix test

* [Query Resource Isolation] WorkloadBudgetManager and Host enforcement (apache#15798)

* QRI - WorkloadBudgetManager implementation

* Address review comments

* Remove singleton & signature fix

* Fix compatibility checker

* Review comments

* Move WorkloadBudgetManager to core.

---------

Co-authored-by: praveenc7 <praveenkchaganlal@gmail.com>

* Eliminate duplicate cancel attempts in PerQueryCPUMemAccountant (apache#16299)

* Add basic 1 query tests

* Add more tests

* Add ability to remember cancel queries.

* Clean up if conditions in killMostExpensiveQuery

* Fix test failures.

* Address review comments.

* Use QueryCancelCallback to cancel queries from ThreadResourceUsageAccountant (apache#16142)

* Remove all calls to System.gc() in PerQueryCPUMemAccountantFactory (apache#16374)

* Initialize thread accountant just before serving queries (apache#16326)

* Allow Reset of ThreadResourceUsageAccountant in Tracing.java (apache#16360)

* Queries now self terminate if in panic mode. (apache#16380)

* Queries now self terminate if in panic mode.

* Add config test

* Hard kill on critical level.

* Fix configs

* Separate anchor thread interruption.

* Checkstyle

* Review comments

* remove code for critical level

---------

Co-authored-by: Rajat Venkatesh <vrajat@users.noreply.github.com>

* [Query Resource Isolation] Additonal Sampling for Broker and Server (apache#16164)

* fix

* sampling

* Broker sampling

* revert integ-test

* Fix test failures

* review comments

* remove MSE

* broker auth

* remove per pruner & planner sample

* Use Broker's accountant to sample in the request handler. (apache#16439)

* [Query Resource Isolation] Workload Scheduler (apache#16018)

* QRI - WorkloadBudgetManager implementation

* Address review comments

* scheduler

* unit test

* review comments: metrics, secondary, resource-manager

* remove broker admission

* Remove default budget

---------

Co-authored-by: Vivek Iyer Vaidyanathan Iyer <vvaidyanathan@linkedin.com>

* Cleanup deprecated methods in ThreadResourceUsageAccountant (apache#16479)

* Remove unnecessary methods and config for ThreadResourceUsageAccountant (apache#16490)

* Add tests for OOM Termination of MSE queries. (apache#16514)

* Fix a flaky assert when testing OOM Cancellation of MSE Queries (apache#16533)

* Disable Flaky Tests (apache#16554)

This is a follow-up to apache#16533
The fix for a flaky test did not work. This PR disables these tests temporarily.

* Use correlation ID instead of request id in PerQueryCpuMemAccountant (apache#16040)

* [Query Resource Isolation]Interface for Workload Stats Collection (apache#16340)

* Interface for Stats Collection

* Additional comments

* inherit

* additional class comments

* [Query Resource Isolation] Fix Refresh message (apache#16636)

* Fix Refresh message

* delete queryworkload message handler

* info -> debug logs

* reduce logging (apache#16698)

* style check

* [Query Workload Isolation] Cost-split support  (apache#16672)

* splits

* Cost split

* test

* propagation entity change & java doc

* Propagation scheme review comments

* empty commit to trigger build

* Reduce log for PerQueryCPUMemResourceUsageAccountant (apache#16642)

---------

Co-authored-by: Rajat Venkatesh <1638298+vrajat@users.noreply.github.com>
Co-authored-by: Yash Mayya <yash.mayya@gmail.com>
Co-authored-by: Satwik Pachigolla <40644097+satwik-pachigolla@users.noreply.github.com>
Co-authored-by: 9aman <35227405+9aman@users.noreply.github.com>
Co-authored-by: Gonzalo Ortiz Jaureguizar <gortiz@users.noreply.github.com>
Co-authored-by: Vivek Iyer Vaidyanathan <vvivekiyer@gmail.com>
Co-authored-by: Xiaotian (Jackie) Jiang <17555551+Jackie-Jiang@users.noreply.github.com>
Co-authored-by: Rajat Venkatesh <vrajat@users.noreply.github.com>
Co-authored-by: Vivek Iyer Vaidyanathan Iyer <vvaidyanathan@linkedin.com>
mqliang pushed a commit to mqliang/pinot that referenced this pull request Feb 10, 2026
* [Query Resource Isolation] Workload Configs (apache#15109)

* Workload Configs

* workload config

* Add API

* config

* Change config structure

* Propagation strategy

* Fix style check

* Cost spliting on update

* Table addition propagation

* perf

* Tests

* test

* test 2

* Review comments 1

* review comments 3

* review comments 3

* name change

* review comments 4

* Fix TableDoesNotExistError for hybrid tables in MSE queries in controller API (apache#16102)

* Make ThreadResourceUsageProvider a Helper/Utility Class. (apache#16051)

* ThreadResourceUsageProvider is a helper class. ThreadResourceContext tracks resource usage.

Fix updateConcurrently

* Rename to ThreadResourceSnapshot

* Clean up

* Add javadoc

* Done use auto closeable

* Checkstyle

* Fix compilation error

* Add back removed functions in SPI

* Remove private constructor because japicmp complains.

* Add setThreadResourceUsageProvider because of backward-incompatible checks

* Add setThreadResourceUsageProvider because of backward-incompatible checks

* Fix test

* Fix ThreadResourceSnapshot usage and tests

* Store cpu sample in nanoseconds.

* Reduce logs and improve logging when queries are terminated due to OOM. (apache#16172)

* Dynamic PerQueryCPUMemAccountant Config on Servers  (apache#16219)

* Checkpoint

* Register change handler

* Fix bugs. Manually tested

* Checkstyle

* Tests

* Add pre-check that values are default

* Undo typo fix

* Update QueryRunner to make use of window function overflow handling server configurations (apache#16108)

* Add multistage thread limiting configs at the broker and server level (apache#16080)

* Adding changes for supporting RLS (apache#16043)

* Use stats cache on error instead of the chained mechanism (apache#15992)

* Improve broker error messaging when broker is the one reporting the failure (apache#16076)

* Introduce MSE active and passive timeouts (apache#16075)

* Throttle SSE & MSE Tasks if Server heap usage is above a threshold (apache#16271)

* Fix QueryScheduler constructor using class name. (apache#16280)

* Fix QueryScheduler constructor using class name.

* Fix test

* [Query Resource Isolation] WorkloadBudgetManager and Host enforcement (apache#15798)

* QRI - WorkloadBudgetManager implementation

* Address review comments

* Remove singleton & signature fix

* Fix compatibility checker

* Review comments

* Move WorkloadBudgetManager to core.

---------

Co-authored-by: praveenc7 <praveenkchaganlal@gmail.com>

* Eliminate duplicate cancel attempts in PerQueryCPUMemAccountant (apache#16299)

* Add basic 1 query tests

* Add more tests

* Add ability to remember cancel queries.

* Clean up if conditions in killMostExpensiveQuery

* Fix test failures.

* Address review comments.

* Use QueryCancelCallback to cancel queries from ThreadResourceUsageAccountant (apache#16142)

* Remove all calls to System.gc() in PerQueryCPUMemAccountantFactory (apache#16374)

* Initialize thread accountant just before serving queries (apache#16326)

* Allow Reset of ThreadResourceUsageAccountant in Tracing.java (apache#16360)

* Queries now self terminate if in panic mode. (apache#16380)

* Queries now self terminate if in panic mode.

* Add config test

* Hard kill on critical level.

* Fix configs

* Separate anchor thread interruption.

* Checkstyle

* Review comments

* remove code for critical level

---------

Co-authored-by: Rajat Venkatesh <vrajat@users.noreply.github.com>

* [Query Resource Isolation] Additonal Sampling for Broker and Server (apache#16164)

* fix

* sampling

* Broker sampling

* revert integ-test

* Fix test failures

* review comments

* remove MSE

* broker auth

* remove per pruner & planner sample

* Use Broker's accountant to sample in the request handler. (apache#16439)

* [Query Resource Isolation] Workload Scheduler (apache#16018)

* QRI - WorkloadBudgetManager implementation

* Address review comments

* scheduler

* unit test

* review comments: metrics, secondary, resource-manager

* remove broker admission

* Remove default budget

---------

Co-authored-by: Vivek Iyer Vaidyanathan Iyer <vvaidyanathan@linkedin.com>

* Cleanup deprecated methods in ThreadResourceUsageAccountant (apache#16479)

* Remove unnecessary methods and config for ThreadResourceUsageAccountant (apache#16490)

* Add tests for OOM Termination of MSE queries. (apache#16514)

* Fix a flaky assert when testing OOM Cancellation of MSE Queries (apache#16533)

* Disable Flaky Tests (apache#16554)

This is a follow-up to apache#16533
The fix for a flaky test did not work. This PR disables these tests temporarily.

* Use correlation ID instead of request id in PerQueryCpuMemAccountant (apache#16040)

* [Query Resource Isolation]Interface for Workload Stats Collection (apache#16340)

* Interface for Stats Collection

* Additional comments

* inherit

* additional class comments

* [Query Resource Isolation] Fix Refresh message (apache#16636)

* Fix Refresh message

* delete queryworkload message handler

* info -> debug logs

* reduce logging (apache#16698)

* style check

* [Query Workload Isolation] Cost-split support  (apache#16672)

* splits

* Cost split

* test

* propagation entity change & java doc

* Propagation scheme review comments

* empty commit to trigger build

* Reduce log for PerQueryCPUMemResourceUsageAccountant (apache#16642)

* [refactor] Switching to RoutingManager for broker request handlers (apache#16442)

Co-authored-by: Shaurya Chaturvedi <shauryachats@uber.com>

* Fix broker request id generator to avoid generating same id (apache#16661)

* Introduce QueryExecutionContext to manage query life cycle (apache#16728)

* Introduce QueryExecutionContext to manage query life cycle 2 (apache#16728)

---------

Co-authored-by: Rajat Venkatesh <1638298+vrajat@users.noreply.github.com>
Co-authored-by: Yash Mayya <yash.mayya@gmail.com>
Co-authored-by: Satwik Pachigolla <40644097+satwik-pachigolla@users.noreply.github.com>
Co-authored-by: 9aman <35227405+9aman@users.noreply.github.com>
Co-authored-by: Gonzalo Ortiz Jaureguizar <gortiz@users.noreply.github.com>
Co-authored-by: Vivek Iyer Vaidyanathan <vvivekiyer@gmail.com>
Co-authored-by: Xiaotian (Jackie) Jiang <17555551+Jackie-Jiang@users.noreply.github.com>
Co-authored-by: Rajat Venkatesh <vrajat@users.noreply.github.com>
Co-authored-by: Vivek Iyer Vaidyanathan Iyer <vvaidyanathan@linkedin.com>
Co-authored-by: Shaurya Chaturvedi <shauryachats@gmail.com>
Co-authored-by: Shaurya Chaturvedi <shauryachats@uber.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

System.gc() call in Panic Mode in PerQueryCPUMemAccountant causes high GC pause

6 participants