Allow Reset of ThreadResourceUsageAccountant in Tracing.java#16360

Merged
gortiz merged 13 commits into apache:master from vrajat:rv-tracing-testable
Jul 23, 2025

@vrajat vrajat commented Jul 16, 2025

This PR removes the restriction of allowing only one ThreadResourceUsageAccountant per JVM, which was caused by the following code:

    static final ThreadResourceUsageAccountant ACCOUNTANT =
        ACCOUNTANT_REGISTRATION.get() == null ? createDefaultThreadAccountant() : ACCOUNTANT_REGISTRATION.get();

While this makes sense in production, where the accountant is set up during startup, it made testing hard. It was not possible to change the accountant for every test. Consequently, ResourceManagerAccountingTest was not really testing the accountant created in that specific test.
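
To make the problem concrete, here is a minimal sketch of how a `static final` field in a holder class pins the first value it sees. `FinalHolderDemo`, its `Accountant` class, and `REGISTRATION` are hypothetical stand-ins for illustration only; they are not Pinot's classes.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-ins for illustration; these are not Pinot's classes.
class FinalHolderDemo {
  static final class Accountant {
    final String _name;

    Accountant(String name) {
      _name = name;
    }
  }

  // Plays the role of ACCOUNTANT_REGISTRATION in the snippet above.
  static final AtomicReference<Accountant> REGISTRATION = new AtomicReference<>();

  // The field is initialized once, when Holder is first loaded, and never again.
  static final class Holder {
    static final Accountant ACCOUNTANT =
        REGISTRATION.get() == null ? new Accountant("default") : REGISTRATION.get();
  }

  static Accountant current() {
    return Holder.ACCOUNTANT;
  }

  public static void main(String[] args) {
    REGISTRATION.set(new Accountant("test-1"));
    System.out.println(current()._name); // the first access fixes the value
    REGISTRATION.set(new Accountant("test-2"));
    System.out.println(current()._name); // same accountant as the first access
  }
}
```

In this sketch, a second test that registers its own accountant still observes the one captured by the first access, which matches the behaviour described above.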

Another issue was that it was not possible to get the accountant created by the broker or server and check its state in tests. This could only be done by carefully orchestrating the startup sequence. This made test triage hard for those unfamiliar with this module.

This also enables a class of tests where a new accountant can be re-initialized by restarting a server or broker. So a single integration test can check different combinations of the initialization code.

The final issue was that the watcher task ran in a static executor service. So even if a test created multiple thread accountants, only the first one's watcher task used to run. The others were queued.
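
The queuing effect can be reproduced with a small sketch. It assumes the shared executor has a single worker thread, which the description implies but does not state; `SharedExecutorDemo` and its methods are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; assumes the shared executor has a single worker thread.
class SharedExecutorDemo {
  // A watcher that signals it has started, then loops until interrupted.
  static Runnable watcher(CountDownLatch started) {
    return () -> {
      started.countDown();
      while (!Thread.currentThread().isInterrupted()) {
        try {
          Thread.sleep(10);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // restore the flag so the loop exits
        }
      }
    };
  }

  // Submits two watchers and reports whether both started within the timeouts.
  static boolean secondWatcherStarts(boolean shareExecutor) {
    ExecutorService first = Executors.newSingleThreadExecutor();
    ExecutorService second = shareExecutor ? first : Executors.newSingleThreadExecutor();
    CountDownLatch firstStarted = new CountDownLatch(1);
    CountDownLatch secondStarted = new CountDownLatch(1);
    try {
      first.execute(watcher(firstStarted));
      second.execute(watcher(secondStarted));
      return firstStarted.await(1, TimeUnit.SECONDS) && secondStarted.await(1, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      // Interrupts running watchers and discards queued ones.
      first.shutdownNow();
      second.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("shared executor: " + secondWatcherStarts(true));
    System.out.println("separate executors: " + secondWatcherStarts(false));
  }
}
```

With a shared single-thread executor the second watcher stays in the queue behind the first, so the shared case is expected to report that it never started, while the separate-executor case is expected to report that it did.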

The following are the main changes:
Holder.ACCOUNTANT is no longer a static final variable; the final restriction has been removed. register, unregisterThreadAccountant, createDefaultAccountant and initializeThreadAccountant are available to set up the accountant correctly in Server, Broker and in tests. unregisterThreadAccountant is especially useful for resetting the global state between tests.

createDefaultAccountant has been added to return the thread accountant that was created. A server or broker stores the accountant it created even if it wasn't registered with the global singleton.

The executor service is now a member variable of the accountant and not a static member of the class.
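
The changes above can be sketched roughly as follows. This is not the code in Tracing.java: `ResettableAccountantDemo`, `register`, `unregister`, and `current` are hypothetical names, and stopping the watcher inside `unregister` is an assumption made for the example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of the lifecycle described above; not the code in Tracing.java.
class ResettableAccountantDemo {
  static final class Accountant {
    // Each accountant owns its executor, so stopping one never affects another.
    private final ExecutorService _executor = Executors.newSingleThreadExecutor(r -> {
      Thread thread = new Thread(r, "watcher");
      thread.setDaemon(true);
      return thread;
    });

    void startWatcherTask() {
      _executor.execute(() -> {
        while (!Thread.currentThread().isInterrupted()) {
          try {
            Thread.sleep(10);
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        }
      });
    }

    void stopWatcherTask() {
      _executor.shutdownNow();
    }

    boolean isStopped() {
      return _executor.isShutdown();
    }
  }

  // Not final, so a test can clear it and register a fresh accountant.
  private static Accountant _accountant;

  // Returns false if another accountant is already registered.
  static synchronized boolean register(Accountant accountant) {
    if (_accountant != null) {
      return false;
    }
    _accountant = accountant;
    return true;
  }

  // Assumption for this sketch: unregistering also stops the watcher task.
  static synchronized void unregister() {
    if (_accountant != null) {
      _accountant.stopWatcherTask();
      _accountant = null;
    }
  }

  static synchronized Accountant current() {
    return _accountant;
  }

  public static void main(String[] args) {
    Accountant first = new Accountant();
    first.startWatcherTask();
    System.out.println("registered first: " + register(first));
    unregister(); // e.g. in a test's tear-down method
    System.out.println("registered second: " + register(new Accountant()));
    unregister();
  }
}
```

A test would call `unregister()` in its tear-down so that the next test starts from a clean global state.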

Closes #15231

codecov-commenter commented Jul 16, 2025

Codecov Report

Attention: Patch coverage is 44.61538% with 36 lines in your changes missing coverage. Please review.

Project coverage is 63.34%. Comparing base (1a476de) to head (58130ca).
Report is 489 commits behind head on master.

Files with missing lines                               Patch %  Lines
...inot/controller/helix/ControllerRequestClient.java  0.00%    13 Missing ⚠️
.../pinot/server/starter/helix/BaseServerStarter.java  0.00%    9 Missing ⚠️
.../main/java/org/apache/pinot/spi/trace/Tracing.java  65.38%   5 Missing and 4 partials ⚠️
...re/accounting/PerQueryCPUMemAccountantFactory.java  66.66%   1 Missing and 1 partial ⚠️
...spi/utils/builder/ControllerRequestURLBuilder.java  0.00%    2 Missing ⚠️
...ore/accounting/ResourceUsageAccountantFactory.java  80.00%   0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #16360      +/-   ##
============================================
+ Coverage     62.90%   63.34%   +0.44%     
+ Complexity     1386     1363      -23     
============================================
  Files          2867     2984     +117     
  Lines        163354   173335    +9981     
  Branches      24952    26562    +1610     
============================================
+ Hits         102755   109803    +7048     
- Misses        52847    55143    +2296     
- Partials       7752     8389     +637     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.31% <44.61%> (+0.44%) ⬆️
java-21 63.33% <44.61%> (+0.51%) ⬆️
skip-bytebuffers-false ?
skip-bytebuffers-true ?
temurin 63.34% <44.61%> (+0.44%) ⬆️
unittests 63.34% <44.61%> (+0.44%) ⬆️
unittests1 56.48% <65.85%> (+0.66%) ⬆️
unittests2 33.29% <3.07%> (-0.28%) ⬇️

@vrajat vrajat force-pushed the rv-tracing-testable branch from 8e3fde2 to 5d77d2b Compare July 18, 2025 12:17
@vrajat vrajat force-pushed the rv-tracing-testable branch from 810850e to cc537d8 Compare July 22, 2025 04:04
@vrajat vrajat changed the title from "Allow reset of ThreadResourceUsageAccountant" to "Allow reset of ThreadResourceUsageAccountant in Tracing.java" Jul 22, 2025
@vrajat vrajat changed the title from "Allow reset of ThreadResourceUsageAccountant in Tracing.java" to "Allow Reset of ThreadResourceUsageAccountant in Tracing.java" Jul 22, 2025
@vrajat vrajat requested review from Copilot, gortiz, vvivekiyer and yashmayya and removed request for yashmayya July 22, 2025 05:33
@vrajat vrajat marked this pull request as ready for review July 22, 2025 05:33

vrajat commented Jul 22, 2025

@praveenc7 cc

Copilot AI left a comment

Pull Request Overview

This PR modifies the ThreadResourceUsageAccountant system in Pinot to allow resetting and re-registering accountants, primarily to improve testability. The key changes remove the static final restriction on the accountant singleton and add proper lifecycle management methods.

  • Removes static final restriction on ThreadResourceUsageAccountant singleton to allow reset/re-registration
  • Adds lifecycle management methods (unregister, createThreadAccountant) for better test control
  • Moves executor service from static to instance member to avoid task queuing issues in tests

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

File                                      Description
Tracing.java                              Core changes to make accountant non-final and add lifecycle methods
ThreadResourceUsageAccountant.java        Adds stopWatcherTask default method interface
ResourceUsageAccountantFactory.java       Moves executor to instance variable and implements stopWatcherTask
PerQueryCPUMemAccountantFactory.java      Moves executor to instance variable and implements stopWatcherTask
BaseServerStarter.java                    Updates to use createThreadAccountant and store reference
BaseBrokerStarter.java                    Updates to use createThreadAccountant and store reference
ResourceManagerAccountingTest.java        Updates tests to use new accountant lifecycle methods
OOMProtectionEnabledIntegrationTest.java  Adds integration tests for accountant reset functionality
ControllerRequestURLBuilder.java          Adds cluster config management URLs
ControllerRequestClient.java              Adds cluster config update/delete methods
ControllerTest.java                       Adds cluster config test helper methods
CPUMemThreadLevelAccountingObjects.java   Clears error status when setting thread to idle
Comments suppressed due to low confidence (1)

pinot-spi/src/main/java/org/apache/pinot/spi/trace/Tracing.java:70

  • [nitpick] The field name '_accountant' uses underscore prefix which is inconsistent with Java naming conventions for static fields. Consider renaming to 'ACCOUNTANT' to maintain consistency with other static final fields in the class.
    static ThreadResourceUsageAccountant _accountant =

Comment on lines +147 to +148
Holder._accountant = accountant;
ACCOUNTANT_REGISTRATION.set(accountant);
Copilot AI Jul 22, 2025

Setting ACCOUNTANT_REGISTRATION after already setting Holder._accountant creates potential race conditions. The registration should be set before updating the holder to ensure consistency.

Suggested change
- Holder._accountant = accountant;
- ACCOUNTANT_REGISTRATION.set(accountant);
+ ACCOUNTANT_REGISTRATION.set(accountant);
+ Holder._accountant = accountant;
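
As a general Java note on the ordering discussed here: swapping two independent writes still leaves a window in which a concurrent reader can see one updated and the other not. If both values had to change together (an assumption; the PR does not state this requirement), one option is to guard both writes and the reads with a single lock. The sketch below uses hypothetical names (`PairedUpdateDemo`, `neverTorn`) that only echo the diff above.

```java
// Hypothetical sketch; the field names only echo the diff above.
class PairedUpdateDemo {
  private static final Object LOCK = new Object();
  private static Object _accountant;
  private static Object _registration;

  // Both writes happen under one lock, so they are published together.
  static void register(Object accountant) {
    synchronized (LOCK) {
      _accountant = accountant;
      _registration = accountant;
    }
  }

  // Readers take the same lock, so they never see only one of the two writes.
  static boolean isConsistent() {
    synchronized (LOCK) {
      return _accountant == _registration;
    }
  }

  // Runs a writer thread while the caller keeps checking for a torn pair.
  static boolean neverTorn(int writes) {
    Thread writer = new Thread(() -> {
      for (int i = 0; i < writes; i++) {
        register(new Object());
      }
    });
    writer.start();
    boolean consistent = true;
    while (writer.isAlive()) {
      consistent &= isConsistent();
    }
    try {
      writer.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return consistent && isConsistent();
  }

  public static void main(String[] args) {
    System.out.println("never torn: " + neverTorn(100_000));
  }
}
```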

  public void run() {
    LOGGER.debug("Running timed task for {}", this.getClass().getName());
-   while (true) {
+   while (!Thread.currentThread().isInterrupted()) {
Copilot AI Jul 22, 2025

The while loop should also check the interrupted flag at the beginning of each iteration to ensure prompt response to interruption. Consider adding Thread.interrupted() check after the sleep/wait operations.
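
For readers unfamiliar with the pattern, here is a minimal sketch of a loop that reacts to interruption both at the top of each iteration and while sleeping. `Thread.sleep` clears the interrupt flag when it throws `InterruptedException`, so the handler sets it again. `InterruptibleWatcherDemo` and the 10 ms period are hypothetical and not taken from the PR.

```java
// Hypothetical watcher loop; the class name and the 10 ms period are illustrative only.
class InterruptibleWatcherDemo {
  static final class Watcher implements Runnable {
    @Override
    public void run() {
      while (!Thread.currentThread().isInterrupted()) {
        try {
          Thread.sleep(10); // stands in for one round of periodic work
        } catch (InterruptedException e) {
          // sleep() clears the interrupt flag when it throws, so set it again
          // to let the loop condition observe the interruption.
          Thread.currentThread().interrupt();
        }
      }
    }
  }

  // Starts a watcher, interrupts it, and reports whether the thread exited within one second.
  static boolean stopsOnInterrupt() {
    Thread thread = new Thread(new Watcher(), "watcher");
    thread.setDaemon(true);
    thread.start();
    try {
      Thread.sleep(50);
      thread.interrupt();
      thread.join(1000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return !thread.isAlive();
  }

  public static void main(String[] args) {
    System.out.println("stopped after interrupt: " + stopsOnInterrupt());
  }
}
```

If the handler swallowed the exception without restoring the flag, the loop condition would stay true and the thread would keep running after `shutdownNow()` or `interrupt()`.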

  @Override
  public void run() {
-   while (true) {
+   while (!Thread.currentThread().isInterrupted()) {
Copilot AI Jul 22, 2025

The while loop should also check the interrupted flag at the beginning of each iteration to ensure prompt response to interruption. Consider adding Thread.interrupted() check after the sleep/wait operations.

@vrajat vrajat requested a review from xiangfu0 July 22, 2025 05:38
@gortiz gortiz merged commit 90346d5 into apache:master Jul 23, 2025
18 checks passed
@@ -323,25 +344,35 @@ public static void clear() {

public static void initializeThreadAccountant(PinotConfiguration config, String instanceId,
Contributor

@vrajat Is there a reason other than backward compatibility to keep this interface around? I don't see it used anywhere other than in some tests. Maybe we can replace it with createThreadAccountant for those handful of cases. It might create confusion about when to use what, wdyt?

Contributor Author

Agree. I think we should do a massive cleanup of all the deprecated functions as well. I haven't yet because it needs work across the workload budget manager as well. Let's coordinate next week?

mqliang pushed a commit to mqliang/pinot that referenced this pull request Feb 10, 2026
Development

Successfully merging this pull request may close these issues.

Tracing::ThreadUsageResourceAccountant cannot be isolated and tested in integration tests

6 participants