Skip to content

[Query Resource Isolation] Additonal Sampling for Broker and Server#16164

Merged
vvivekiyer merged 12 commits intoapache:masterfrom
praveenc7:sample
Jul 25, 2025
Merged

[Query Resource Isolation] Additonal Sampling for Broker and Server#16164
vvivekiyer merged 12 commits intoapache:masterfrom
praveenc7:sample

Conversation

@praveenc7
Copy link
Contributor

@praveenc7 praveenc7 commented Jun 20, 2025

Summary

Query-Resource-Isolation (QRI) relies on periodic CPU / memory samples to detect “run-away” queries and enforce per-workload budgets.
In several incidents we found that no samples are collected during three cost-intensive phases, so large queries sometimes survive far longer than the budget window:

Component Phase currently unsampled Consequence
Broker SQL compilation Big plans compile for > 1 s with 0 ns billed
Broker Routing table build Large fan-out hits un-metered
Server Plan building Segment-level planning can take seconds on wide tables

This PR addresses those gaps by introducing lightweight sampling hooks at the end of each phase on the Broker side.

On the Server, sampling is added at the per-segment plan level.

These hooks are enabled by default since the overhead is minimal:
• On the Broker, sampling incurs only a single MXBean call.
• On the Server, sampling is bounded by the number of segments planned. For valid queries, P95th of queries typically involves fewer than 100 segments per server, with a P50 around 45 segments.

Testing Done

Manual quick start and integ test to validate if sampling is happening during this path.

@codecov-commenter
Copy link

codecov-commenter commented Jun 20, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 63.30%. Comparing base (1a476de) to head (76ac953).
Report is 489 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #16164      +/-   ##
============================================
+ Coverage     62.90%   63.30%   +0.40%     
+ Complexity     1386     1363      -23     
============================================
  Files          2867     2976     +109     
  Lines        163354   172937    +9583     
  Branches      24952    26497    +1545     
============================================
+ Hits         102755   109485    +6730     
- Misses        52847    55083    +2236     
- Partials       7752     8369     +617     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.28% <100.00%> (+0.41%) ⬆️
java-21 63.28% <100.00%> (+0.46%) ⬆️
skip-bytebuffers-false ?
skip-bytebuffers-true ?
temurin 63.30% <100.00%> (+0.40%) ⬆️
unittests 63.30% <100.00%> (+0.40%) ⬆️
unittests1 56.45% <100.00%> (+0.62%) ⬆️
unittests2 33.28% <60.00%> (-0.29%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@praveenc7 praveenc7 marked this pull request as ready for review July 14, 2025 17:40
Copy link
Contributor

@vvivekiyer vvivekiyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pchaganl In our deployment we noticed a few other misses (eg: Distinct Queries, PercentileTDigest). Can we make sure to add sampling there too?

_brokerMetrics.addPhaseTiming(rawTableName, BrokerQueryPhase.QUERY_ROUTING,
routingEndTimeNs - routingStartTimeNs);
// Account the resource used for routing phase
Tracing.ThreadAccountantOps.sample();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a sample at the end of every phase:

  1. Request Compilation
  2. Authorization
  3. Routing

Also add a comment here as to why sampling for routing is important - with single threaded segment pruners, this can take resources when there are a lot of segments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added for Request Compilation & Routing. I did consider for authorization and looked into our latency metric for authorization it is less than 0.1 ms . So avoided it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) Prefer moving the sampling to the location where we compute phase timing for each phase. That way it's evident that we want to sample each broker phase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Routing was closer to the phase calculation moved compilation as well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add authorization. The idea is that if there's a bug/deployment issue in this phase, we catch it.

compileRequest(requestId, query, sqlNodeAndOptions, request, requesterIdentity, requestContext, httpHeaders,
accessControl);
// Accounts for resource usage of the compilation phase
Tracing.ThreadAccountantOps.sample();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also make this change for the other handlers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently it is added for SSE and MSE.

GrpcBrokerRequestHandler, SingleConnectionBrokerRequestHandler extend this handler for SSE. TimeSeriesRequestHandler seems to be custom handler.

For MSE it is added in QueryEnvironment

public PlanNode makeSegmentPlanNode(SegmentContext segmentContext, QueryContext queryContext) {
rewriteQueryContextWithHints(queryContext, segmentContext.getIndexSegment());
// Sample to track usage of query planning
Tracing.ThreadAccountantOps.sample();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use sampleAndCheckInterruption() in all these places to make sure the thread exists if it's already been interrupted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the usage of sampleAndCheckInterruption() mostly on broker on merge. I taught there was reason for doing that specifically on broker.

Is sampleAndCheckInterruption() a more evolved approach we want to move on all places on server as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should see it used on server as well in some of the operators. The only difference is one blindly samples. While the other checks for interruption and throws exception before sampling.

}
RelRoot relRoot = compileQuery(queryNode, plannerContext);
// Accounts for resource usage of the compilation phase
Tracing.ThreadAccountantOps.sampleMSE();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setupRunner is called much later before dispatch in MultistageBrokerRequestHandler. Two questions here:

  1. It will not work for multistage handlers.
  2. If this is for the single stage handler fallback path, is this really necessary if we already are sampling after the compilation phase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. You are right it was setup after compilation, moved it up, take a look if looks right?
  2. This path is touched by MSE handler so it is not added as a fallback

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look right to me. The QueryEnv.compile() happens in an async call with a separate threadpool. We might have to find another way to account for these async calls in MSE.

@Override
public PlanNode makeSegmentPlanNode(SegmentContext segmentContext, QueryContext queryContext) {
rewriteQueryContextWithHints(queryContext, segmentContext.getIndexSegment());
// Sample to track usage of query planning
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to track 2 things on the server right?

  1. Query Planning
  2. Segment Pruning
    Can you make sure both are tracked?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added for Segment pruning


try (QueryEnvironment.CompiledQuery compiledQuery =
compileQuery(requestId, query, sqlNodeAndOptions, httpHeaders, queryTimer)) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) remove spurious changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was because of a bad merge with master, fixed it

public PlanNode makeSegmentPlanNode(SegmentContext segmentContext, QueryContext queryContext) {
rewriteQueryContextWithHints(queryContext, segmentContext.getIndexSegment());
// Sample to track usage of query planning, since it can be expensive for large segment lists.
Tracing.ThreadAccountantOps.sampleAndCheckInterruption();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep all the sampling in a single place without moving it inside the planning/pruning classes.. That way it's easier to reason about it.

For example: can we add both the pruning and planning sampling in ServerQueryExecutorV1Impl.

  • Pruning sampling in executeInternal() after the selectedSegmentsInfo is computed.
  • Planning in executeQuery before the planCombineQuery is called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that consolidating the sampling into a single location can improve readability, but in this case it would come at the cost of precision—because you’d only capture metrics once, after the entire operation completes. For example, we have several pruner and ideally need to record a sample after each one to maintain accuracy (and the same applies to planning).

@vvivekiyer vvivekiyer merged commit f7b9a25 into apache:master Jul 25, 2025
18 checks passed
mqliang pushed a commit to mqliang/pinot that referenced this pull request Feb 10, 2026
* [Query Resource Isolation] Workload Configs (apache#15109)

* Workload Configs

* workload config

* Add API

* config

* Change config structure

* Propagation strategy

* Fix style check

* Cost spliting on update

* Table addition propagation

* perf

* Tests

* test

* test 2

* Review comments 1

* review comments 3

* review comments 3

* name change

* review comments 4

* Fix TableDoesNotExistError for hybrid tables in MSE queries in controller API (apache#16102)

* Make ThreadResourceUsageProvider a Helper/Utility Class. (apache#16051)

* ThreadResourceUsageProvider is a helper class. ThreadResourceContext tracks resource usage.

Fix updateConcurrently

* Rename to ThreadResourceSnapshot

* Clean up

* Add javadoc

* Done use auto closeable

* Checkstyle

* Fix compilation error

* Add back removed functions in SPI

* Remove private constructor because japicmp complains.

* Add setThreadResourceUsageProvider because of backward-incompatible checks

* Add setThreadResourceUsageProvider because of backward-incompatible checks

* Fix test

* Fix ThreadResourceSnapshot usage and tests

* Store cpu sample in nanoseconds.

* Reduce logs and improve logging when queries are terminated due to OOM. (apache#16172)

* Dynamic PerQueryCPUMemAccountant Config on Servers  (apache#16219)

* Checkpoint

* Register change handler

* Fix bugs. Manually tested

* Checkstyle

* Tests

* Add pre-check that values are default

* Undo typo fix

* Update QueryRunner to make use of window function overflow handling server configurations (apache#16108)

* Add multistage thread limiting configs at the broker and server level (apache#16080)

* Adding changes for supporting RLS (apache#16043)

* Use stats cache on error instead of the chained mechanism (apache#15992)

* Improve broker error messaging when broker is the one reporting the failure (apache#16076)

* Introduce MSE active and passive timeouts (apache#16075)

* Throttle SSE & MSE Tasks if Server heap usage is above a threshold (apache#16271)

* Fix QueryScheduler constructor using class name. (apache#16280)

* Fix QueryScheduler constructor using class name.

* Fix test

* [Query Resource Isolation] WorkloadBudgetManager and Host enforcement (apache#15798)

* QRI - WorkloadBudgetManager implementation

* Address review comments

* Remove singleton & signature fix

* Fix compatibility checker

* Review comments

* Move WorkloadBudgetManager to core.

---------

Co-authored-by: praveenc7 <praveenkchaganlal@gmail.com>

* Eliminate duplicate cancel attempts in PerQueryCPUMemAccountant (apache#16299)

* Add basic 1 query tests

* Add more tests

* Add ability to remember cancel queries.

* Clean up if conditions in killMostExpensiveQuery

* Fix test failures.

* Address review comments.

* Use QueryCancelCallback to cancel queries from ThreadResourceUsageAccountant (apache#16142)

* Remove all calls to System.gc() in PerQueryCPUMemAccountantFactory (apache#16374)

* Initialize thread accountant just before serving queries (apache#16326)

* Allow Reset of ThreadResourceUsageAccountant in Tracing.java (apache#16360)

* Queries now self terminate if in panic mode. (apache#16380)

* Queries now self terminate if in panic mode.

* Add config test

* Hard kill on critical level.

* Fix configs

* Separate anchor thread interruption.

* Checkstyle

* Review comments

* remove code for critical level

---------

Co-authored-by: Rajat Venkatesh <vrajat@users.noreply.github.com>

* [Query Resource Isolation] Additonal Sampling for Broker and Server (apache#16164)

* fix

* sampling

* Broker sampling

* revert integ-test

* Fix test failures

* review comments

* remove MSE

* broker auth

* remove per pruner & planner sample

* Use Broker's accountant to sample in the request handler. (apache#16439)

* [Query Resource Isolation] Workload Scheduler (apache#16018)

* QRI - WorkloadBudgetManager implementation

* Address review comments

* scheduler

* unit test

* review comments: metrics, secondary, resource-manager

* remove broker admission

* Remove default budget

---------

Co-authored-by: Vivek Iyer Vaidyanathan Iyer <vvaidyanathan@linkedin.com>

* Cleanup deprecated methods in ThreadResourceUsageAccountant (apache#16479)

* Remove unnecessary methods and config for ThreadResourceUsageAccountant (apache#16490)

* Add tests for OOM Termination of MSE queries. (apache#16514)

* Fix a flaky assert when testing OOM Cancellation of MSE Queries (apache#16533)

* Disable Flaky Tests (apache#16554)

This is a follow-up to apache#16533
The fix for a flaky test did not work. This PR disables these tests temporarily.

* Use correlation ID instead of request id in PerQueryCpuMemAccountant (apache#16040)

* [Query Resource Isolation]Interface for Workload Stats Collection (apache#16340)

* Interface for Stats Collection

* Additional comments

* inherit

* additional class comments

* [Query Resource Isolation] Fix Refresh message (apache#16636)

* Fix Refresh message

* delete queryworkload message handler

* info -> debug logs

* reduce logging (apache#16698)

* style check

* [Query Workload Isolation] Cost-split support  (apache#16672)

* splits

* Cost split

* test

* propagation entity change & java doc

* Propagation scheme review comments

* empty commit to trigger build

* Reduce log for PerQueryCPUMemResourceUsageAccountant (apache#16642)

---------

Co-authored-by: Rajat Venkatesh <1638298+vrajat@users.noreply.github.com>
Co-authored-by: Yash Mayya <yash.mayya@gmail.com>
Co-authored-by: Satwik Pachigolla <40644097+satwik-pachigolla@users.noreply.github.com>
Co-authored-by: 9aman <35227405+9aman@users.noreply.github.com>
Co-authored-by: Gonzalo Ortiz Jaureguizar <gortiz@users.noreply.github.com>
Co-authored-by: Vivek Iyer Vaidyanathan <vvivekiyer@gmail.com>
Co-authored-by: Xiaotian (Jackie) Jiang <17555551+Jackie-Jiang@users.noreply.github.com>
Co-authored-by: Rajat Venkatesh <vrajat@users.noreply.github.com>
Co-authored-by: Vivek Iyer Vaidyanathan Iyer <vvaidyanathan@linkedin.com>
mqliang pushed a commit to mqliang/pinot that referenced this pull request Feb 10, 2026
* [Query Resource Isolation] Workload Configs (apache#15109)

* Workload Configs

* workload config

* Add API

* config

* Change config structure

* Propagation strategy

* Fix style check

* Cost spliting on update

* Table addition propagation

* perf

* Tests

* test

* test 2

* Review comments 1

* review comments 3

* review comments 3

* name change

* review comments 4

* Fix TableDoesNotExistError for hybrid tables in MSE queries in controller API (apache#16102)

* Make ThreadResourceUsageProvider a Helper/Utility Class. (apache#16051)

* ThreadResourceUsageProvider is a helper class. ThreadResourceContext tracks resource usage.

Fix updateConcurrently

* Rename to ThreadResourceSnapshot

* Clean up

* Add javadoc

* Done use auto closeable

* Checkstyle

* Fix compilation error

* Add back removed functions in SPI

* Remove private constructor because japicmp complains.

* Add setThreadResourceUsageProvider because of backward-incompatible checks

* Add setThreadResourceUsageProvider because of backward-incompatible checks

* Fix test

* Fix ThreadResourceSnapshot usage and tests

* Store cpu sample in nanoseconds.

* Reduce logs and improve logging when queries are terminated due to OOM. (apache#16172)

* Dynamic PerQueryCPUMemAccountant Config on Servers  (apache#16219)

* Checkpoint

* Register change handler

* Fix bugs. Manually tested

* Checkstyle

* Tests

* Add pre-check that values are default

* Undo typo fix

* Update QueryRunner to make use of window function overflow handling server configurations (apache#16108)

* Add multistage thread limiting configs at the broker and server level (apache#16080)

* Adding changes for supporting RLS (apache#16043)

* Use stats cache on error instead of the chained mechanism (apache#15992)

* Improve broker error messaging when broker is the one reporting the failure (apache#16076)

* Introduce MSE active and passive timeouts (apache#16075)

* Throttle SSE & MSE Tasks if Server heap usage is above a threshold (apache#16271)

* Fix QueryScheduler constructor using class name. (apache#16280)

* Fix QueryScheduler constructor using class name.

* Fix test

* [Query Resource Isolation] WorkloadBudgetManager and Host enforcement (apache#15798)

* QRI - WorkloadBudgetManager implementation

* Address review comments

* Remove singleton & signature fix

* Fix compatibility checker

* Review comments

* Move WorkloadBudgetManager to core.

---------

Co-authored-by: praveenc7 <praveenkchaganlal@gmail.com>

* Eliminate duplicate cancel attempts in PerQueryCPUMemAccountant (apache#16299)

* Add basic 1 query tests

* Add more tests

* Add ability to remember cancel queries.

* Clean up if conditions in killMostExpensiveQuery

* Fix test failures.

* Address review comments.

* Use QueryCancelCallback to cancel queries from ThreadResourceUsageAccountant (apache#16142)

* Remove all calls to System.gc() in PerQueryCPUMemAccountantFactory (apache#16374)

* Initialize thread accountant just before serving queries (apache#16326)

* Allow Reset of ThreadResourceUsageAccountant in Tracing.java (apache#16360)

* Queries now self terminate if in panic mode. (apache#16380)

* Queries now self terminate if in panic mode.

* Add config test

* Hard kill on critical level.

* Fix configs

* Separate anchor thread interruption.

* Checkstyle

* Review comments

* remove code for critical level

---------

Co-authored-by: Rajat Venkatesh <vrajat@users.noreply.github.com>

* [Query Resource Isolation] Additonal Sampling for Broker and Server (apache#16164)

* fix

* sampling

* Broker sampling

* revert integ-test

* Fix test failures

* review comments

* remove MSE

* broker auth

* remove per pruner & planner sample

* Use Broker's accountant to sample in the request handler. (apache#16439)

* [Query Resource Isolation] Workload Scheduler (apache#16018)

* QRI - WorkloadBudgetManager implementation

* Address review comments

* scheduler

* unit test

* review comments: metrics, secondary, resource-manager

* remove broker admission

* Remove default budget

---------

Co-authored-by: Vivek Iyer Vaidyanathan Iyer <vvaidyanathan@linkedin.com>

* Cleanup deprecated methods in ThreadResourceUsageAccountant (apache#16479)

* Remove unnecessary methods and config for ThreadResourceUsageAccountant (apache#16490)

* Add tests for OOM Termination of MSE queries. (apache#16514)

* Fix a flaky assert when testing OOM Cancellation of MSE Queries (apache#16533)

* Disable Flaky Tests (apache#16554)

This is a follow-up to apache#16533
The fix for a flaky test did not work. This PR disables these tests temporarily.

* Use correlation ID instead of request id in PerQueryCpuMemAccountant (apache#16040)

* [Query Resource Isolation]Interface for Workload Stats Collection (apache#16340)

* Interface for Stats Collection

* Additional comments

* inherit

* additional class comments

* [Query Resource Isolation] Fix Refresh message (apache#16636)

* Fix Refresh message

* delete queryworkload message handler

* info -> debug logs

* reduce logging (apache#16698)

* style check

* [Query Workload Isolation] Cost-split support  (apache#16672)

* splits

* Cost split

* test

* propagation entity change & java doc

* Propagation scheme review comments

* empty commit to trigger build

* Reduce log for PerQueryCPUMemResourceUsageAccountant (apache#16642)

* [refactor] Switching to RoutingManager for broker request handlers (apache#16442)

Co-authored-by: Shaurya Chaturvedi <shauryachats@uber.com>

* Fix broker request id generator to avoid generating same id (apache#16661)

* Introduce QueryExecutionContext to manage query life cycle (apache#16728)

* Introduce QueryExecutionContext to manage query life cycle 2 (apache#16728)

---------

Co-authored-by: Rajat Venkatesh <1638298+vrajat@users.noreply.github.com>
Co-authored-by: Yash Mayya <yash.mayya@gmail.com>
Co-authored-by: Satwik Pachigolla <40644097+satwik-pachigolla@users.noreply.github.com>
Co-authored-by: 9aman <35227405+9aman@users.noreply.github.com>
Co-authored-by: Gonzalo Ortiz Jaureguizar <gortiz@users.noreply.github.com>
Co-authored-by: Vivek Iyer Vaidyanathan <vvivekiyer@gmail.com>
Co-authored-by: Xiaotian (Jackie) Jiang <17555551+Jackie-Jiang@users.noreply.github.com>
Co-authored-by: Rajat Venkatesh <vrajat@users.noreply.github.com>
Co-authored-by: Vivek Iyer Vaidyanathan Iyer <vvaidyanathan@linkedin.com>
Co-authored-by: Shaurya Chaturvedi <shauryachats@gmail.com>
Co-authored-by: Shaurya Chaturvedi <shauryachats@uber.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants