[Query Resource Isolation] Additonal Sampling for Broker and Server#16164
[Query Resource Isolation] Additonal Sampling for Broker and Server#16164vvivekiyer merged 12 commits intoapache:masterfrom
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #16164 +/- ##
============================================
+ Coverage 62.90% 63.30% +0.40%
+ Complexity 1386 1363 -23
============================================
Files 2867 2976 +109
Lines 163354 172937 +9583
Branches 24952 26497 +1545
============================================
+ Hits 102755 109485 +6730
- Misses 52847 55083 +2236
- Partials 7752 8369 +617
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
vvivekiyer
left a comment
There was a problem hiding this comment.
@pchaganl In our deployment we noticed a few other misses (eg: Distinct Queries, PercentileTDigest). Can we make sure to add sampling there too?
| _brokerMetrics.addPhaseTiming(rawTableName, BrokerQueryPhase.QUERY_ROUTING, | ||
| routingEndTimeNs - routingStartTimeNs); | ||
| // Account the resource used for routing phase | ||
| Tracing.ThreadAccountantOps.sample(); |
There was a problem hiding this comment.
Let's add a sample at the end of every phase:
- Request Compilation
- Authorization
- Routing
Also add a comment here as to why sampling for routing is important - with single threaded segment pruners, this can take resources when there are a lot of segments.
There was a problem hiding this comment.
I have added for Request Compilation & Routing. I did consider for authorization and looked into our latency metric for authorization it is less than 0.1 ms . So avoided it
There was a problem hiding this comment.
(nit) Prefer moving the sampling to the location where we compute phase timing for each phase. That way it's evident that we want to sample each broker phase.
There was a problem hiding this comment.
Routing was closer to the phase calculation moved compilation as well
There was a problem hiding this comment.
Let's add authorization. The idea is that if there's a bug/deployment issue in this phase, we catch it.
| compileRequest(requestId, query, sqlNodeAndOptions, request, requesterIdentity, requestContext, httpHeaders, | ||
| accessControl); | ||
| // Accounts for resource usage of the compilation phase | ||
| Tracing.ThreadAccountantOps.sample(); |
There was a problem hiding this comment.
Can we also make this change for the other handlers?
There was a problem hiding this comment.
Currently it is added for SSE and MSE.
GrpcBrokerRequestHandler, SingleConnectionBrokerRequestHandler extend this handler for SSE. TimeSeriesRequestHandler seems to be custom handler.
For MSE it is added in QueryEnvironment
| public PlanNode makeSegmentPlanNode(SegmentContext segmentContext, QueryContext queryContext) { | ||
| rewriteQueryContextWithHints(queryContext, segmentContext.getIndexSegment()); | ||
| // Sample to track usage of query planning | ||
| Tracing.ThreadAccountantOps.sample(); |
There was a problem hiding this comment.
Can we use sampleAndCheckInterruption() in all these places to make sure the thread exists if it's already been interrupted?
There was a problem hiding this comment.
I see the usage of sampleAndCheckInterruption() mostly on broker on merge. I taught there was reason for doing that specifically on broker.
Is sampleAndCheckInterruption() a more evolved approach we want to move on all places on server as well?
There was a problem hiding this comment.
You should see it used on server as well in some of the operators. The only difference is one blindly samples. While the other checks for interruption and throws exception before sampling.
| } | ||
| RelRoot relRoot = compileQuery(queryNode, plannerContext); | ||
| // Accounts for resource usage of the compilation phase | ||
| Tracing.ThreadAccountantOps.sampleMSE(); |
There was a problem hiding this comment.
setupRunner is called much later before dispatch in MultistageBrokerRequestHandler. Two questions here:
- It will not work for multistage handlers.
- If this is for the single stage handler fallback path, is this really necessary if we already are sampling after the compilation phase?
There was a problem hiding this comment.
- You are right it was setup after compilation, moved it up, take a look if looks right?
- This path is touched by MSE handler so it is not added as a fallback
There was a problem hiding this comment.
It doesn't look right to me. The QueryEnv.compile() happens in an async call with a separate threadpool. We might have to find another way to account for these async calls in MSE.
| @Override | ||
| public PlanNode makeSegmentPlanNode(SegmentContext segmentContext, QueryContext queryContext) { | ||
| rewriteQueryContextWithHints(queryContext, segmentContext.getIndexSegment()); | ||
| // Sample to track usage of query planning |
There was a problem hiding this comment.
We need to track 2 things on the server right?
- Query Planning
- Segment Pruning
Can you make sure both are tracked?
There was a problem hiding this comment.
Added for Segment pruning
|
|
||
| try (QueryEnvironment.CompiledQuery compiledQuery = | ||
| compileQuery(requestId, query, sqlNodeAndOptions, httpHeaders, queryTimer)) { | ||
|
|
There was a problem hiding this comment.
(nit) remove spurious changes.
There was a problem hiding this comment.
This was because of a bad merge with master, fixed it
| public PlanNode makeSegmentPlanNode(SegmentContext segmentContext, QueryContext queryContext) { | ||
| rewriteQueryContextWithHints(queryContext, segmentContext.getIndexSegment()); | ||
| // Sample to track usage of query planning, since it can be expensive for large segment lists. | ||
| Tracing.ThreadAccountantOps.sampleAndCheckInterruption(); |
There was a problem hiding this comment.
Keep all the sampling in a single place without moving it inside the planning/pruning classes.. That way it's easier to reason about it.
For example: can we add both the pruning and planning sampling in ServerQueryExecutorV1Impl.
- Pruning sampling in
executeInternal()after the selectedSegmentsInfo is computed. - Planning in
executeQuerybefore theplanCombineQueryis called.
There was a problem hiding this comment.
I understand that consolidating the sampling into a single location can improve readability, but in this case it would come at the cost of precision—because you’d only capture metrics once, after the entire operation completes. For example, we have several pruner and ideally need to record a sample after each one to maintain accuracy (and the same applies to planning).
* [Query Resource Isolation] Workload Configs (apache#15109) * Workload Configs * workload config * Add API * config * Change config structure * Propagation strategy * Fix style check * Cost spliting on update * Table addition propagation * perf * Tests * test * test 2 * Review comments 1 * review comments 3 * review comments 3 * name change * review comments 4 * Fix TableDoesNotExistError for hybrid tables in MSE queries in controller API (apache#16102) * Make ThreadResourceUsageProvider a Helper/Utility Class. (apache#16051) * ThreadResourceUsageProvider is a helper class. ThreadResourceContext tracks resource usage. Fix updateConcurrently * Rename to ThreadResourceSnapshot * Clean up * Add javadoc * Done use auto closeable * Checkstyle * Fix compilation error * Add back removed functions in SPI * Remove private constructor because japicmp complains. * Add setThreadResourceUsageProvider because of backward-incompatible checks * Add setThreadResourceUsageProvider because of backward-incompatible checks * Fix test * Fix ThreadResourceSnapshot usage and tests * Store cpu sample in nanoseconds. * Reduce logs and improve logging when queries are terminated due to OOM. (apache#16172) * Dynamic PerQueryCPUMemAccountant Config on Servers (apache#16219) * Checkpoint * Register change handler * Fix bugs. Manually tested * Checkstyle * Tests * Add pre-check that values are default * Undo typo fix * Update QueryRunner to make use of window function overflow handling server configurations (apache#16108) * Add multistage thread limiting configs at the broker and server level (apache#16080) * Adding changes for supporting RLS (apache#16043) * Use stats cache on error instead of the chained mechanism (apache#15992) * Improve broker error messaging when broker is the one reporting the failure (apache#16076) * Introduce MSE active and passive timeouts (apache#16075) * Throttle SSE & MSE Tasks if Server heap usage is above a threshold (apache#16271) * Fix QueryScheduler constructor using class name. (apache#16280) * Fix QueryScheduler constructor using class name. * Fix test * [Query Resource Isolation] WorkloadBudgetManager and Host enforcement (apache#15798) * QRI - WorkloadBudgetManager implementation * Address review comments * Remove singleton & signature fix * Fix compatibility checker * Review comments * Move WorkloadBudgetManager to core. --------- Co-authored-by: praveenc7 <praveenkchaganlal@gmail.com> * Eliminate duplicate cancel attempts in PerQueryCPUMemAccountant (apache#16299) * Add basic 1 query tests * Add more tests * Add ability to remember cancel queries. * Clean up if conditions in killMostExpensiveQuery * Fix test failures. * Address review comments. * Use QueryCancelCallback to cancel queries from ThreadResourceUsageAccountant (apache#16142) * Remove all calls to System.gc() in PerQueryCPUMemAccountantFactory (apache#16374) * Initialize thread accountant just before serving queries (apache#16326) * Allow Reset of ThreadResourceUsageAccountant in Tracing.java (apache#16360) * Queries now self terminate if in panic mode. (apache#16380) * Queries now self terminate if in panic mode. * Add config test * Hard kill on critical level. * Fix configs * Separate anchor thread interruption. * Checkstyle * Review comments * remove code for critical level --------- Co-authored-by: Rajat Venkatesh <vrajat@users.noreply.github.com> * [Query Resource Isolation] Additonal Sampling for Broker and Server (apache#16164) * fix * sampling * Broker sampling * revert integ-test * Fix test failures * review comments * remove MSE * broker auth * remove per pruner & planner sample * Use Broker's accountant to sample in the request handler. (apache#16439) * [Query Resource Isolation] Workload Scheduler (apache#16018) * QRI - WorkloadBudgetManager implementation * Address review comments * scheduler * unit test * review comments: metrics, secondary, resource-manager * remove broker admission * Remove default budget --------- Co-authored-by: Vivek Iyer Vaidyanathan Iyer <vvaidyanathan@linkedin.com> * Cleanup deprecated methods in ThreadResourceUsageAccountant (apache#16479) * Remove unnecessary methods and config for ThreadResourceUsageAccountant (apache#16490) * Add tests for OOM Termination of MSE queries. (apache#16514) * Fix a flaky assert when testing OOM Cancellation of MSE Queries (apache#16533) * Disable Flaky Tests (apache#16554) This is a follow-up to apache#16533 The fix for a flaky test did not work. This PR disables these tests temporarily. * Use correlation ID instead of request id in PerQueryCpuMemAccountant (apache#16040) * [Query Resource Isolation]Interface for Workload Stats Collection (apache#16340) * Interface for Stats Collection * Additional comments * inherit * additional class comments * [Query Resource Isolation] Fix Refresh message (apache#16636) * Fix Refresh message * delete queryworkload message handler * info -> debug logs * reduce logging (apache#16698) * style check * [Query Workload Isolation] Cost-split support (apache#16672) * splits * Cost split * test * propagation entity change & java doc * Propagation scheme review comments * empty commit to trigger build * Reduce log for PerQueryCPUMemResourceUsageAccountant (apache#16642) --------- Co-authored-by: Rajat Venkatesh <1638298+vrajat@users.noreply.github.com> Co-authored-by: Yash Mayya <yash.mayya@gmail.com> Co-authored-by: Satwik Pachigolla <40644097+satwik-pachigolla@users.noreply.github.com> Co-authored-by: 9aman <35227405+9aman@users.noreply.github.com> Co-authored-by: Gonzalo Ortiz Jaureguizar <gortiz@users.noreply.github.com> Co-authored-by: Vivek Iyer Vaidyanathan <vvivekiyer@gmail.com> Co-authored-by: Xiaotian (Jackie) Jiang <17555551+Jackie-Jiang@users.noreply.github.com> Co-authored-by: Rajat Venkatesh <vrajat@users.noreply.github.com> Co-authored-by: Vivek Iyer Vaidyanathan Iyer <vvaidyanathan@linkedin.com>
* [Query Resource Isolation] Workload Configs (apache#15109) * Workload Configs * workload config * Add API * config * Change config structure * Propagation strategy * Fix style check * Cost spliting on update * Table addition propagation * perf * Tests * test * test 2 * Review comments 1 * review comments 3 * review comments 3 * name change * review comments 4 * Fix TableDoesNotExistError for hybrid tables in MSE queries in controller API (apache#16102) * Make ThreadResourceUsageProvider a Helper/Utility Class. (apache#16051) * ThreadResourceUsageProvider is a helper class. ThreadResourceContext tracks resource usage. Fix updateConcurrently * Rename to ThreadResourceSnapshot * Clean up * Add javadoc * Done use auto closeable * Checkstyle * Fix compilation error * Add back removed functions in SPI * Remove private constructor because japicmp complains. * Add setThreadResourceUsageProvider because of backward-incompatible checks * Add setThreadResourceUsageProvider because of backward-incompatible checks * Fix test * Fix ThreadResourceSnapshot usage and tests * Store cpu sample in nanoseconds. * Reduce logs and improve logging when queries are terminated due to OOM. (apache#16172) * Dynamic PerQueryCPUMemAccountant Config on Servers (apache#16219) * Checkpoint * Register change handler * Fix bugs. Manually tested * Checkstyle * Tests * Add pre-check that values are default * Undo typo fix * Update QueryRunner to make use of window function overflow handling server configurations (apache#16108) * Add multistage thread limiting configs at the broker and server level (apache#16080) * Adding changes for supporting RLS (apache#16043) * Use stats cache on error instead of the chained mechanism (apache#15992) * Improve broker error messaging when broker is the one reporting the failure (apache#16076) * Introduce MSE active and passive timeouts (apache#16075) * Throttle SSE & MSE Tasks if Server heap usage is above a threshold (apache#16271) * Fix QueryScheduler constructor using class name. (apache#16280) * Fix QueryScheduler constructor using class name. * Fix test * [Query Resource Isolation] WorkloadBudgetManager and Host enforcement (apache#15798) * QRI - WorkloadBudgetManager implementation * Address review comments * Remove singleton & signature fix * Fix compatibility checker * Review comments * Move WorkloadBudgetManager to core. --------- Co-authored-by: praveenc7 <praveenkchaganlal@gmail.com> * Eliminate duplicate cancel attempts in PerQueryCPUMemAccountant (apache#16299) * Add basic 1 query tests * Add more tests * Add ability to remember cancel queries. * Clean up if conditions in killMostExpensiveQuery * Fix test failures. * Address review comments. * Use QueryCancelCallback to cancel queries from ThreadResourceUsageAccountant (apache#16142) * Remove all calls to System.gc() in PerQueryCPUMemAccountantFactory (apache#16374) * Initialize thread accountant just before serving queries (apache#16326) * Allow Reset of ThreadResourceUsageAccountant in Tracing.java (apache#16360) * Queries now self terminate if in panic mode. (apache#16380) * Queries now self terminate if in panic mode. * Add config test * Hard kill on critical level. * Fix configs * Separate anchor thread interruption. * Checkstyle * Review comments * remove code for critical level --------- Co-authored-by: Rajat Venkatesh <vrajat@users.noreply.github.com> * [Query Resource Isolation] Additonal Sampling for Broker and Server (apache#16164) * fix * sampling * Broker sampling * revert integ-test * Fix test failures * review comments * remove MSE * broker auth * remove per pruner & planner sample * Use Broker's accountant to sample in the request handler. (apache#16439) * [Query Resource Isolation] Workload Scheduler (apache#16018) * QRI - WorkloadBudgetManager implementation * Address review comments * scheduler * unit test * review comments: metrics, secondary, resource-manager * remove broker admission * Remove default budget --------- Co-authored-by: Vivek Iyer Vaidyanathan Iyer <vvaidyanathan@linkedin.com> * Cleanup deprecated methods in ThreadResourceUsageAccountant (apache#16479) * Remove unnecessary methods and config for ThreadResourceUsageAccountant (apache#16490) * Add tests for OOM Termination of MSE queries. (apache#16514) * Fix a flaky assert when testing OOM Cancellation of MSE Queries (apache#16533) * Disable Flaky Tests (apache#16554) This is a follow-up to apache#16533 The fix for a flaky test did not work. This PR disables these tests temporarily. * Use correlation ID instead of request id in PerQueryCpuMemAccountant (apache#16040) * [Query Resource Isolation]Interface for Workload Stats Collection (apache#16340) * Interface for Stats Collection * Additional comments * inherit * additional class comments * [Query Resource Isolation] Fix Refresh message (apache#16636) * Fix Refresh message * delete queryworkload message handler * info -> debug logs * reduce logging (apache#16698) * style check * [Query Workload Isolation] Cost-split support (apache#16672) * splits * Cost split * test * propagation entity change & java doc * Propagation scheme review comments * empty commit to trigger build * Reduce log for PerQueryCPUMemResourceUsageAccountant (apache#16642) * [refactor] Switching to RoutingManager for broker request handlers (apache#16442) Co-authored-by: Shaurya Chaturvedi <shauryachats@uber.com> * Fix broker request id generator to avoid generating same id (apache#16661) * Introduce QueryExecutionContext to manage query life cycle (apache#16728) * Introduce QueryExecutionContext to manage query life cycle 2 (apache#16728) --------- Co-authored-by: Rajat Venkatesh <1638298+vrajat@users.noreply.github.com> Co-authored-by: Yash Mayya <yash.mayya@gmail.com> Co-authored-by: Satwik Pachigolla <40644097+satwik-pachigolla@users.noreply.github.com> Co-authored-by: 9aman <35227405+9aman@users.noreply.github.com> Co-authored-by: Gonzalo Ortiz Jaureguizar <gortiz@users.noreply.github.com> Co-authored-by: Vivek Iyer Vaidyanathan <vvivekiyer@gmail.com> Co-authored-by: Xiaotian (Jackie) Jiang <17555551+Jackie-Jiang@users.noreply.github.com> Co-authored-by: Rajat Venkatesh <vrajat@users.noreply.github.com> Co-authored-by: Vivek Iyer Vaidyanathan Iyer <vvaidyanathan@linkedin.com> Co-authored-by: Shaurya Chaturvedi <shauryachats@gmail.com> Co-authored-by: Shaurya Chaturvedi <shauryachats@uber.com>
Summary
Query-Resource-Isolation (QRI) relies on periodic CPU / memory samples to detect “run-away” queries and enforce per-workload budgets.
In several incidents we found that no samples are collected during three cost-intensive phases, so large queries sometimes survive far longer than the budget window:
This PR addresses those gaps by introducing lightweight sampling hooks at the end of each phase on the Broker side.
On the Server, sampling is added at the per-segment plan level.
These hooks are enabled by default since the overhead is minimal:
• On the Broker, sampling incurs only a single MXBean call.
• On the Server, sampling is bounded by the number of segments planned. For valid queries, P95th of queries typically involves fewer than 100 segments per server, with a P50 around 45 segments.
Testing Done
Manual quick start and integ test to validate if sampling is happening during this path.