
[Concurrent Segment Search] Performance regression for multi-term aggs on high cardinality data #11584

Closed
jed326 opened this issue Dec 12, 2023 · 2 comments · Fixed by #11585
Assignees: jed326
Labels: enhancement, Search:Performance, v2.12.0, v3.0.0

jed326 (Collaborator) commented Dec 12, 2023

When using concurrent segment search we can see a significant latency increase on some aggregation types. This is related to the changes made as a part of #9085.

This can be reproduced with the OpenSearch Benchmark http-logs workload when comparing latency with concurrent segment search enabled and disabled on this query.

By default the OpenSearch Benchmark workload will force merge the index down to 1 segment before running the search benchmarks. In this setup we would expect little to no difference between the concurrent and non-concurrent search cases; however, that is not what we observe.

Some basic numbers gathered from the query Profiler:

| Metric | logs-181998, concurrent search disabled | logs-181998, slice=1 |
| --- | --- | --- |
| Took | 1688 ms | 4913 ms |
| ConstantScoreQuery — time_in_nanos | 118765435 (118.76544 ms) | 18212 (0.01821 ms) |
| rewrite_time | 11352 | 17096 |
| collector — time_in_nanos | 1314681471 (1314.68147 ms) | 1917904003 (1917.904 ms) |
| QueryCollectorManager — reduce_time_in_nanos | n/a | 2896314317 (2896.31432 ms) |
| MultiTermsAggregator — time_in_nanos | 1517556622 (1517.55662 ms) | 4378650218 (4378.65022 ms) |
| post_collection | 2013 | 2103 |
| build_leaf_collector | 51988 | 68327 |
| build_aggregation | 298446165 (298.44617 ms) | 2670044927 (2670.04493 ms) |
| collect | 1219053379 (1219.05338 ms) | 1708531105 (1708.53111 ms) |
| initialize | 3077 | 3756 |

We see almost a 200% increase in took time between the non-concurrent and concurrent cases, and close to a 10x increase in build_aggregation time, even though in this single-segment setup we expect these to be essentially the same operation.

Looking at the async-profiler output, we see that the query is actually spending 33% of its time in a PriorityQueue.pop call:
[Screenshot: async-profiler flame graph, captured 2023-12-11]

This is happening here:

for (int b = ordered.size() - 1; b >= 0; --b) {
    topBucketsPerOrd[ordIdx][b] = ordered.pop();
    otherDocCounts[ordIdx] -= topBucketsPerOrd[ordIdx][b].getDocCount();
}

Using Arthas, we can see that the problem is the priority queue ordered being created with size 503910, which means we call pop on the PriorityQueue that many times when creating the InternalAggregation object.
arthas-watch-concurrent.txt

This ultimately relates back to #9085, where we changed concurrent search to not enforce the shard_min_doc_count and shard_size parameters at the slice level, so during build_aggregation we are now creating a PriorityQueue of essentially unbounded size.
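
For intuition, here is a minimal sketch (plain java.util.PriorityQueue, illustrative names only, not the actual MultiTermsAggregator code) of why bounding the queue matters: with no bound, every distinct bucket on the slice (~500k here) gets heapified and popped during build_aggregation, while a bound keeps the queue at no more than slice_size entries regardless of cardinality.

```java
import java.util.PriorityQueue;

// Illustrative sketch only.
class BoundedTopBuckets {
    // Returns up to sliceSize doc counts, ordered from largest to smallest.
    static long[] topDocCounts(long[] bucketDocCounts, int sliceSize) {
        PriorityQueue<Long> queue = new PriorityQueue<>(); // min-heap on doc count
        for (long docCount : bucketDocCounts) {
            queue.offer(docCount);
            if (queue.size() > sliceSize) {
                queue.poll(); // evict the smallest so the heap never exceeds sliceSize
            }
        }
        // Drain smallest-first, filling the array from the back so the result
        // ends up sorted by descending doc count.
        long[] top = new long[queue.size()];
        for (int i = top.length - 1; i >= 0; i--) {
            top[i] = queue.poll();
        }
        return top;
    }
}
```

Without the size check, the final drain loop is essentially the ordered.pop() loop shown above, run over every bucket collected on the slice.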

I have 2 possible solutions for this, which I will detail below.

jed326 self-assigned this on Dec 12, 2023
jed326 added the enhancement and Search:Performance labels and removed the untriaged label on Dec 12, 2023
jed326 (Collaborator, Author) commented Dec 12, 2023

Solutions

1. Enforce a heuristic "slice_size" at the slice level (RECOMMENDED)

We can apply the same heuristic that is used to derive shard_size from size and use 1.5 * shard_size + 10 as the size bound for the concurrent search case. Both the size and shard_size parameters already need to be chosen to account for the cardinality of the data, regardless of concurrent vs. non-concurrent search; otherwise users will face the same issue of partial doc counts. By reusing this heuristic for the slice-level limit, users can still tune their results with the same shard_size parameter. This puts an upper bound on the priority queue and eliminates the latency problem. Some basic benchmarking data is below (note: I used slice_size = shard_size * 10 for these benchmarks instead of slice_size = 1.5 * shard_size + 10):

| Metric | logs-181998, concurrent search disabled | logs-181998, slice=2 | logs-181998, slice=2, slice_size heuristic applied |
| --- | --- | --- | --- |
| Took | 1901 ms | 7068 ms | 1404 ms |
| ConstantScoreQuery — time_in_nanos | 135889871 (135.88987 ms) | 12628 | 18181 |
| rewrite_time | 9358 | 9828 | 25639 |
| collector — time_in_nanos | 1515666717 (1515.66672 ms) | 884344233 (884.34423 ms) | 834765726 (834.76573 ms) |
| QueryCollectorManager — reduce_time_in_nanos | n/a | 6139270406 (6139.27041 ms) | 528577668 (528.57767 ms) |
| MultiTermsAggregator — time_in_nanos | 1704759679 (1704.75968 ms) | 3683312840 (3683.31284 ms) | 1014621634 (1014.62163 ms) |
| post_collection | 2341 | 115390928 (115.39093 ms) | 126565026 (126.56503 ms) |
| build_leaf_collector | 871828 (0.87183 ms) | 396289 (0.39629 ms) | 611366 (0.61137 ms) |
| build_aggregation | 310218243 (310.21824 ms) | 5571512376 (5571.51238 ms) | 527689677 (527.68968 ms) |
| collect | 1393664295 (1393.6643 ms) | 800511158 (800.51116 ms) | 743557534 (743.55753 ms) |
| initialize | 2972 | 184515 (0.18452 ms) | 296333 (0.29633 ms) |
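
For reference, a minimal sketch of the proposed heuristic (the helper name is hypothetical; the benchmarks above used slice_size = shard_size * 10 instead, and the actual change is in #11585):

```java
// Hypothetical helper: derives a slice-level bound from shard_size the same way
// shard_size is derived from size today (1.5 * size + 10).
final class SliceSizeHeuristic {
    static int suggestSliceSize(int shardSize) {
        // Overflow-safe cap for very large shard_size values.
        return (int) Math.min((long) (shardSize * 1.5) + 10, Integer.MAX_VALUE);
    }
}
```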

2. Merge aggregators before buildAggregation

In order to avoid eliminating any buckets at the slice level while also avoiding this performance bottleneck, we would need to merge the Aggregator objects before building the InternalAggregation objects in the reduce shown below:

@Override
public ReduceableSearchResult reduce(Collection<Collector> collectors) throws IOException {
    final List<Aggregator> aggregators = context.bucketCollectorProcessor().toAggregators(collectors);
    final List<InternalAggregation> internals = new ArrayList<>(aggregators.size());
    context.aggregations().resetBucketMultiConsumer();
    for (Aggregator aggregator : aggregators) {
        try {
            // post collection is called in ContextIndexSearcher after search on leaves are completed
            internals.add(aggregator.buildTopLevel());
        } catch (IOException e) {
            throw new AggregationExecutionException("Failed to build aggregation [" + aggregator.name() + "]", e);
        }
    }
    final InternalAggregations internalAggregations = InternalAggregations.from(internals);
    return buildAggregationResult(internalAggregations);
}
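
For illustration, a hypothetical version of this reduce with a slice-level merge might look roughly like the sketch below. Note that mergeWith does not exist today; it stands in for per-aggregator merge logic that each Aggregator type would need to implement (see the reasons below).

```java
// Sketch only: mergeWith is a hypothetical API, not something Aggregator exposes today.
@Override
public ReduceableSearchResult reduce(Collection<Collector> collectors) throws IOException {
    final List<Aggregator> aggregators = context.bucketCollectorProcessor().toAggregators(collectors);
    // Fold all slice-level aggregators into the first one (bucketOrds, doc counts,
    // sub-aggregator state, ...) before building a single InternalAggregation.
    final Aggregator merged = aggregators.get(0);
    for (int i = 1; i < aggregators.size(); i++) {
        merged.mergeWith(aggregators.get(i)); // hypothetical per-aggregator merge
    }
    final InternalAggregations internalAggregations = InternalAggregations.from(
        Collections.singletonList(merged.buildTopLevel())
    );
    return buildAggregationResult(internalAggregations);
}
```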

This is pretty challenging for 2 reasons.

  1. bucketOrds is implemented by each Aggregator object rather than in a parent class, so each Aggregator must implement its own merging logic. Moreover, depending on how sub-aggregations are supported for each aggregation type, it would not be sufficient to make this merge operation a no-op for unaffected Aggregators. This means we are more or less rewriting the entire slice-level reduce process (which I think is actually why we are re-using reduce in the first place). At a high level, today we do buildAggregation for the entire collector tree and then reduce for the entire collector tree; we would need to change this to do reduce and then buildAggregation for each level of the collector tree.
  2. There are both DFS and BFS collection modes for aggregators. For DFS mode there shouldn't be any problem with the merge, since all collection will be done before we reach buildAggregation. However, for BFS mode, depending on how the sub-aggregators implement collect, we could be in a situation where documents have not been collected yet by the time we are doing buildAggregation.

For these reasons I think solution 1 is a much better approach, and I will follow up with a CR with that implementation.

reta (Collaborator) commented Dec 12, 2023

> For these reasons I think solution 1 is a much better approach, and I will follow up with a CR with that implementation.

Thanks @jed326, I think this is a reasonable fix to try.

reta added the v2.12.0 and v3.0.0 labels on Dec 14, 2023