
[Concurrent Segment Search] Performance regression for multi-term aggs on high cardinality data #11584

Closed
jed326 opened this issue Dec 12, 2023 · 2 comments · Fixed by #11585
Assignees: jed326
Labels: enhancement, Search:Performance, v2.12.0, v3.0.0

jed326 (Collaborator) commented Dec 12, 2023

When using concurrent segment search we can see a significant latency increase on some aggregation types. This is related to the changes made as a part of #9085.

This can be reproduced with the OpenSearch Benchmark http-logs workload when comparing latency with concurrent segment search enabled and disabled on this query.

By default the OpenSearch Benchmark workload will force merge the index down to 1 segment before running the search benchmarks. In this setup we would expect little to no difference between the concurrent and non-concurrent search cases; however, that is not what we observe.

Some basic numbers gathered from the query Profiler:

| Metric | logs-181998, concurrent search disabled | logs-181998, slice=1 |
| --- | --- | --- |
| Took | 1688 ms | 4913 ms |
| ConstantScoreQuery — time_in_nanos | 118765435 (118.76544 ms) | 18212 (0.01821 ms) |
| rewrite_time | 11352 | 17096 |
| collector — time_in_nanos | 1314681471 (1314.68147 ms) | 1917904003 (1917.904 ms) |
| QueryCollectorManager — reduce_time_in_nanos | n/a | 2896314317 (2896.31432 ms) |
| MultiTermsAggregator — time_in_nanos | 1517556622 (1517.55662 ms) | 4378650218 (4378.65022 ms) |
| post_collection | 2013 | 2103 |
| build_leaf_collector | 51988 | 68327 |
| build_aggregation | 298446165 (298.44617 ms) | 2670044927 (2670.04493 ms) |
| collect | 1219053379 (1219.05338 ms) | 1708531105 (1708.53111 ms) |
| initialize | 3077 | 3756 |

We see almost a 200% increase in took time between the non-concurrent and concurrent cases, and close to a 10x increase in build_aggregation time, even though in this single-segment setup we expect these to be essentially the same operation.

Looking at the async-profiler output, we see that the query is actually spending 33% of its time in a PriorityQueue.pop call:
[Screenshot: async-profiler flame graph, captured 2023-12-11]

This is happening here:

for (int b = ordered.size() - 1; b >= 0; --b) {
    topBucketsPerOrd[ordIdx][b] = ordered.pop();
    otherDocCounts[ordIdx] -= topBucketsPerOrd[ordIdx][b].getDocCount();
}

Using Arthas, we can see that the problem is the priority queue ordered being created with size 503910, which means we call pop on the PriorityQueue that many times when creating the InternalAggregation object.
arthas-watch-concurrent.txt

This ultimately relates back to #9085, where we changed concurrent search to not enforce the shard_min_doc_count and shard_size parameters at the slice level, so during build_aggregation we are now creating a PriorityQueue of essentially unbounded size.
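
For intuition, here is a minimal sketch (plain java.util.PriorityQueue, illustrative names only, not the actual MultiTermsAggregator code) of why bounding the queue matters: with no bound, every distinct bucket on the slice (~500k here) gets heapified and popped during build_aggregation, while a bound keeps the queue at no more than slice_size entries regardless of cardinality.

```java
import java.util.PriorityQueue;

// Illustrative sketch only.
class BoundedTopBuckets {
    // Returns up to sliceSize doc counts, ordered from largest to smallest.
    static long[] topDocCounts(long[] bucketDocCounts, int sliceSize) {
        PriorityQueue<Long> queue = new PriorityQueue<>(); // min-heap on doc count
        for (long docCount : bucketDocCounts) {
            queue.offer(docCount);
            if (queue.size() > sliceSize) {
                queue.poll(); // evict the smallest so the heap never exceeds sliceSize
            }
        }
        // Drain smallest-first, filling the array from the back so the result
        // ends up sorted by descending doc count.
        long[] top = new long[queue.size()];
        for (int i = top.length - 1; i >= 0; i--) {
            top[i] = queue.poll();
        }
        return top;
    }
}
```

Without the size check, the final drain loop is essentially the ordered.pop() loop shown above, run over every bucket collected on the slice.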

I have 2 possible solutions for this, which I will detail below.

jed326 self-assigned this on Dec 12, 2023
jed326 added the enhancement and Search:Performance labels and removed the untriaged label on Dec 12, 2023
jed326 (Collaborator, Author) commented Dec 12, 2023

Solutions

1. Enforce a heuristic "slice_size" at the slice level (RECOMMENDED)

We can apply the same heuristic that is used to derive shard_size from size and use 1.5 * shard_size + 10 as the size bound for the concurrent search case. Both the size and shard_size parameters already need to be chosen to account for the cardinality of the data, regardless of concurrent vs. non-concurrent search; otherwise users will face the same issue of partial doc counts. By reusing this heuristic for the slice-level limit, users can still tune their results with the same shard_size parameter. This puts an upper bound on the priority queue and eliminates the latency problem. Some basic benchmarking data is below (note: I used slice_size = shard_size * 10 for these benchmarks instead of slice_size = 1.5 * shard_size + 10):

| Metric | logs-181998, concurrent search disabled | logs-181998, slice=2 | logs-181998, slice=2, slice_size heuristic applied |
| --- | --- | --- | --- |
| Took | 1901 ms | 7068 ms | 1404 ms |
| ConstantScoreQuery — time_in_nanos | 135889871 (135.88987 ms) | 12628 | 18181 |
| rewrite_time | 9358 | 9828 | 25639 |
| collector — time_in_nanos | 1515666717 (1515.66672 ms) | 884344233 (884.34423 ms) | 834765726 (834.76573 ms) |
| QueryCollectorManager — reduce_time_in_nanos | n/a | 6139270406 (6139.27041 ms) | 528577668 (528.57767 ms) |
| MultiTermsAggregator — time_in_nanos | 1704759679 (1704.75968 ms) | 3683312840 (3683.31284 ms) | 1014621634 (1014.62163 ms) |
| post_collection | 2341 | 115390928 (115.39093 ms) | 126565026 (126.56503 ms) |
| build_leaf_collector | 871828 (0.87183 ms) | 396289 (0.39629 ms) | 611366 (0.61137 ms) |
| build_aggregation | 310218243 (310.21824 ms) | 5571512376 (5571.51238 ms) | 527689677 (527.68968 ms) |
| collect | 1393664295 (1393.6643 ms) | 800511158 (800.51116 ms) | 743557534 (743.55753 ms) |
| initialize | 2972 | 184515 (0.18452 ms) | 296333 (0.29633 ms) |
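
For reference, a minimal sketch of the proposed heuristic (the helper name is hypothetical; the benchmarks above used slice_size = shard_size * 10 instead, and the actual change is in #11585):

```java
// Hypothetical helper: derives a slice-level bound from shard_size the same way
// shard_size is derived from size today (1.5 * size + 10).
final class SliceSizeHeuristic {
    static int suggestSliceSize(int shardSize) {
        // Overflow-safe cap for very large shard_size values.
        return (int) Math.min((long) (shardSize * 1.5) + 10, Integer.MAX_VALUE);
    }
}
```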

2. Merge aggregators before buildAggregation

In order to avoid eliminating any buckets at the slice level while also avoiding this performance bottleneck, we would need to merge the Aggregator objects before building the InternalAggregation objects in the reduce shown below:

@Override
public ReduceableSearchResult reduce(Collection<Collector> collectors) throws IOException {
    final List<Aggregator> aggregators = context.bucketCollectorProcessor().toAggregators(collectors);
    final List<InternalAggregation> internals = new ArrayList<>(aggregators.size());
    context.aggregations().resetBucketMultiConsumer();
    for (Aggregator aggregator : aggregators) {
        try {
            // post collection is called in ContextIndexSearcher after search on leaves are completed
            internals.add(aggregator.buildTopLevel());
        } catch (IOException e) {
            throw new AggregationExecutionException("Failed to build aggregation [" + aggregator.name() + "]", e);
        }
    }
    final InternalAggregations internalAggregations = InternalAggregations.from(internals);
    return buildAggregationResult(internalAggregations);
}
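
For illustration, a hypothetical version of this reduce with a slice-level merge might look roughly like the sketch below. Note that mergeWith does not exist today; it stands in for per-aggregator merge logic that each Aggregator type would need to implement (see the reasons below).

```java
// Sketch only: mergeWith is a hypothetical API, not something Aggregator exposes today.
@Override
public ReduceableSearchResult reduce(Collection<Collector> collectors) throws IOException {
    final List<Aggregator> aggregators = context.bucketCollectorProcessor().toAggregators(collectors);
    // Fold all slice-level aggregators into the first one (bucketOrds, doc counts,
    // sub-aggregator state, ...) before building a single InternalAggregation.
    final Aggregator merged = aggregators.get(0);
    for (int i = 1; i < aggregators.size(); i++) {
        merged.mergeWith(aggregators.get(i)); // hypothetical per-aggregator merge
    }
    final InternalAggregations internalAggregations = InternalAggregations.from(
        Collections.singletonList(merged.buildTopLevel())
    );
    return buildAggregationResult(internalAggregations);
}
```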

This is pretty challenging for 2 reasons.

  1. bucketOrds is implemented by each Aggregator object rather than in a parent class, so each Aggregator must implement its own merging logic. Moreover, depending on how sub-aggregations are supported for each aggregation type, it would not be sufficient to make this merge operation a no-op for unaffected Aggregators. This means we are more or less rewriting the entire slice-level reduce process (which I think is actually why we are re-using reduce in the first place). At a high level, today we do buildAggregation for the entire collector tree and then reduce for the entire collector tree; we would need to change this to do reduce and then buildAggregation for each level of the collector tree.
  2. There are both DFS and BFS collection modes for aggregators. For DFS mode there shouldn't be any problem with the merge, since all collection will be done before we reach buildAggregation. However, for BFS mode, depending on how the sub-aggregators implement collect, we could be in a situation where documents have not been collected yet by the time we are doing buildAggregation.

For these reasons I think solution 1 is a much better approach, and I will follow up with a CR with that implementation.

reta (Collaborator) commented Dec 12, 2023

> For these reasons I think solution 1 is a much better approach, and I will follow up with a CR with that implementation.

Thanks @jed326, I think this is a reasonable fix to try.

reta added the v2.12.0 and v3.0.0 labels on Dec 14, 2023