[Concurrent Segment Search] Mechanism to have custom slice calculation for each shard #7358

Closed
sohami opened this issue May 2, 2023 · 3 comments
Labels: distributed framework, enhancement


sohami commented May 2, 2023

Currently, Lucene provides a default mechanism for slice computation when the concurrent search model is used. It uses limits of 5 segments or 250K documents to assign segments to each slice. This slice computation defines the number of concurrent tasks with which the query phase will be executed. This task is to provide a mechanism to dynamically change the slice computation so that it can be customized for different workloads. For more context, see #6798.

sohami added the enhancement and untriaged labels on May 2, 2023
dbwiddis added the Indexing & Search and Search labels and removed the untriaged and Indexing & Search labels on May 12, 2023
macohen removed the Search label on May 15, 2023

sohami commented Jun 7, 2023

Created an issue in Lucene to support providing a custom SliceExecutor via extensions of IndexSearcher. Ref PR


sohami commented Jul 20, 2023

Background:

For concurrent segment search, Lucene by default computes the slices, i.e. the number of tasks that can be executed concurrently for a search request. The default Lucene implementation for slice computation uses limits of 5 segments and 250K documents (whichever is met first) to decide which segments are processed per slice, and hence the total number of slices that gets generated. For example, consider an index foo with 11 segments, where the first 5 segments have 300K documents each and the last 6 segments have 1K documents each. For this index, Lucene will compute a total of 7 slices: the first 5 slices will have 1 segment each (since each meets the 250K document count limit), the next 5 segments will go into slice 6 (since together they meet the segment count limit), and the last segment will be in its own slice.
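
To make the arithmetic above concrete, here is a small, self-contained Java sketch that approximates the grouping rule described (close a slice once it reaches 5 segments or 250K documents, whichever comes first). It is an illustration only, not Lucene's actual IndexSearcher code, and the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only: approximates the default grouping rule described
// above, not Lucene's actual slice computation code.
public class DefaultSliceExample {

    static List<List<Integer>> computeSlices(List<Integer> segmentDocCounts,
                                             int maxDocsPerSlice,
                                             int maxSegmentsPerSlice) {
        // Work on a copy sorted by doc count, largest segments first
        List<Integer> sorted = new ArrayList<>(segmentDocCounts);
        sorted.sort(Comparator.reverseOrder());

        List<List<Integer>> slices = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long docsInCurrent = 0;
        for (int docCount : sorted) {
            current.add(docCount);
            docsInCurrent += docCount;
            // Close the slice when either limit is reached
            if (current.size() >= maxSegmentsPerSlice || docsInCurrent >= maxDocsPerSlice) {
                slices.add(current);
                current = new ArrayList<>();
                docsInCurrent = 0;
            }
        }
        if (!current.isEmpty()) {
            slices.add(current);
        }
        return slices;
    }

    public static void main(String[] args) {
        // The "foo" index from the example: 5 segments of 300K docs, 6 segments of 1K docs
        List<Integer> segments = new ArrayList<>();
        for (int i = 0; i < 5; i++) segments.add(300_000);
        for (int i = 0; i < 6; i++) segments.add(1_000);

        List<List<Integer>> slices = computeSlices(segments, 250_000, 5);
        System.out.println("Slice count: " + slices.size()); // prints 7 for this input
    }
}
```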

Problem Statement:

From production usage we can see that the usual shard size of indices ranges between 20-40 GB. A shard of this size can have a segment count between 15-45 after the regular background merges are completed. Some segments will be larger than 1 GiB, a few will be in the hundreds of MiB, and the trailing ones will be in single-digit MiBs. Due to the large segment count and the large document count per segment, the number of slices computed by the default Lucene mechanism can be very large. For example, I used the nyc_taxis dataset to generate a single-shard index of size 23.9 GB with 19 segments, and the Lucene-computed slice count for this index is 15 per search request. Lucene tends to favor parallelizing a single shard as much as possible based on the above parameters, which can work well if a node hosts only that single shard or is not receiving multiple concurrent requests. Otherwise, each search request will have 15 slices to execute, which can create contention among the tasks across requests and increase tail latency.

Proposal:

To deal with the above problem, I am proposing a mechanism to control the maximum number of slices to be generated. We can start with a cluster-level setting to configure it. Based on the configured maximum, the segments for each request will be divided among the slices in round-robin fashion after being sorted by doc count (see the sketch below). This will help control the slices (or concurrency) per request depending on the workload and use case. For workloads where search request concurrency is already high, a lower slice count can be useful, whereas in a lower-concurrency environment a larger slice count can be useful. I ran some benchmarks based on this and the results are below. This mechanism will also be useful to control the resource usage per request by controlling the slice count (i.e. the parallelization).
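
For illustration, here is a hedged sketch of the proposed round-robin distribution (sort segments by doc count, then deal them out across at most N slices). The class and method names are hypothetical; this is not the PoC code itself.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the proposed slice computation: cap the number of slices and
// distribute segments round-robin after sorting them by document count.
public class RoundRobinSliceExample {

    static List<List<Integer>> computeSlices(List<Integer> segmentDocCounts, int maxSliceCount) {
        // Sort segments by doc count, largest first, so the big segments are
        // spread across different slices instead of piling into one.
        List<Integer> sorted = new ArrayList<>(segmentDocCounts);
        sorted.sort(Comparator.reverseOrder());

        int sliceCount = Math.min(maxSliceCount, sorted.size());
        List<List<Integer>> slices = new ArrayList<>();
        for (int i = 0; i < sliceCount; i++) {
            slices.add(new ArrayList<>());
        }
        // Round-robin assignment of segments to slices
        for (int i = 0; i < sorted.size(); i++) {
            slices.get(i % sliceCount).add(sorted.get(i));
        }
        return slices;
    }
}
```

With a maximum of 4 slices, the 19 segments of the nyc_taxis example above would be spread across 4 slices of 5, 5, 5, and 4 segments instead of the 15 slices Lucene computes by default.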

PoC implementation: https://github.com/sohami/OpenSearch/tree/cs-custom-slice

Caveats:

Lucene 9.7 does not have a mechanism to dynamically control the slice computation at the request level. To support that, I worked on apache/lucene#12374, which will be available in the next release. Until the next release is out, we can still provide the custom slice computation mechanism using a static setting. Once the change is in the next Lucene release, we should be able to let users control this slice count dynamically, and it can also be extended to an index setting. A rough sketch of what the static setting could look like follows.
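
For reference, a minimal sketch of the static cluster setting, assuming the standard OpenSearch Setting API; the setting key, default value, and class name are placeholders, not the final names.

```java
import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Setting.Property;

// Hypothetical sketch of how the slice cap could be exposed as a cluster setting.
public final class ConcurrentSearchSliceSettings {

    // Static (node-scoped) for now; could be marked Property.Dynamic once the
    // Lucene change allowing per-request slice computation is available.
    public static final Setting<Integer> MAX_SLICE_COUNT_SETTING = Setting.intSetting(
        "search.concurrent.max_slice_count",   // placeholder key
        0,                                     // placeholder default: 0 meaning "use Lucene default"
        0,                                     // minimum value
        Property.NodeScope
    );

    private ConcurrentSearchSliceSettings() {}
}
```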

Setup:

  • OSB with the nyc_taxis workload
  • 2 data nodes (r5.2xl) and 3 master nodes
  • 1 index with 1 shard and 0 replicas, so all requests go to a single node
  • 2 and 4 search_clients used to run the workload
  • I don’t have telemetry information since there is a bug with InMemoryStore (ref), but I have captured CPU usage from the EC2 instances
Run with 2 search clients

| Operation | Metric | Lucene default (15 slices) | Custom with slice 4 | Slice 4 vs Lucene default (-ve is better for latency and CPU) | Custom with slice 6 | Slice 6 vs Lucene default (-ve is better for latency and CPU) |
| --- | --- | --- | --- | --- | --- | --- |
| | CPU | 99% | 90% | -9% | 96% | -3% |
| | Slice Count | 15 | 4 | | 6 | |
| range | Mean Throughput | 16.61 | 14.71 | -11.43889 | 16.98 | 2.22757 |
| | Max Throughput | 17.15 | 15.15 | -11.66181 | 17.55 | 2.33236 |
| | 50th %ile (ms) | 107.447 | 124.56 | 15.92692 | 105.31 | -1.98889 |
| | 90th %ile (ms) | 119.523 | 132.813 | 11.1192 | 115.37 | -3.47465 |
| | 99th %ile (ms) | 143.473 | 150.612 | 4.97585 | 133.3 | -7.09053 |
| | 100th %ile (ms) | 155.585 | 157.593 | 1.29061 | 141.33 | -9.16219 |
| distance_amount_agg | Mean Throughput | 0.32 | 0.34 | 5.88235 | 0.33 | 3.125 |
| | Max Throughput | 0.32 | 0.35 | 8.57143 | 0.33 | 3.125 |
| | 50th %ile (ms) | 6195.98 | 5774.86 | -7.2923 | 5959.21 | -3.82135 |
| | 90th %ile (ms) | 6510.54 | 6059.66 | -7.44068 | 6262.92 | -3.80337 |
| | 99th %ile (ms) | 6610.33 | 6105.2 | -8.27377 | 6576.07 | -0.51828 |
| | 100th %ile (ms) | 6713.37 | 6106.55 | -9.9372 | 6980.08 | 3.97282 |
| autohisto_agg | Mean Throughput | 5.98 | 6.34 | 6.02007 | 6.01 | 0.50167 |
| | Max Throughput | 6.07 | 6.44 | 6.09555 | 6.14 | 1.15321 |
| | 50th %ile (ms) | 323.183 | 303.662 | -6.04023 | 321.118 | -0.63896 |
| | 90th %ile (ms) | 351.783 | 343.353 | -2.39636 | 351.96 | 0.05032 |
| | 99th %ile (ms) | 381.943 | 377.048 | -1.2816 | 379.68 | -0.5925 |
| | 100th %ile (ms) | 391.511 | 391.248 | -0.06718 | 405.72 | 3.62927 |
| date_histogram_agg | Mean Throughput | 5.76 | 6.28 | 9.02778 | 6.07 | 5.38194 |
| | Max Throughput | 5.82 | 6.31 | 8.41924 | 6.12 | 5.15464 |
| | 50th %ile (ms) | 337.074 | 323.668 | -3.97717 | 324.109 | -3.84634 |
| | 90th %ile (ms) | 380.65 | 349.947 | -8.06594 | 358.519 | -5.814 |
| | 99th %ile (ms) | 416.307 | 373.393 | -10.30826 | 389.995 | -6.32034 |
| | 100th %ile (ms) | 429.387 | 378.966 | -11.74255 | 396.56 | -7.64508 |

Run with 4 search clients

| Operation | Metric | CS Enabled with Lucene default (15 slices) | CS Enabled with slice 2 | Slice 2 vs Lucene default (% improvement) | CS Enabled with slice 4 | Slice 4 vs Lucene default (% improvement) |
| --- | --- | --- | --- | --- | --- | --- |
| | CPU | 99% | 96% | | 98% | |
| | Slice Count | 15 | 2 | | 4 | |
| range | Mean Throughput | 16.7 | 17.19 | 2.93413 | 17.01 | 1.85629 |
| | Max Throughput | 17.04 | 17.57 | 3.11033 | 17.5 | 2.69953 |
| | 50th %ile (ms) | 225.61 | 216.77 | -3.91827 | 217.25 | -3.70551 |
| | 90th %ile (ms) | 248.45 | 238.6 | -3.96458 | 257.618 | 3.69008 |
| | 99th %ile (ms) | 276.59 | 275.491 | -0.39734 | 303.279 | 9.6493 |
| | 100th %ile (ms) | 282.84 | 294.94 | 4.27804 | 339.97 | 20.1987 |
| distance_amount_agg | Mean Throughput | 0.29 | 0.36 | 24.13793 | 0.34 | 17.24138 |
| | Max Throughput | 0.29 | 0.36 | 24.13793 | 0.34 | 17.24138 |
| | 50th %ile (ms) | 13585.2 | 11138.5 | -18.01004 | 11837 | -12.86842 |
| | 90th %ile (ms) | 13674 | 11261.9 | -17.64005 | 12979.7 | -5.07752 |
| | 99th %ile (ms) | 13774.3 | 11322 | -17.80345 | 13911.3 | 0.99461 |
| | 100th %ile (ms) | 14014.8 | 11419.5 | -18.51828 | 14112.9 | 0.69997 |
| autohisto_agg | Mean Throughput | 7.11 | 7.88 | 10.82982 | 8.29 | 16.59634 |
| | Max Throughput | 7.22 | 8 | 10.80332 | 8.4 | 16.34349 |
| | 50th %ile (ms) | 542.34 | 489.69 | -9.70793 | 461.496 | -14.90652 |
| | 90th %ile (ms) | 564.62 | 544.13 | -3.62899 | 529.06 | -6.29804 |
| | 99th %ile (ms) | 594.76 | 574.69 | -3.37447 | 574.31 | -3.43836 |
| | 100th %ile (ms) | 612.36 | 611.02 | -0.21883 | 594.22 | -2.96231 |
| date_histogram_agg | Mean Throughput | 7.04 | 7.81 | 10.9375 | 8.06 | 14.48864 |
| | Max Throughput | 7.09 | 7.85 | 10.71932 | 8.11 | 14.38646 |
| | 50th %ile (ms) | 557.81 | 503.083 | -9.81105 | 487.88 | -12.53653 |
| | 90th %ile (ms) | 582.178 | 559.624 | -3.87407 | 548.33 | -5.81403 |
| | 99th %ile (ms) | 620.33 | 581.851 | -6.20299 | 610.53 | -1.5798 |
| | 100th %ile (ms) | 633.26 | 633.454 | 0.03064 | 627.16 | -0.96327 |

Next Steps:

Currently, I only see a FeatureFlag-based mechanism to read a static cluster setting value via a static method (please ignore the name for now). In the PoC I have created a similar class for this new search setting and was wondering if there is a better way to achieve this; a rough sketch of that pattern follows. Later, the plan is to convert the same setting to a dynamic one once the Lucene changes are available.
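
For discussion, here is a hypothetical sketch of that FeatureFlags-style pattern: the node Settings are captured once at bootstrap and the value is then exposed through a static accessor. The class, method, and setting names are placeholders, not the PoC code.

```java
import org.opensearch.common.settings.Settings;

// Hypothetical holder for the static search setting, mirroring the
// FeatureFlags-style initialization mentioned above.
public final class SearchBootstrapSettings {

    private static volatile Settings nodeSettings = Settings.EMPTY;

    // Called once during node startup with the node-level settings
    public static void initialize(Settings settings) {
        nodeSettings = settings;
    }

    public static int getMaxSliceCount() {
        // Placeholder key; falls back to 0 ("use Lucene default") when unset
        return nodeSettings.getAsInt("search.concurrent.max_slice_count", 0);
    }

    private SearchBootstrapSettings() {}
}
```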


sohami commented Jul 20, 2023

@reta @andrross I would like to get your feedback on this.
