[Concurrent Segment Search] Mechanism to have custom slice calculation for each shard #7358

Closed
sohami opened this issue May 2, 2023 · 3 comments
Labels: distributed framework, enhancement


sohami commented May 2, 2023

Currently, Lucene provides a default mechanism for slice computation when the concurrent search model is used. It uses limits of 5 segments or 250K documents to assign segments to each slice. This slice computation defines the number of concurrent tasks with which the query phase will be executed. This task is to provide a mechanism to dynamically change the slice computation so that it can be customized for different workloads. For more context, see #6798.

sohami added the enhancement and untriaged labels on May 2, 2023
dbwiddis added the Indexing & Search and Search labels and removed the untriaged and Indexing & Search labels on May 12, 2023
macohen removed the Search label on May 15, 2023

sohami commented Jun 7, 2023

Created an issue in Lucene to support providing a custom SliceExecutor via extensions of IndexSearcher. Ref PR


sohami commented Jul 20, 2023

Background:

For concurrent segment search, Lucene by default computes the slices, i.e. the number of tasks that can be executed concurrently for a search request. The default Lucene implementation for slice computation uses limits of 5 segments and 250K documents (whichever is met first) to decide which segments are processed per slice, and hence the total number of slices that gets generated. For example, consider an index foo with 11 segments, where the first 5 segments have 300K documents each and the last 6 segments have 1K documents each. For this index, Lucene will compute a total of 7 slices: the first 5 slices will have 1 segment each (since each meets the 250K document count limit), the next 5 segments will go into slice 6 (since together they meet the segment count limit), and the last segment will be in its own slice.
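
To make the arithmetic above concrete, here is a small, self-contained Java sketch that approximates the grouping rule described (close a slice once it reaches 5 segments or 250K documents, whichever comes first). It is an illustration only, not Lucene's actual IndexSearcher code, and the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only: approximates the default grouping rule described
// above, not Lucene's actual slice computation code.
public class DefaultSliceExample {

    static List<List<Integer>> computeSlices(List<Integer> segmentDocCounts,
                                             int maxDocsPerSlice,
                                             int maxSegmentsPerSlice) {
        // Work on a copy sorted by doc count, largest segments first
        List<Integer> sorted = new ArrayList<>(segmentDocCounts);
        sorted.sort(Comparator.reverseOrder());

        List<List<Integer>> slices = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long docsInCurrent = 0;
        for (int docCount : sorted) {
            current.add(docCount);
            docsInCurrent += docCount;
            // Close the slice when either limit is reached
            if (current.size() >= maxSegmentsPerSlice || docsInCurrent >= maxDocsPerSlice) {
                slices.add(current);
                current = new ArrayList<>();
                docsInCurrent = 0;
            }
        }
        if (!current.isEmpty()) {
            slices.add(current);
        }
        return slices;
    }

    public static void main(String[] args) {
        // The "foo" index from the example: 5 segments of 300K docs, 6 segments of 1K docs
        List<Integer> segments = new ArrayList<>();
        for (int i = 0; i < 5; i++) segments.add(300_000);
        for (int i = 0; i < 6; i++) segments.add(1_000);

        List<List<Integer>> slices = computeSlices(segments, 250_000, 5);
        System.out.println("Slice count: " + slices.size()); // prints 7 for this input
    }
}
```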

Problem Statement:

From production usage we can see that the usual shard size of indices ranges between 20-40 GB. A shard of this size can have a segment count between 15-45 after the regular background merges are completed. Some segments will be larger than 1 GiB, a few will be in the hundreds of MiB, and the trailing ones will be in single-digit MiBs. Due to the large segment count and the large document count per segment, the number of slices computed by the default Lucene mechanism can be very large. For example, I used the nyc_taxis dataset to generate a single-shard index of size 23.9 GB with 19 segments, and the Lucene-computed slice count for this index is 15 per search request. Lucene tends to favor parallelizing a single shard as much as possible based on the above parameters, which can work well if a node hosts only that single shard or is not receiving multiple concurrent requests. Otherwise, each search request will have 15 slices to execute, which can create contention among the tasks across requests and increase tail latency.

Proposal:

To deal with the above problem, I am proposing a mechanism to control the maximum number of slices to be generated. We can start with a cluster-level setting to configure it. Based on the configured maximum, the segments for each request will be divided among the slices in round-robin fashion after being sorted by doc count (see the sketch below). This will help control the slices (or concurrency) per request depending on the workload and use case. For workloads where search request concurrency is already high, a lower slice count can be useful, whereas in a lower-concurrency environment a larger slice count can be useful. I ran some benchmarks based on this and the results are below. This mechanism will also be useful to control the resource usage per request by controlling the slice count (i.e. the parallelization).
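
For illustration, here is a hedged sketch of the proposed round-robin distribution (sort segments by doc count, then deal them out across at most N slices). The class and method names are hypothetical; this is not the PoC code itself.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the proposed slice computation: cap the number of slices and
// distribute segments round-robin after sorting them by document count.
public class RoundRobinSliceExample {

    static List<List<Integer>> computeSlices(List<Integer> segmentDocCounts, int maxSliceCount) {
        // Sort segments by doc count, largest first, so the big segments are
        // spread across different slices instead of piling into one.
        List<Integer> sorted = new ArrayList<>(segmentDocCounts);
        sorted.sort(Comparator.reverseOrder());

        int sliceCount = Math.min(maxSliceCount, sorted.size());
        List<List<Integer>> slices = new ArrayList<>();
        for (int i = 0; i < sliceCount; i++) {
            slices.add(new ArrayList<>());
        }
        // Round-robin assignment of segments to slices
        for (int i = 0; i < sorted.size(); i++) {
            slices.get(i % sliceCount).add(sorted.get(i));
        }
        return slices;
    }
}
```

With a maximum of 4 slices, the 19 segments of the nyc_taxis example above would be spread across 4 slices of 5, 5, 5, and 4 segments instead of the 15 slices Lucene computes by default.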

PoC implementation: https://github.com/sohami/OpenSearch/tree/cs-custom-slice

Caveats:

Lucene 9.7 does not have a mechanism to dynamically control the slice computation at the request level. To support that, I worked on apache/lucene#12374, which will be available in the next release. Until the next release is out, we can still provide the custom slice computation mechanism using a static setting. Once the change is in the next Lucene release, we should be able to let users control this slice count dynamically, and it can also be extended to an index setting. A rough sketch of what the static setting could look like follows.
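
For reference, a minimal sketch of the static cluster setting, assuming the standard OpenSearch Setting API; the setting key, default value, and class name are placeholders, not the final names.

```java
import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Setting.Property;

// Hypothetical sketch of how the slice cap could be exposed as a cluster setting.
public final class ConcurrentSearchSliceSettings {

    // Static (node-scoped) for now; could be marked Property.Dynamic once the
    // Lucene change allowing per-request slice computation is available.
    public static final Setting<Integer> MAX_SLICE_COUNT_SETTING = Setting.intSetting(
        "search.concurrent.max_slice_count",   // placeholder key
        0,                                     // placeholder default: 0 meaning "use Lucene default"
        0,                                     // minimum value
        Property.NodeScope
    );

    private ConcurrentSearchSliceSettings() {}
}
```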

Setup:

  • OSB with the nyc_taxis workload
  • 2 data nodes (r5.2xl) and 3 master nodes
  • 1 index with 1 shard and 0 replicas, so all requests go to a single node
  • 2 and 4 search_clients used to run the workload
  • I don’t have telemetry information since there is a bug with InMemoryStore (ref), but I have captured CPU usage from the EC2 instances
Run with 2 search clients

| Operation | Metric | Lucene default (15 slices) | Custom with slice 4 | Slice 4 vs Lucene default (-ve is better for latency and CPU) | Custom with slice 6 | Slice 6 vs Lucene default (-ve is better for latency and CPU) |
| --- | --- | --- | --- | --- | --- | --- |
| | CPU | 99% | 90% | -9% | 96% | -3% |
| | Slice Count | 15 | 4 | | 6 | |
| range | Mean Throughput | 16.61 | 14.71 | -11.43889 | 16.98 | 2.22757 |
| | Max Throughput | 17.15 | 15.15 | -11.66181 | 17.55 | 2.33236 |
| | 50th %ile (ms) | 107.447 | 124.56 | 15.92692 | 105.31 | -1.98889 |
| | 90th %ile (ms) | 119.523 | 132.813 | 11.1192 | 115.37 | -3.47465 |
| | 99th %ile (ms) | 143.473 | 150.612 | 4.97585 | 133.3 | -7.09053 |
| | 100th %ile (ms) | 155.585 | 157.593 | 1.29061 | 141.33 | -9.16219 |
| distance_amount_agg | Mean Throughput | 0.32 | 0.34 | 5.88235 | 0.33 | 3.125 |
| | Max Throughput | 0.32 | 0.35 | 8.57143 | 0.33 | 3.125 |
| | 50th %ile (ms) | 6195.98 | 5774.86 | -7.2923 | 5959.21 | -3.82135 |
| | 90th %ile (ms) | 6510.54 | 6059.66 | -7.44068 | 6262.92 | -3.80337 |
| | 99th %ile (ms) | 6610.33 | 6105.2 | -8.27377 | 6576.07 | -0.51828 |
| | 100th %ile (ms) | 6713.37 | 6106.55 | -9.9372 | 6980.08 | 3.97282 |
| autohisto_agg | Mean Throughput | 5.98 | 6.34 | 6.02007 | 6.01 | 0.50167 |
| | Max Throughput | 6.07 | 6.44 | 6.09555 | 6.14 | 1.15321 |
| | 50th %ile (ms) | 323.183 | 303.662 | -6.04023 | 321.118 | -0.63896 |
| | 90th %ile (ms) | 351.783 | 343.353 | -2.39636 | 351.96 | 0.05032 |
| | 99th %ile (ms) | 381.943 | 377.048 | -1.2816 | 379.68 | -0.5925 |
| | 100th %ile (ms) | 391.511 | 391.248 | -0.06718 | 405.72 | 3.62927 |
| date_histogram_agg | Mean Throughput | 5.76 | 6.28 | 9.02778 | 6.07 | 5.38194 |
| | Max Throughput | 5.82 | 6.31 | 8.41924 | 6.12 | 5.15464 |
| | 50th %ile (ms) | 337.074 | 323.668 | -3.97717 | 324.109 | -3.84634 |
| | 90th %ile (ms) | 380.65 | 349.947 | -8.06594 | 358.519 | -5.814 |
| | 99th %ile (ms) | 416.307 | 373.393 | -10.30826 | 389.995 | -6.32034 |
| | 100th %ile (ms) | 429.387 | 378.966 | -11.74255 | 396.56 | -7.64508 |

Run with 4 search clients

| Operation | Metric | CS Enabled with Lucene default (15 slices) | CS Enabled with slice 2 | Slice 2 vs Lucene default (% improvement) | CS Enabled with slice 4 | Slice 4 vs Lucene default (% improvement) |
| --- | --- | --- | --- | --- | --- | --- |
| | CPU | 99% | 96% | | 98% | |
| | Slice Count | 15 | 2 | | 4 | |
| range | Mean Throughput | 16.7 | 17.19 | 2.93413 | 17.01 | 1.85629 |
| | Max Throughput | 17.04 | 17.57 | 3.11033 | 17.5 | 2.69953 |
| | 50th %ile (ms) | 225.61 | 216.77 | -3.91827 | 217.25 | -3.70551 |
| | 90th %ile (ms) | 248.45 | 238.6 | -3.96458 | 257.618 | 3.69008 |
| | 99th %ile (ms) | 276.59 | 275.491 | -0.39734 | 303.279 | 9.6493 |
| | 100th %ile (ms) | 282.84 | 294.94 | 4.27804 | 339.97 | 20.1987 |
| distance_amount_agg | Mean Throughput | 0.29 | 0.36 | 24.13793 | 0.34 | 17.24138 |
| | Max Throughput | 0.29 | 0.36 | 24.13793 | 0.34 | 17.24138 |
| | 50th %ile (ms) | 13585.2 | 11138.5 | -18.01004 | 11837 | -12.86842 |
| | 90th %ile (ms) | 13674 | 11261.9 | -17.64005 | 12979.7 | -5.07752 |
| | 99th %ile (ms) | 13774.3 | 11322 | -17.80345 | 13911.3 | 0.99461 |
| | 100th %ile (ms) | 14014.8 | 11419.5 | -18.51828 | 14112.9 | 0.69997 |
| autohisto_agg | Mean Throughput | 7.11 | 7.88 | 10.82982 | 8.29 | 16.59634 |
| | Max Throughput | 7.22 | 8 | 10.80332 | 8.4 | 16.34349 |
| | 50th %ile (ms) | 542.34 | 489.69 | -9.70793 | 461.496 | -14.90652 |
| | 90th %ile (ms) | 564.62 | 544.13 | -3.62899 | 529.06 | -6.29804 |
| | 99th %ile (ms) | 594.76 | 574.69 | -3.37447 | 574.31 | -3.43836 |
| | 100th %ile (ms) | 612.36 | 611.02 | -0.21883 | 594.22 | -2.96231 |
| date_histogram_agg | Mean Throughput | 7.04 | 7.81 | 10.9375 | 8.06 | 14.48864 |
| | Max Throughput | 7.09 | 7.85 | 10.71932 | 8.11 | 14.38646 |
| | 50th %ile (ms) | 557.81 | 503.083 | -9.81105 | 487.88 | -12.53653 |
| | 90th %ile (ms) | 582.178 | 559.624 | -3.87407 | 548.33 | -5.81403 |
| | 99th %ile (ms) | 620.33 | 581.851 | -6.20299 | 610.53 | -1.5798 |
| | 100th %ile (ms) | 633.26 | 633.454 | 0.03064 | 627.16 | -0.96327 |

Next Steps:

Currently, I only see a FeatureFlag-based mechanism to read a static cluster setting value via a static method (please ignore the name for now). In the PoC I have created a similar class for this new search setting and was wondering if there is a better way to achieve this; a rough sketch of that pattern follows. Later, the plan is to convert the same setting to a dynamic one once the Lucene changes are available.
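
For discussion, here is a hypothetical sketch of that FeatureFlags-style pattern: the node Settings are captured once at bootstrap and the value is then exposed through a static accessor. The class, method, and setting names are placeholders, not the PoC code.

```java
import org.opensearch.common.settings.Settings;

// Hypothetical holder for the static search setting, mirroring the
// FeatureFlags-style initialization mentioned above.
public final class SearchBootstrapSettings {

    private static volatile Settings nodeSettings = Settings.EMPTY;

    // Called once during node startup with the node-level settings
    public static void initialize(Settings settings) {
        nodeSettings = settings;
    }

    public static int getMaxSliceCount() {
        // Placeholder key; falls back to 0 ("use Lucene default") when unset
        return nodeSettings.getAsInt("search.concurrent.max_slice_count", 0);
    }

    private SearchBootstrapSettings() {}
}
```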


sohami commented Jul 20, 2023

@reta @andrross I would like to get your feedback on this.
