
Conversation

@expani
Contributor

@expani expani commented Jul 31, 2025

Description

Enabling Intra Segment Concurrent Search for queries without aggregations.

Introduced Cluster and Index level settings similar to Concurrent Segment Search.

Additionally, there is an extra cluster/index setting named partition_size that controls how large segments are sliced/partitioned; segments with fewer docs than this threshold are not split.
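
A rough illustration of what such settings could look like follows; the keys, defaults, and scopes below are assumptions for the sketch, not the PR's actual definitions, and simply mirror the existing concurrent segment search settings.

import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Setting.Property;

// Hypothetical setting definitions for illustration only; names and defaults may differ from the PR.
public final class IntraSegmentSearchSettingsSketch {

    // Cluster-level mode setting, analogous to the concurrent segment search mode setting.
    public static final Setting<String> CLUSTER_INTRA_SEGMENT_SEARCH_MODE = Setting.simpleString(
        "search.concurrent_intra_segment_search.mode",  // assumed key
        "none",                                         // assumed default: disabled
        Property.Dynamic,
        Property.NodeScope
    );

    // Cluster-level partition size: segments with fewer docs than this are not split.
    // The 10_000 default is taken from the discussion further down in this thread.
    public static final Setting<Integer> CLUSTER_INTRA_SEGMENT_PARTITION_SIZE = Setting.intSetting(
        "search.concurrent_intra_segment_search.partition_size",  // assumed key
        10_000,
        1,
        Property.Dynamic,
        Property.NodeScope
    );

    // The index-level analogues would use an "index."-prefixed key with Property.IndexScope.
}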

Related Issues

Resolves #18851
Partially resolves #18849, as we are not introducing an auto mode or an execution hint in this PR.

Another limitation of the current PR is that Intra Segment Search will only work if track_total_hits is disabled. This will be addressed in a follow-up PR; it needs handling of TotalHitCountCollector similar to what Lucene did. I wrote more about this in #18854.

@github-actions github-actions bot added the discuss (Issues intended to help drive brainstorming and decision making), enhancement (Enhancement or improvement to existing feature or request), and Search (Search query, autocomplete ...etc) labels on Jul 31, 2025
@github-actions
Contributor

❌ Gradle check result for 76f81c6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

❌ Gradle check result for 5721e4f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

❌ Gradle check result for 2bb652b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

expani added 2 commits July 31, 2025 15:06
Signed-off-by: expani <anijainc@amazon.com>
Signed-off-by: expani <anijainc@amazon.com>
@expani expani force-pushed the intraseg_phase1 branch from 2bb652b to 8d9f29f on July 31, 2025 22:09
Signed-off-by: expani <anijainc@amazon.com>
@github-actions
Contributor

❌ Gradle check result for 1f3de07: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: expani <anijainc@amazon.com>
@github-actions
Contributor

github-actions bot commented Aug 1, 2025

❌ Gradle check result for 6115def: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: expani <anijainc@amazon.com>
@github-actions
Contributor

github-actions bot commented Aug 1, 2025

❌ Gradle check result for 39dd2ee: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

&& Boolean.TRUE.equals(requestShouldUseConcurrentIntraSegmentSearch.get());
}

public void evaluateRequestShouldUseIntraSegmentConcurrentSearch() {
Member

@sandeshkr419 sandeshkr419 Aug 1, 2025

Should this method be private?

Also, can we refactor it better to avoid multiple if-else:

private void evaluateRequestShouldUseIntraSegmentConcurrentSearch() {
    // Check for any condition that forces concurrency to be disabled.
    final boolean isUnsupported = (sort != null && sort.isSortOnTimeSeriesField())
                               || (aggregations() != null)
                               || (terminateAfter != DEFAULT_TERMINATE_AFTER)
                               || (trackTotalHitsUpTo != TRACK_TOTAL_HITS_DISABLED); // TODO: Need to handle TotalHitCountCollectorManager

    // Check whether the mode explicitly enables concurrency.
    final boolean isEnabledByMode = concurrentIntraSegmentSearchMode.equals(CONCURRENT_SEGMENT_SEARCH_MODE_ALL); // TODO: Handle auto mode here

    requestShouldUseConcurrentIntraSegmentSearch.set(isEnabledByMode && !isUnsupported);
}

@atris
Contributor

atris commented Aug 1, 2025

Few comments:

partition_size setting is confusing. Why not derive optimal partitions from segment size/thread pool capacity? Fixed partition sizes don't adapt to varying segment characteristics.

Disabling for track_total_hits=true makes sense initially, but this severely limits adoption. Most production queries need hit counts. It would be good to have a concrete plan for TotalHitCountCollector support.

Consider unified namespace: search.concurrent.segment.* and search.concurrent.intra_segment.*

partition_size as an integer is problematic. Segments vary from 1 KB to 10 GB. Consider one of the following:

  • Adaptive partitioning based on segment size
  • partition_size as a percentage/ratio (see the sketch after this list)
  • Separate settings for different segment size buckets
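
To make the percentage/ratio alternative concrete, here is a minimal sketch with illustrative names only (nothing in this PR); it derives a per-segment threshold from a ratio while keeping an absolute floor:

// Illustrative only: derive a per-segment partition size from a ratio of the segment's doc count,
// keeping an absolute floor so tiny segments are never partitioned.
static int partitionSizeFor(int segmentMaxDoc, double ratio, int minPartitionDocs) {
    int derived = (int) Math.max(1, segmentMaxDoc * ratio);
    return Math.max(derived, minPartitionDocs);
}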

Conditional logic in QueryPhase is getting complex. Extract strategy pattern:

interface SearchExecutionStrategy {
    boolean canExecute(SearchContext context);
    void execute(SearchContext context) throws IOException;
}

Thread safety is unclear for intra-segment collectors. Please document collector state isolation guarantees. Are FieldDoc comparisons thread-safe across partition boundaries?

Benchmarks providing the following data would be good:

  • Performance vs thread count (1,2,4,8,16 threads)
  • Small vs large segments (< 100MB vs > 1GB)
  • Query complexity impact (term vs boolean vs range)
  • Memory overhead per partition

How are results combined from N partitions? If it is a linear merge, that negates the benefits for large result sets; we should do a heap-based merge for top-K queries.
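
For reference, Lucene already provides a heap-based k-way merge of per-slice top hits via TopDocs.merge; a minimal sketch of reducing per-slice results that way (independent of whatever reduce path this PR actually uses):

import org.apache.lucene.search.TopDocs;

// Merge per-partition/per-slice top hits into a single top-K result once all slices finish.
// TopDocs.merge performs a priority-queue based merge of the already-sorted inputs, so the
// cost scales with K and the number of slices rather than with every collected hit.
// For field sorts, the TopDocs.merge(Sort, int, TopFieldDocs[]) overload applies.
final class TopKReducerSketch {
    static TopDocs reduce(int topK, TopDocs[] perSliceTopDocs) {
        return TopDocs.merge(topK, perSliceTopDocs);
    }
}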

What happens when partition_size > segment doc count? We should gracefully degrade to single-threaded execution, not fail or create empty partitions.

Sorting exclusion logic is incomplete. It only checks time series fields, but concurrent search breaks with any custom scoring/sorting that depends on document order across segment boundaries.

terminateAfter interaction seems problematic. If terminateAfter=100 and a segment is split into multiple partitions, each partition applies the limit independently, so the number of documents actually collected can exceed the intended 100. We should have global termination coordination.

Contributor

@atris atris left a comment

Please refer to: #18879 (comment)

@expani
Contributor Author

expani commented Aug 4, 2025

partition_size setting is confusing. Why not derive optimal partitions from segment size/thread pool capacity? Fixed partition sizes don't adapt to varying segment characteristics.

It is indicative of the smallest segment partition size below which splitting would not be performant. Consider a segment with 10 docs: we wouldn't want to split it into 3 partitions of [4, 3, 3] docs. Currently, the default of 10_000 means any segment with 9,999 docs or fewer won't be split any further.
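
A minimal sketch of that thresholding, with illustrative names rather than the PR's actual code:

// A segment is split only when it has at least `partitionSize` documents,
// and partitions are never created smaller than that threshold.
static int numPartitions(int segmentMaxDoc, int partitionSize) {
    if (segmentMaxDoc < partitionSize) {
        return 1;  // e.g. a 9_999-doc segment stays whole with the default of 10_000
    }
    return segmentMaxDoc / partitionSize;  // any remainder folds into the existing partitions
}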

Disabling for track_total_hits=true makes sense initially, but this severely limits adoption. Most production queries need hit counts. It would be good to have a concrete plan for TotalHitCountCollector support.

See #18854 for the same.

Conditional logic in QueryPhase is getting complex. Extract strategy pattern:

This is unclear; there are no changes in QueryPhase.

What happens when partition_size > segment doc count? We should gracefully degrade to single-threaded execution, not fail or create empty partitions.

We would never create empty partitions. It is handled in the code.

Sorting exclusion logic is incomplete. It only checks time series fields, but concurrent search breaks with any custom scoring/sorting that depends on document order across segment boundaries.

terminateAfter interaction seems problematic. If terminateAfter=100 and a segment is split into multiple partitions, each partition applies the limit independently, so the number of documents actually collected can exceed the intended 100. We should have global termination coordination.

These are existing limitations of Concurrent Segment Search which can be handled separately instead of tying it up with this PR.

Thread safety is unclear for intra-segment collectors. Please document collector state isolation guarantees. Are FieldDoc comparisons thread-safe across partition boundaries?

Every segment's partition will have its own collector. Any collector that shares state, like TotalHitCount, is disabled; the others have been kept the same as in Concurrent Segment Search. Let me know if you have any specific one in mind.

partition_size as an integer is problematic. Segments vary from 1 KB to 10 GB. Consider one of the following:
Adaptive partitioning based on segment size
partition_size as percentage/ratio
Separate settings for different segment size buckets

Partition size is there to ensure we don't split a segment below a certain threshold. It doesn't mean that every segment is split into partitions of exactly that size.

Currently, we use the LPT (Longest Processing Time) algorithm with Concurrent Segment Search, which benefits from small partition sizes for achieving an equal distribution. Consider a segment with 1_000_000 docs, a partition size of 1000, and 3 slices: after distribution it would end up as [334_000, 333_000, 333_000].
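
For context, LPT here is just the greedy "assign the largest remaining partition to the least-loaded slice" heuristic; the self-contained sketch below (illustrative, not the actual code from this PR or #18451) reproduces that distribution:

import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;

// Greedy LPT: take work items (partition doc counts) in descending order and always assign
// the next item to the currently least-loaded slice.
final class LptDistributionSketch {
    static long[] distribute(long[] partitionDocCounts, int numSlices) {
        long[] sliceLoads = new long[numSlices];
        PriorityQueue<Integer> leastLoadedFirst = new PriorityQueue<>(Comparator.comparingLong((Integer i) -> sliceLoads[i]));
        for (int slice = 0; slice < numSlices; slice++) {
            leastLoadedFirst.add(slice);
        }
        long[] sorted = partitionDocCounts.clone();
        Arrays.sort(sorted);  // ascending; iterate from the end to process the largest first
        for (int p = sorted.length - 1; p >= 0; p--) {
            int slice = leastLoadedFirst.poll();  // least-loaded slice right now
            sliceLoads[slice] += sorted[p];
            leastLoadedFirst.add(slice);          // re-insert with its updated load
        }
        return sliceLoads;
    }

    public static void main(String[] args) {
        long[] partitions = new long[1_000];      // 1_000_000 docs split into 1_000-doc partitions
        Arrays.fill(partitions, 1_000L);
        // Prints a permutation of [334000, 333000, 333000]
        System.out.println(Arrays.toString(distribute(partitions, 3)));
    }
}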

@atris
Contributor

atris commented Aug 5, 2025

Thread safety is unclear for intra-segment collectors. Please document collector state isolation guarantees. Are FieldDoc comparisons thread-safe across partition boundaries?

Every segment's partition will have its own collector. Any collector that shares state, like TotalHitCount, is disabled; the others have been kept the same as in Concurrent Segment Search. Let me know if you have any specific one in mind.

The design seems sound. I had these follow-ups:

  1. Result merging: Are partition results merged using standard Lucene TopDocs.merge() after all partitions complete? This is thread-safe.
  2. Collector validation: Where in the code do you prevent stateful collectors? I see aggregations are disabled in evaluateRequestShouldUseIntraSegmentConcurrentSearch() but is there broader validation?
  3. Custom collectors: How do you handle user-provided collectors that might have shared state?

@jainankitk
Contributor

partition_size setting is confusing. Why not derive optimal partitions from segment size/thread pool capacity? Fixed partition sizes don't adapt to varying segment characteristics.

I also prefer to derive the partition size as a function of total # of docs / # of slices. There is a possibility of creating very small partitions for some of the segments in this case, which we can better manage using a percentage-based threshold setting. But I do want to ensure that the number of unique segments being partitioned is minimal while achieving a best-effort equal distribution of documents across the different slices.

@expani
Contributor Author

expani commented Aug 5, 2025

I also prefer to derive the partition size as a function of total # of docs / # of slices. There is a possibility of creating very small partitions for some of the segments in this case, which we can better manage using a percentage-based threshold setting.

@jainankitk The primary question we need to answer is what's the smallest size a partition of a segment can have. It only makes sense to partition a segment if Time(create Collector, BulkScorer, DISI) << Time(score/match the documents), which is very dependent on the complexity of the query. Aggregations add another layer of complexity.

A percentage-based threshold can still allow very small partition sizes if a shard only has small segments. So we need to derive a sensible default from our existing known workloads like Big5, HTTP Logs, etc. and still give control to end users via a cluster setting, an index setting, and an execution hint.

But I do want to ensure that the number of unique segments being partitioned is minimal while achieving a best-effort equal distribution of documents across the different slices.

The primary goal is to reduce the overall query execution time and make it better than Concurrent Segment Search, so we need to achieve a better distribution. Since we are currently using the LPT (Longest Processing Time) algorithm from #18451, it performs better with smaller partition sizes.

I am planning to ditch this PR in favor of a different one where I will expand more along these lines and give some pseudocode for the implementation before diving into code. Will post in #18851.

@expani
Contributor Author

expani commented Aug 5, 2025

Result merging: Are partition results merged using standard Lucene TopDocs.merge() after all partitions complete? This is thread-safe.

One collector is created per slice, and the thread in the search thread pool waits for the results from all slices to complete on the index_searcher thread pool.

Collector validation: Where in the code do you prevent stateful collectors? I see aggregations are disabled in evaluateRequestShouldUseIntraSegmentConcurrentSearch() but is there broader validation?

Aggregations and TotalHitCount are disabled, which are the stateful collectors that I am aware of at the moment. If you feel there is a missing case, do let me know.

Custom collectors: How do you handle user-provided collectors that might have shared state?

That's a good point. Do you mean via plugins extending SearchPlugin?

Will add the same in #18851

Contributor

@jainankitk jainankitk left a comment

I am planning to ditch this PR in favor of a different one where I will expand more along these lines and give some pseudocode for the implementation before diving into code. Will post in #18851.

Since we don't want to continue this PR in favor of a slightly different approach, can we move it to a draft state?
