Collaborator
@Swiddis Swiddis commented Sep 22, 2025

Description

In local benchmarking of merge operations, I saw we were spending a lot of time waiting for synchronous fetching of batches across both indices.

Because of the PIT-based design, we can't parallelize page fetches directly, but one low-hanging fruit here is to start fetching the next batch as soon as we get the current one, so by the time we start the next batch it'll already be halfway ready. This cuts enumerated merge times by ~40%.

To implement this safely, this PR needs to do a few things:

  • Register a new thread pool that has authentication context (we can't run background threads if we don't do this)
    • See SQLPlugin.java changes. I also fixed our thread configuration settings.
    • We need a new pool as we'll hang the worker pool if there's only one thread.
  • Safely handle whether we have a NodeClient or not within the Calcite enumeration inner loop
    • This was the interface change in OpenSearchClient.java, I did several plumbing changes around that update.
  • Actually implement the background scanner, with a fallback to synchronous scanning if we're missing node context. BackgroundSearchScanner.java
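The steps above boil down to a one-batch-ahead pipeline: while the caller consumes batch N, batch N+1 is already in flight on a background executor. Here is a minimal, self-contained sketch of that pattern using only `java.util.concurrent`; the names (`BackgroundScanner`, `fetchBatch`) are illustrative stand-ins, not the PR's actual API, and `fetchBatch` fakes a paged search call.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of the one-batch-ahead prefetch pattern described above.
public class BackgroundScanner {
  private final ExecutorService executor;
  private CompletableFuture<List<Integer>> nextBatch;
  private int cursor = 0;

  BackgroundScanner(ExecutorService executor) {
    this.executor = executor;
  }

  // Stand-in for a paged search call (e.g. one PIT/search_after page).
  private List<Integer> fetchBatch(int from) {
    return List.of(from, from + 1, from + 2);
  }

  void start() {
    // Kick off the first fetch immediately.
    nextBatch = CompletableFuture.supplyAsync(() -> fetchBatch(cursor), executor);
  }

  List<Integer> takeBatch() {
    // Block for the in-flight batch, then immediately prefetch the next one,
    // so the network wait overlaps with the caller's processing time.
    List<Integer> batch = nextBatch.join();
    cursor += batch.size();
    nextBatch = CompletableFuture.supplyAsync(() -> fetchBatch(cursor), executor);
    return batch;
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(1);
    BackgroundScanner scanner = new BackgroundScanner(pool);
    scanner.start();
    System.out.println(scanner.takeBatch()); // [0, 1, 2]
    System.out.println(scanner.takeBatch()); // [3, 4, 5]
    pool.shutdown();
  }
}
```

The dedicated executor is what makes the second bullet necessary: if prefetch tasks shared the worker pool, a single-threaded pool could deadlock waiting on its own queued fetch.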

Some alternatives for the long-term:

In draft pending testing.

Related Issues

N/A

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Simeon Widdis <sawiddis@amazon.com>
@Swiddis Swiddis added the enhancement New feature or request label Sep 22, 2025
@Swiddis
Collaborator Author

Swiddis commented Sep 23, 2025

The security IT failures are confusing me here -- they're all failing consistently, but the changed code doesn't show up anywhere in any of the stack traces.

@Swiddis
Collaborator Author

Swiddis commented Sep 25, 2025

Some additional testing info:

I took 5 million records from the big5 benchmarking dataset and compared the current mainline with this.

First, as sanity, the results are the same for one of the queries requiring a full index enumeration:

source = big5
| eval range_bucket = case(
   `metrics.size` < -10, 'range_1',
   `metrics.size` >= -10 and `metrics.size` < 10, 'range_2',
   `metrics.size` >= 10 and `metrics.size` < 100, 'range_3',
   `metrics.size` >= 100 and `metrics.size` < 1000, 'range_4',
   `metrics.size` >= 1000 and `metrics.size` < 2000, 'range_5',
   `metrics.size` >= 2000, 'range_6')
| stats count() by range_bucket, span(`@timestamp`, 1h) as auto_span
| sort + range_bucket, + auto_span

Current mainline:

fetched rows / total rows = 48/48
+---------+---------------------+--------------+
| count() | auto_span           | range_bucket |
|---------+---------------------+--------------|
| 122464  | 2022-12-31 16:00:00 | range_5      |
| 121585  | 2022-12-31 17:00:00 | range_5      |
| 122052  | 2022-12-31 18:00:00 | range_5      |
| 122220  | 2022-12-31 19:00:00 | range_5      |
| 122163  | 2022-12-31 20:00:00 | range_5      |
| 121840  | 2022-12-31 21:00:00 | range_5      |
| 121606  | 2022-12-31 22:00:00 | range_5      |
| 121889  | 2022-12-31 23:00:00 | range_5      |
| 121088  | 2023-01-01 00:00:00 | range_5      |
| 121943  | 2023-01-01 01:00:00 | range_5      |

After update:

fetched rows / total rows = 48/48
+---------+---------------------+--------------+
| count() | auto_span           | range_bucket |
|---------+---------------------+--------------|
| 122464  | 2022-12-31 16:00:00 | range_5      |
| 121585  | 2022-12-31 17:00:00 | range_5      |
| 122052  | 2022-12-31 18:00:00 | range_5      |
| 122220  | 2022-12-31 19:00:00 | range_5      |
| 122163  | 2022-12-31 20:00:00 | range_5      |
| 121840  | 2022-12-31 21:00:00 | range_5      |
| 121606  | 2022-12-31 22:00:00 | range_5      |
| 121889  | 2022-12-31 23:00:00 | range_5      |
| 121088  | 2023-01-01 00:00:00 | range_5      |
| 121943  | 2023-01-01 01:00:00 | range_5      |

Second, I wanted to benchmark and check for impact. I already tested with joins and it's ~40% faster, but for non-joins we potentially pay overhead for nothing.

For the slowest big5 queries (BG fetches on the left, sync fetches on the right), we see slight perf gains:
[screenshot: benchmark comparison for the slowest queries]

For the fastest ones, the performance is approximately the same (some minor latency and throughput diffs but I'm not confident that this isn't just random variation):
[screenshot: benchmark comparison for the fastest queries]

settings,
AsyncRestExecutor.SQL_WORKER_THREAD_POOL_NAME,
SQL_WORKER_THREAD_POOL_NAME,
OpenSearchExecutors.allocatedProcessors(settings),
Member

I am not sure if this is the best number.

Collaborator Author

Ideally it should match the number of search threads since that's where all the requests go, maybe I can find where that number is stored and do a lookup.

Collaborator Author

Updated to pull the search thread pool count if available, otherwise fall back to the node's processor count. This is what it looks like if you limit the search thread pool under heavy load:

[screenshot: SQL worker thread pool metrics under heavy load]

Intuitively this seems like a pretty informative view of what state the cluster's in regarding SQL queries.
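The sizing rule discussed here -- prefer the configured search thread pool size, otherwise fall back to the processor count -- can be sketched in a few lines. This is a hedged illustration: the settings key `thread_pool.search.size` and the plain `Map` stand in for the real OpenSearch `Settings` lookup.

```java
import java.util.Map;

// Illustrative sketch of the pool-sizing fallback: use the configured search
// pool size when present, otherwise the node's processor count.
public class PoolSize {
  static int poolSize(Map<String, String> settings, int processors) {
    String configured = settings.get("thread_pool.search.size");
    return configured != null ? Integer.parseInt(configured) : processors;
  }

  public static void main(String[] args) {
    System.out.println(poolSize(Map.of("thread_pool.search.size", "13"), 8)); // 13
    System.out.println(poolSize(Map.of(), 8)); // 8
  }
}
```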

public void startScanning(OpenSearchRequest request) {
  if (isAsync()) {
    nextBatchFuture =
        CompletableFuture.supplyAsync(() -> client.search(request), backgroundExecutor);
Member

What am I missing? Why do we need the context copy in the other file? Can you print the thread context in this task and see if it has user credentials in case of FGAC?

PPLPermissionsIT: can we add a join test here?

Collaborator Author

This isn't just used for joins; every query goes through this interface. I realized while benchmarking that every query was hitting the pool. That the security ITs pass means either this works or we don't have security ITs.

I believe it works because we supply the cluster settings during construction of the executor, so the context is built into the thread pool (as opposed to starting a fresh thread outside any executor).

@vamsimanohar
Member

Added a few comments. Good one 👍.

@Swiddis Swiddis added the v3.3.0 label Sep 29, 2025
vamsimanohar
vamsimanohar previously approved these changes Sep 30, 2025
  }
  this.client = client;
  this.bgScanner = new BackgroundSearchScanner(client);
  this.bgScanner.startScanning(request);
Collaborator

Is this constructor only called once per query? I found these comments; could you double-confirm?

  /**
   * This Enumerator may be iterated for multiple times, so we need to create opensearch request for
   * each time to avoid reusing source builder. That's because the source builder has stats like PIT
   * or SearchAfter recorded during previous search.
   */
  @Override
  public Enumerable<@Nullable Object> scan() {
    return new AbstractEnumerable<>() {
      @Override
      public Enumerator<Object> enumerator() {
        OpenSearchRequestBuilder requestBuilder = getOrCreateRequestBuilder();
        return new OpenSearchIndexEnumerator(
            osIndex.getClient(),
            getFieldPath(),
            requestBuilder.getMaxResponseSize(),
            requestBuilder.getMaxResultWindow(),
            osIndex.buildRequest(requestBuilder),
            osIndex.createOpenSearchResourceMonitor());
      }
    };
  }

Collaborator Author

It's the same as the current behavior, right? If you recreate the enumerator with a new client, you erase all of its current state and start a new search. In that snippet it looks like the search is deliberately meant to be restarted multiple times.

Collaborator

The concern is: if the scan() method is called multiple times during the planning stage, it will invoke startScanning multiple times.

Collaborator Author

Woah, I wouldn't have expected planning to make a call to scan(); that seems weird. I can try to find a better way to handle that, but scan intuitively means "actually start scanning something" to me.
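One common way to defuse this kind of repeated-call concern is to make the start idempotent, so that even if scan() is invoked several times during planning, only the first call triggers a fetch. This is a hypothetical sketch of that guard, not the PR's actual resolution:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative idempotent-start guard: repeated calls to startScanning()
// kick off at most one background fetch.
public class IdempotentStart {
  private final AtomicBoolean started = new AtomicBoolean(false);
  private int fetchCount = 0;

  void startScanning() {
    // Only the first caller wins; subsequent calls are no-ops.
    if (started.compareAndSet(false, true)) {
      fetchCount++; // stand-in for submitting the background fetch
    }
  }

  public static void main(String[] args) {
    IdempotentStart s = new IdempotentStart();
    s.startScanning();
    s.startScanning();
    s.startScanning();
    System.out.println(s.fetchCount); // 1
  }
}
```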

Comment on lines +143 to +144
nextBatchFuture =
    CompletableFuture.supplyAsync(() -> client.search(request), backgroundExecutor);
Collaborator

What if the fixed thread pool is full? Should this fall back to sync?

Collaborator Author

@Swiddis Swiddis Sep 30, 2025

We just buffer: #4345 (comment)

If we fall back to sync, it eliminates the utility of being able to directly view/control the active SQL network requests via the BG thread pool.
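The "we just buffer" behavior can be demonstrated with a plain `ThreadPoolExecutor`: a fixed pool backed by an unbounded `LinkedBlockingQueue` accepts submissions past its thread count and queues them rather than rejecting them or running them on the caller's thread. A small self-contained demonstration (names are illustrative):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Demonstrates that a fixed pool with an unbounded queue buffers excess
// tasks instead of rejecting them, which is why no sync fallback is needed.
public class BufferingDemo {
  public static void main(String[] args) throws Exception {
    ThreadPoolExecutor pool =
        new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    CountDownLatch release = new CountDownLatch(1);

    pool.execute(() -> { // occupies the single worker thread
      try { release.await(); } catch (InterruptedException e) { }
    });
    pool.execute(() -> { }); // buffered in the queue, not rejected
    pool.execute(() -> { }); // buffered in the queue, not rejected

    System.out.println("queued=" + pool.getQueue().size()); // queued=2
    release.countDown();
    pool.shutdown();
    pool.awaitTermination(5, TimeUnit.SECONDS);
  }
}
```

The trade-off noted above is that buffered work stays visible in the pool's queue metrics, whereas a sync fallback would hide it from that view.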

@Swiddis Swiddis removed the v3.3.0 label Sep 30, 2025
@Swiddis Swiddis added the calcite calcite migration releated label Oct 3, 2025
@Swiddis
Copy link
Collaborator Author

Swiddis commented Oct 7, 2025

Turns out I flipped the benchmark in my head, so this is overall a regression -- going to put it back in draft and figure out a better approach.

@Swiddis Swiddis marked this pull request as draft October 7, 2025 23:42
