Skip to content

Conversation

@Swiddis
Copy link
Collaborator

@Swiddis Swiddis commented Feb 3, 2026

Description

TPCH is flaky again in CI (#5103 / #4261), but I haven't seen it failing in main for a while and I see we have the workaround that waits for docs to be reported.

I think the situation is that the CI server is running a multi-node integ test, while our ITs normally are single-node. For multiple nodes, it may be the case that one node reports documents while the other doesn't. On that hypothesis, this change forces the tests to wait for all shards to accept new documents.

I tried just wait_for_active_shards=all but this causes integ tests to hang on single-node setups because there's nowhere to put the replica. So in addition I added a helper that sets the replica count to 0 on integ test indices, which should fix both the waiting issue and the above routing issue.

Related Issues

Resolves #5103
Resolves #4261 (I hope)

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Simeon Widdis <sawiddis@amazon.com>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 3, 2026

📝 Walkthrough

Summary by CodeRabbit

  • Tests
    • Updated test index creation to set replicas to zero
    • Modified bulk data insertion refresh behavior for improved synchronization
    • Streamlined test setup by removing unnecessary wait conditions

Walkthrough

Test utilities now always send an index JSON body with settings.index.number_of_replicas = 0 when creating indices, bulk-loads use refresh=wait_for&wait_for_active_shards=all, and an integration test wait-for-data loop after index creation was removed to return immediately after loading existing indices.

Changes

Cohort / File(s) Summary
Test utilities: index creation & bulk load
integ-test/src/test/java/org/opensearch/sql/legacy/TestUtils.java
Always build/send an index JSON body when creating an index; added private helper to set settings.index.number_of_replicas = 0. Changed bulk insert URL from ?refresh=true to ?refresh=wait_for&wait_for_active_shards=all.
Integration test flow: remove post-create wait
integ-test/src/test/java/org/opensearch/sql/legacy/SQLIntegTestCase.java
Removed the post-index-creation polling loop that waited for documents (and its InterruptedException handling); the function now returns immediately after loading the index if it exists.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Another TPCH fix: Wait for all shards' accurately reflects the main change: modifying shard synchronization behavior in integration tests to address TPCH flakiness.
Description check ✅ Passed The description clearly explains the problem (TPCH flakiness in CI), the hypothesis (multi-node vs single-node environments), and the implemented solutions (waiting for all shards and setting replicas to 0).
Linked Issues check ✅ Passed The changes directly address the root causes identified in #5103 and #4261: waiting for all shards to prevent timing issues in multi-node setups and setting replicas to 0 to avoid hangs on single-node test environments.
Out of Scope Changes check ✅ Passed All changes are focused on fixing TPCH integration test reliability: modifying bulk insert refresh behavior, adding replica count helper, and removing unnecessary wait-for-docs loop, all directly scoped to the linked issues.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

RyanL1997
RyanL1997 previously approved these changes Feb 3, 2026
@Swiddis Swiddis enabled auto-merge (squash) February 3, 2026 18:17
@Swiddis
Copy link
Collaborator Author

Swiddis commented Feb 3, 2026

nvm, this hangs because it waits forever for replicas on a single node. But I can work around that...

Signed-off-by: Simeon Widdis <sawiddis@amazon.com>
Signed-off-by: Simeon Widdis <sawiddis@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Integration Test Failed for sql-3.5.0 [BUG] CalcitePPLTcphIT is flaky

2 participants