@atris atris commented Aug 28, 2025

The cancellation tests could deadlock when threads are delayed by OS
scheduling. If cancellation triggers before all threads start, late
threads may hit a code path where batchReduceSize causes the latch
callback to be deferred to a MergeTask. Under certain timing conditions,
these callbacks never execute, causing latch.await() to hang indefinitely.

Ensure latch.countDown() is always called by wrapping consumeResult in
try-catch. This guarantees test completion regardless of cancellation
timing or exceptions.
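
The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class `LatchSketch`, the simplified `consumeResult(AtomicBoolean, Runnable)` signature, and the `countOnce` guard are all hypothetical stand-ins. The sketch also uses `finally` with a once-only guard rather than a bare try-catch, so the latch is released exactly once whether the callback ran, was skipped, or the consumer threw.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class LatchSketch {
    // Hypothetical stand-in for the consumer under test: it rejects work once
    // the task is cancelled, so the callback is not invoked on that path.
    static void consumeResult(AtomicBoolean cancelled, Runnable next) {
        if (cancelled.get()) {
            throw new IllegalStateException("task cancelled");
        }
        next.run();
    }

    // Starts the given number of threads, cancels partway through, and reports
    // whether every thread released the latch within the timeout.
    static boolean runWorkers(int workers) {
        CountDownLatch latch = new CountDownLatch(workers);
        AtomicBoolean cancelled = new AtomicBoolean(false);
        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(() -> {
                // Guard so each worker decrements the latch at most once.
                AtomicBoolean counted = new AtomicBoolean(false);
                Runnable countOnce = () -> {
                    if (counted.compareAndSet(false, true)) {
                        latch.countDown();
                    }
                };
                try {
                    consumeResult(cancelled, countOnce);
                } catch (RuntimeException e) {
                    // Expected for workers that start after cancellation.
                } finally {
                    // Release the latch even if the callback never ran.
                    countOnce.run();
                }
            });
            t.start();
            if (i == workers / 2) {
                // Simulate cancellation arriving before all threads have started.
                cancelled.set(true);
            }
        }
        try {
            // A bounded wait keeps a regression from hanging the test run.
            return latch.await(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("all workers released the latch: " + runWorkers(8));
    }
}
```

Without the `finally` block, any worker that starts after cancellation would throw before invoking its callback, and an unbounded `latch.await()` would then never return.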

Fixes #19094

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Atri Sharma <atri.jiit@gmail.com>
@atris atris requested a review from a team as a code owner August 28, 2025 16:00
@github-actions github-actions bot added the >test-failure (Test failure from CI, local build, etc.), autocut, and flaky-test (Random test failure that succeeds on second run) labels Aug 28, 2025
@github-actions
✅ Gradle check result for ef3dcac: SUCCESS

codecov bot commented Aug 28, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.96%. Comparing base (f5d41fb) to head (ef3dcac).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #19171      +/-   ##
============================================
+ Coverage     72.95%   72.96%   +0.01%     
+ Complexity    69701    69695       -6     
============================================
  Files          5655     5655              
  Lines        319867   319873       +6     
  Branches      46337    46338       +1     
============================================
+ Hits         233364   233410      +46     
+ Misses        67584    67552      -32     
+ Partials      18919    18911       -8     

@andrross andrross merged commit 074b9d3 into opensearch-project:main Aug 28, 2025
35 of 36 checks passed
@kaushalmahi12
Thanks, @atris, for addressing this.

atris added a commit to atris/OpenSearch that referenced this pull request Aug 28, 2025
pranikum pushed a commit to pranikum/OpenSearch that referenced this pull request Sep 4, 2025
kh3ra pushed a commit to kh3ra/OpenSearch that referenced this pull request Sep 5, 2025
jainankitk pushed a commit to jainankitk/OpenSearch that referenced this pull request Sep 22, 2025
jainankitk pushed a commit to jainankitk/OpenSearch that referenced this pull request Sep 22, 2025
Signed-off-by: Ankit Jain <jainankitk@apache.org>
jainankitk pushed a commit to jainankitk/OpenSearch that referenced this pull request Sep 22, 2025
Signed-off-by: Ankit Jain <jainankitk@apache.org>
asimmahmood1 pushed a commit to jainankitk/OpenSearch that referenced this pull request Sep 23, 2025
vinaykpud pushed a commit to vinaykpud/OpenSearch that referenced this pull request Sep 26, 2025
Labels

autocut, flaky-test (Random test failure that succeeds on second run), skip-changelog, >test-failure (Test failure from CI, local build, etc.)

Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for SearchPhaseControllerTests

3 participants