Description
Background
Flaky tests in OpenSearch's core `./gradlew check` task frequently impact contributors and developers, and block pull requests. At the time this issue was created there were 130 flaky tests (per the link).
Automation already exists to detect these failures and create a GitHub issue with a comprehensive failure report, including links to impacted pull requests (contributors), relevant commits, the Gradle check build logs, and the full test report; the detector leverages OpenSearch Metrics data.
To find the failed CI links or locate the failing test(s), developers often use the Gradle Check Metrics Dashboard.
However, detection is only one part of the solution. What's missing is a repeatable process to drive these issues to closure and reduce recurring flakiness over time. While we have this automation today, there is currently no structured mechanism to tackle, resolve, and close these issues. This leads to:
- Long-standing flaky test issues with no resolution.
- Contributors repeatedly impacted by the same failures.
- Contributors today either close and reopen the PR or push a new commit to retry the Gradle check.
- Maintainers manually retry the failed build.
Supporting References
Past GitHub Issues Related to this topic:
- Better visibility into test failures over time #11217
- Add additional details on Gradle Check failures autocut issues #13950
- [Automation Enhancement] Mechanism to close the created Gradle Check AUTOCUT flaky test issues. #14475
- Gradle Check Optimization #13786
- Add failure analytics for OpenSearch #3713
- [Meta] Fix random test failures #1715
Some mechanisms we can incorporate:
Retry mechanism for failed tests
- Currently, retrying specific Gradle tests is broken due to the transition to the Gradle Develocity plugin (see Update gradle config for develocity rebranding #13942). Although we've retained the Gradle Enterprise plugin (now rebranded as Develocity) from the original fork, we don't use it since we lack the required license. As a result, the retry functionality tied to this plugin is not working as expected, and I suspect this may be contributing to the recent instability. I've created a PR attempting to address this: Retry All Tests with a Reduced Retry Limit of 1 #17939.
Reporting enhancement with seed information
- With today's automation that creates a GitHub issue containing a comprehensive failure report, I propose that we also include the `tests.seed` information used during the failure. I've observed that some failures are genuine and reproducible with a specific seed, for example:
  - `./gradlew ':server:internalClusterTest' --tests "org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode" -Dtests.seed=8529B1DD622216C1`
  - `./gradlew ':server:internalClusterTest' --tests "org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode" -Dtests.seed=AE568A72925374C5`

  While this information is available in the Jenkins logs, adding the seed directly to the failure report would make it easier to identify tests with reproducible failures (see the first sketch after this list).
- Building on the retry mechanism above, I propose we also surface the tests that only pass on retry and fix them proactively; we should not keep retrying tests just to get them green (see the second sketch after this list).
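To make the seed proposal concrete, here is a minimal sketch (shell) of how the reporting automation could lift the reproduce lines, seed included, out of a Gradle check console log and quote them in the autocut issue. The build URL is a placeholder, and it assumes the log still contains the `REPRODUCE WITH:` lines printed by the test framework on failure.

```bash
# Minimal sketch: extract reproduce lines (including -Dtests.seed) from a Gradle
# check console log so the failure report can quote them verbatim.
# The URL below is a placeholder, not a real build.
LOG_URL="https://build.ci.opensearch.org/job/gradle-check/12345/consoleText"

curl -sL "$LOG_URL" \
  | grep -o 'REPRODUCE WITH: ./gradlew .*' \
  | sort -u
```

And for surfacing tests that only go green on retry, one hedged option, assuming the console log prints per-test `PASSED`/`FAILED` lines, is to list the test names that appear with both outcomes in the same build:

```bash
# Minimal sketch: tests that FAILED at least once but also PASSED in the same log,
# i.e. candidates that only went green after a retry. Assumes "Class > test PASSED"
# / "Class > test FAILED" style lines appear in the console output.
LOG="consoleText.txt"

comm -12 \
  <(grep ' FAILED$' "$LOG" | sed 's/ FAILED$//' | sort -u) \
  <(grep ' PASSED$' "$LOG" | sed 's/ PASSED$//' | sort -u)
```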
Flaky Test Tracking (Ongoing)
- Periodically track the open flaky test issues and keep the count down. We can leverage the community meetings: https://forum.opensearch.org/tag/proj-health-agenda (see the sketch below for one way to pull the count).
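One way to pull the number for the meeting is the GitHub CLI; this is a rough sketch and the `flaky-test` label is an assumption, so substitute whatever label the autocut automation actually applies.

```bash
# Minimal sketch: count open flaky-test issues in opensearch-project/OpenSearch.
# The "flaky-test" label is an assumption; adjust to the label the autocut issues use.
gh issue list \
  --repo opensearch-project/OpenSearch \
  --label "flaky-test" \
  --state open \
  --limit 1000 \
  --json number \
  --jq 'length'
```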
Tag the test author
We can use the approach discussed here #17934 (comment).
Related Child Issue #18271
- This could be ambitious, but maybe we can have automation that tags the flaky test owner? Using the OpenSearch Gradle Check Metrics Dashboard I was able to see the top 100 flaky tests over the past year. At the very least, we could have a mechanism to tag the owners of the top offenders.
- To begin with, we can use the `git log` CLI to find the most recent commit author of the failing test and tag them in a comment on the flaky report issue (see the sketch below). There could be other ways to do this, but I'm open to thoughts.
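As a starting point for the `git log` idea, a minimal sketch is below. The test file path and issue number are placeholders, and mapping the commit email to a GitHub login would need an extra lookup; this only posts the author name from the last commit that touched the file.

```bash
# Minimal sketch: find the most recent author of a failing test file and mention
# them on the autocut issue. The path and issue number below are placeholders.
TEST_FILE="server/src/internalClusterTest/java/org/opensearch/snapshots/DedicatedClusterSnapshotRestoreIT.java"
ISSUE_NUMBER=12345

# Last commit author (name and email) that touched the test file.
AUTHOR=$(git log -1 --format='%an <%ae>' -- "$TEST_FILE")

gh issue comment "$ISSUE_NUMBER" \
  --repo opensearch-project/OpenSearch \
  --body "Most recent author of \`$TEST_FILE\`: $AUTHOR. Could you take a look at this flaky test?"
```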
Identify flaky test proactively
- Before merging the PR: similar to the benchmark workflow we have today, which runs on request and approval, we can have a workflow where, even before a PR is merged, we run the targeted tests `N` times in a row with multiple combinations to detect flakiness. This gives us some confidence before we merge the PR (in this process we have to mute/exclude the known flaky tests so that new failures can be attributed precisely); see the sketch after this list. Related issue: Workflow to proactively run newly added/updated OpenSearch Core Gradle tests before PR merge opensearch-build#5481.
- After merging the PR (post-action): today we run `gradle check` on a commit after the PR is merged; we should do this more often and periodically to find flaky tests, and for an identified commit we can have automation to create a PR reverting it. A similar topic is discussed in [FEATURE] Introduce commit queue on Jenkins (main branch only) to proactively spot flaky tests opensearch-build#4810.
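For the pre-merge idea, here is a minimal sketch of running one test class `N` times in a row with a fresh random seed on each run; the test class, module, and `N` are placeholders.

```bash
# Minimal sketch: run a single test class N times with a different random seed each
# time to estimate flakiness before merge. Class, module, and N are placeholders.
TEST_CLASS="org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT"
N=10
FAILURES=0

for i in $(seq 1 "$N"); do
  SEED=$(openssl rand -hex 8 | tr '[:lower:]' '[:upper:]')
  echo "Run $i/$N with seed $SEED"
  if ! ./gradlew ':server:internalClusterTest' \
         --tests "$TEST_CLASS" \
         -Dtests.seed="$SEED" \
         --rerun-tasks; then
    FAILURES=$((FAILURES + 1))
  fi
done

echo "$FAILURES/$N runs of $TEST_CLASS failed"
```

If the randomized runner's `tests.iters` property is available in the build, it can achieve something similar within a single Gradle invocation.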
Related component
Build