
[Meta] Fix random test failures #1715

Closed
30 of 37 tasks
anasalkouz opened this issue Dec 13, 2021 · 24 comments
Labels
enhancement (Enhancement or improvement to existing feature or request) · flaky-test (Random test failure that succeeds on second run) · Meta (Meta issue, not directly linked to a PR)

Comments

@anasalkouz
Member

anasalkouz commented Dec 13, 2021

PRs have been blocked multiple times by transient gradle check errors. We need a plan to stabilize the tests.

@anasalkouz anasalkouz added enhancement Enhancement or improvement to existing feature or request flaky-test Random test failure that succeeds on second run labels Dec 13, 2021
@anasalkouz anasalkouz changed the title from "Put a Plan" to "Put a plan for the flaky random test failures" Dec 13, 2021
@anasalkouz anasalkouz changed the title from "Put a plan for the flaky random test failures" to "Put a plan for the flakey random test failures" Dec 13, 2021
@andrross
Member

andrross commented Dec 14, 2021

I ran a quick experiment on my dev machine, running internalClusterTest in a loop all night:

for i in $(seq 0 1000) ; do echo "Iteration: $i" && ./gradlew ':server:internalClusterTest' >> test-output.txt 2>&1 ; done

Results:

$ egrep 'BUILD (SUCCESSFUL|FAILED)' test-output.txt | wc -l
152
$ egrep 'BUILD FAILED' test-output.txt | wc -l
3
$ egrep '^REPRODUCE' test-output.txt | less -S | uniq
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureSettingsIT.testShardIndexingPressureLastSuccessfulSettingsUpdate" -Dtests.seed=7B8B067879F3C91F -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=en -Dtests.timezone=Brazil/West -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureSettingsIT.testShardIndexingPressureEnforcedEnabledDisabledSetting" -Dtests.seed=9F8306D99E2C2EF1 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=id -Dtests.timezone=Asia/Aqtau -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureSettingsIT.testShardIndexingPressureLastSuccessfulSettingsUpdate" -Dtests.seed=6D39D8439C254FF0 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-VE -Dtests.timezone=Pacific/Honolulu -Druntime.java=17

All 3 failures were caused by "Suite timeout exceeded (>= 1200000 msec)."
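A small helper can turn a log produced by the loop above into a failure rate. This is a minimal sketch, not part of the original experiment: the function name failure_rate is hypothetical, and it assumes every iteration ends with Gradle's "BUILD SUCCESSFUL" or "BUILD FAILED" line, which is what the egrep commands above rely on. It uses grep -E, the non-deprecated spelling of egrep.

```shell
# Hypothetical helper: summarize a log written by the overnight loop.
# Counts Gradle build outcomes and prints the failure rate.
failure_rate() {
  local log="$1" total failed
  # grep -c prints 0 (and exits 1) when nothing matches; the count is still usable.
  total=$(grep -cE 'BUILD (SUCCESSFUL|FAILED)' "$log")
  failed=$(grep -cE 'BUILD FAILED' "$log")
  awk -v f="$failed" -v t="$total" 'BEGIN {
    if (t == 0) { print "no builds found"; exit 1 }
    printf "%d/%d builds failed (%.1f%%)\n", f, t, 100 * f / t
  }'
}

# Usage: failure_rate test-output.txt
```

Run against a log with the counts shown above, it would print "3/152 builds failed (2.0%)".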

From this I'd propose a couple of hypotheses:

  1. There is a bug in the logic of ShardIndexingPressureSettingsIT that sometimes causes it to hang and fail with the overall test timeout. See the previous issue where this same failure occurred.
  2. Failures we see in the PR workflows that run ./gradlew check often manifest somewhere in :server:internalClusterTest, but they are not the result of buggy logic within the tests themselves. Instead, they are the result of interference between Gradle tasks running concurrently, or some other problem with the CI environment. (I make this claim because the ~2% failure rate observed in my experiment seems much lower than the failure rate we're observing in the PR checks.)

I'm going to repeat my experiment but run the full check task instead of just :server:internalClusterTest. If hypothesis 2 is correct, I should see a higher failure rate than the 3 out of 152 observed in this first experiment.

Dev environment:

  • OS: Ubuntu 20.04
  • Host type: c6i.8xlarge
  • Branch: main at 309649ce8a

@saratvemulapalli
Member

Another flaky test:
Coming from: #1725

* What went wrong:
Execution failed for task ':qa:rolling-upgrade:v1.3.0#oldClusterTest'.
> `node{:qa:rolling-upgrade:v1.3.0-0}` failed to wait for ports files after 120000 MILLISECONDS

@dreamer-89
Member

Looking into it.

@dreamer-89
Member

dreamer-89 commented Dec 16, 2021

A simple plan to begin with could involve the following steps:

  1. Analyze.
    Analyze the last X failed Jenkins builds (X=20), identify the failed tests, and count how often each one fails. This will help prioritize the right failures.

  2. Reproduce.
    Failures identified above may need a deeper dive to find their root causes, which requires the ability to reproduce them locally. The goal of this step is a dev setup where failures can be replicated. Begin with the targeted test (fast); if that does not help, run the entire test suite (slow). Failures may not happen every time, so the tests need to be repeated multiple times, as @andrross did above. Replication may need a setup similar to the one used in Jenkins (worst case, our own Jenkins setup). Add logging wherever necessary to dig into the issue. Replication may uncover new bugs or issues in tests; these should be documented and fixed as well to improve overall test stability.

  3. Fix. How a test is fixed depends on the type of failure; failures can broadly be classified into the categories below. This step may run in sequence after step 2 or in parallel with it, depending on the failure identified in step 1.
    a. True transient failures.
    Failures that happen randomly and are out of our control, e.g. node connection timeouts caused by a bad node, networking issues, etc. The only fix in this case is to either increase the corresponding parameters (e.g. timeouts) or skip the test until a proper fix is identified.
    b. Setup related.
    A class of failures may be related to misconfiguration (backward compatibility tests, etc.); these are the easiest to identify and may need only minor configuration changes.
    c. Bug fix.
    The remaining failures are corner cases whose root causes are trickier to find and may need specific expertise. Depending on the area of failure, an engineer with the right expertise should be involved to debug the issue further.
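For step 1, one rough way to count how often each test fails is to pull the test name out of the REPRODUCE WITH lines that failing runs print, like the ones quoted earlier in this thread. This is a minimal sketch assuming the build logs have been saved locally (for example, console output downloaded from the failed Jenkins builds); tally_failures is a hypothetical name. Sorting before uniq -c matters: uniq only collapses adjacent duplicates.

```shell
# Hypothetical helper: count failures per test across one or more log files,
# most frequent first. Each failure is identified by the --tests "..." value
# on its "REPRODUCE WITH:" line (lines without --tests are passed through as-is).
tally_failures() {
  grep -h '^REPRODUCE WITH:' "$@" \
    | sed -E 's/.*--tests "([^"]+)".*/\1/' \
    | sort | uniq -c | sort -rn \
    | sed -E 's/^ *([0-9]+) /\1 /'   # trim the padding uniq -c adds
}

# Usage: tally_failures build-*.log
```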

@andrross
Member

  1. Analyze last X failed Jenkins builds (X=20)

I think it is a good idea to collect this data. It might be a bit hard to separate out failures caused by the change in the PR that triggered the build. Setting up a test machine to run checks continuously should yield similar data, with the benefit of running against a static code base.

  2. Reproduce

We've probably seen enough of these to know they aren't reproducible when re-run in isolation. We have open issues with quite a few errors, and none of them can be reproduced even when re-running the individual test many times. I think running the entire test suite is the way to go, but we probably don't need to worry about the Jenkins setup and can just run the ./gradlew check command directly.
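Repeated runs like these (a single targeted test or the whole ./gradlew check) can be wrapped in a small loop that keeps a separate log per iteration, so a rare failure is easy to find afterwards instead of being buried in one large file. This is a minimal sketch; repeat_run is a hypothetical helper, not an existing Gradle or OpenSearch tool.

```shell
# Hypothetical helper: run a command N times, keep one log per run,
# and report how many runs failed.
repeat_run() {
  local n="$1"; shift
  local logdir failures=0 i=1
  logdir=$(mktemp -d)
  while [ "$i" -le "$n" ]; do
    if ! "$@" > "$logdir/run-$i.log" 2>&1; then
      failures=$((failures + 1))
      echo "run $i failed, see $logdir/run-$i.log" >&2
    fi
    i=$((i + 1))
  done
  echo "$failures/$n runs failed"
}

# Usage: repeat_run 100 ./gradlew check
# A chained command needs a shell wrapper:
#   repeat_run 100 sh -c './gradlew clean && ./gradlew check'
```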

@saratvemulapalli
Member

Another one, coming from: #1766

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.discovery.StableMasterDisruptionIT.testStaleMasterNotHijackingMajority" -Dtests.seed=28AD28E1A3FF50C7 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=en-PH -Dtests.timezone=Etc/GMT+8 -Druntime.java=15

org.opensearch.discovery.StableMasterDisruptionIT > testStaleMasterNotHijackingMajority FAILED
    java.lang.AssertionError: node_t1: [Tuple [v1=node_t2, v2=null]]
        at __randomizedtesting.SeedInfo.seed([28AD28E1A3FF50C7:77AB65EE82248FCB]:0)
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.assertTrue(Assert.java:41)
        at org.opensearch.discovery.StableMasterDisruptionIT.lambda$testStaleMasterNotHijackingMajority$5(StableMasterDisruptionIT.java:253)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1048)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1021)
        at org.opensearch.discovery.StableMasterDisruptionIT.testStaleMasterNotHijackingMajority(StableMasterDisruptionIT.java:250)

@andrross
Member

I ran another experiment over the weekend, the theory being that maybe :qa:mixed-cluster:v1.2.2#mixedClusterTest was interfering with :server:internalClusterTest:

for i in $(seq 0 1000) ; do echo "Iteration: $i" && ./gradlew clean > /dev/null 2>&1 && ./gradlew :server:internalClusterTest :qa:mixed-cluster:v1.2.2#mixedClusterTest >> ../build-failure-tests/test-output-2021-12-17_2.txt 2>&1 ; done

But the results were 7 failures out of 330 (~2.1%), which is in line with the ~2% failure rate of the integ tests run in isolation. The failures were:

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover" -Dtests.seed=60436199814D8A58 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=sr-CS -Dtests.timezone=Etc/GMT+5 -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover" -Dtests.seed=8EC37C710AA42BCE -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=no-NO -Dtests.timezone=EET -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.ClusterHealthIT.testHealthOnMasterFailover" -Dtests.seed=B4175006736B7460 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-US -Dtests.timezone=Africa/Casablanca -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureIT.testShardIndexingPressureTrackingDuringBulkWrites" -Dtests.seed=6AF32DFBEB864CEE -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=zh-Hant-TW -Dtests.timezone=PRC -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureIT.testShardIndexingPressureTrackingDuringBulkWrites" -Dtests.seed=D921821394B6DBAA -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=en-GB -Dtests.timezone=America/Nipigon -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureSettingsIT.testShardIndexingPressureEnforcedEnabledDisabledSetting" -Dtests.seed=FA529FAA49915455 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ar-SY -Dtests.timezone=AET -Druntime.java=17
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.index.ShardIndexingPressureSettingsIT.testShardIndexingPressureEnforcedEnabledDisabledSetting" -Dtests.seed=FC550CFC70BBB318 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=zh-Hans-CN -Dtests.timezone=America/Knox_IN -Druntime.java=17

There are likely bugs within ClusterHealthIT, ShardIndexingPressureIT, and ShardIndexingPressureSettingsIT that cause rare failures. But it remains a mystery what is causing ./gradlew check to fail at a much higher rate in the CI workflow than in these experiments.
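One way to make "in line with" concrete is to put a confidence interval around each locally observed rate. The sketch below uses the 95% Wilson score interval (z = 1.96), which behaves better than the plain normal approximation when failure counts are this small. This is a standard technique swapped in for illustration, not something from the original comments, and wilson_interval is a hypothetical name.

```shell
# Hypothetical helper: 95% Wilson score interval for `failed` out of `total`
# runs, printed as "low high" in percent.
wilson_interval() {
  awk -v f="$1" -v n="$2" 'BEGIN {
    z = 1.96                      # two-sided 95% confidence
    p = f / n
    denom  = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    printf "%.1f %.1f\n", 100 * (center - margin), 100 * (center + margin)
  }'
}

# Usage: wilson_interval 3 152   # first experiment
#        wilson_interval 7 330   # second experiment
```

This gives roughly 0.7% to 5.6% for 3/152 and 1.0% to 4.3% for 7/330. The intervals overlap heavily, which is consistent with both local experiments sharing the same underlying failure rate.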

@dblock
Member

dblock commented Dec 22, 2021

#1725

I opened #1793 for this one specifically.

@nknize
Collaborator

nknize commented Jan 14, 2022

/cc @getsaurabh02

ShardIndexingPressureSettingsIT is a problem child. Can y'all investigate the recurring "Suite timeout exceeded (>= 1200000 msec)." and see if this is either a real issue with the Indexing Pressure implementation or simply a test cluster resourcing issue when run in the context of the entire check suite?

@andrross
Member

Suraj @dreamer-89 has been digging into the ShardIndexingPressureSettingsIT failures, tracked in #1843

@nknize
Collaborator

nknize commented Jan 14, 2022

Suraj @dreamer-89 has been digging into the ShardIndexingPressureSettingsIT failures, tracked in #1843

👍 Also note open PR #1592

@dblock dblock added the Meta Meta issue, not directly linked to a PR label Jan 14, 2022
@dblock dblock changed the title from "Put a plan for the flakey random test failures" to "[Meta] Fix random test failures" Jan 14, 2022
@dblock
Member

dblock commented Jan 14, 2022

I copied some links into the body of this issue... it's quite a list.

@penghuo
Contributor

penghuo commented Feb 18, 2022

Another one: #2176.

@dblock
Member

dblock commented Nov 10, 2022

Between gradle check builds 6688 and 6786 (100 builds), the following tests failed more than once (test: number of failures):

org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT/test {yaml=repository_s3/20_repository_permanent_credentials/Snapshot and Restore with repository-s3 using permanent credentials}: 12
org.opensearch.test.rest.ClientYamlTestSuiteIT/test {p0=search/30_limits/Regexp length limit}: 6
org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT/test {yaml=search/30_limits/Regexp length limit}: 6
org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests/testCoordinatingPrimaryThreadedUpdateToShardLimitsAndRejections: 5
org.opensearch.action.support.AutoCreateIndexTests/testParseFailed: 2
org.opensearch.cluster.metadata.IndexMetadataTests/testNumberOfReplicasIsNonNegative: 2
org.opensearch.cluster.metadata.IndexMetadataTests/testNumberOfShardsIsNotZero: 2
org.opensearch.cluster.metadata.IndexMetadataTests/testNumberOfShardsIsNotNegative: 2
org.opensearch.cluster.metadata.IndexMetadataTests/testNumberOfRoutingShards: 2
org.opensearch.cluster.routing.allocation.DiskThresholdSettingsTests/testInvalidHighDiskThreshold: 2
org.opensearch.cluster.allocation.AwarenessAllocationIT/testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness: 2
org.opensearch.common.settings.ScopedSettingsTests/testLoggingUpdates: 2
org.opensearch.cluster.coordination.NoClusterManagerBlockServiceTests/testRejectsInvalidSetting: 2
org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT/test {p0=search/320_disallow_queries/Test disallow expensive queries}: 2
org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT/test {p0=cluster.put_settings/10_basic/Test put and reset persistent settings}: 2
org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT/test {p0=search.aggregation/240_max_buckets/Max bucket}: 2
org.opensearch.action.support.AutoCreateIndexTests/testParseFailedMissingIndex: 2
org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT/test {yaml=repository_s3/20_repository_permanent_credentials/Delete a non existing snapshot}: 2
org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT/test {p0=cluster.put_settings/10_basic/Test put and reset transient settings}: 2
org.opensearch.search.MultiClusterSearchYamlTestSuiteIT/test {yaml=multi_cluster/15_connection_mode_configuration/Add transient remote cluster in sniff mode with invalid proxy settings}: 2
org.opensearch.search.MultiClusterSearchYamlTestSuiteIT/test {yaml=multi_cluster/15_connection_mode_configuration/Switch connection mode for configured cluster}: 2
org.opensearch.search.MultiClusterSearchYamlTestSuiteIT/test {yaml=multi_cluster/15_connection_mode_configuration/Add transient remote cluster in proxy mode with invalid sniff settings}: 2
org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT/testNodesRemovedAfterZoneDecommission_ClusterManagerNotInToBeDecommissionedZone: 2
org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT/test {p0=scroll/20_keep_alive/Max keep alive}: 2
org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT/test {yaml=repository_s3/20_repository_permanent_credentials/Register a repository with a non existing client}: 2
org.opensearch.cluster.coordination.ElectionSchedulerFactoryTests/testSettingsValidation: 2
org.opensearch.common.settings.ScopedSettingsTests/testValidate: 2
org.opensearch.repositories.gcs.GoogleCloudStorageBlobStoreRepositoryTests/testChunkSize: 2
org.opensearch.action.admin.cluster.settings.SettingsUpdaterTests/testUpdateOfValidationDependentSettings: 2
org.opensearch.cluster.routing.OperationRoutingTests/testWeightedOperationRoutingWeightUndefinedForOneZone: 2
org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT/test {yaml=repository_s3/20_repository_permanent_credentials/Try to create repository with broken endpoint override and named client}: 2
org.opensearch.action.admin.cluster.settings.SettingsUpdaterTests/testAllOrNothing: 2
org.opensearch.cluster.metadata.AutoExpandReplicasTests/testInvalidValues: 2

Another ~100 failed once.

@anasalkouz
Member Author

I am aiming to bring these flaky tests down to zero by Dec 30, 2022. If anyone wants to help with this effort, feel free to pick up one of the flaky test issues in this list.

@anasalkouz
Member Author

I have added the following 2 issues as proactive mechanisms to detect flaky test failures and prevent newly introduced flaky tests:
#5226
#5227

@Poojita-Raj
Contributor

Poojita-Raj commented Nov 15, 2022

Url Status Group Owner Reproducible Note
[BUG] DecommissionControllerTests.testTimesOut f... Closed decommission andrross
[BUG] AwarenessAttributeDecommissionIT.testNodes... assigned decommission pranikum no, 100 tests passing locally
[BUG] Failed Integ test testDecommissionStatusUp... assigned decommission pranikum @imRishN opened and merged a fix (#4822), but it doesn't resolve the issue, since the failure has been seen since then
[Meta] Fix random test failures untriaged meta issue
[BUG] testCoordinatingPrimaryThreadedUpdateToSha... pending shardIndexing
[BUG] ShardIndexingPressureIT.testShardIndexingP... pending shardIndexing
[BUG] org.opensearch.action.bulk.BulkIntegration... pending yes, failed 2/100 tests
[BUG] org.opensearch.persistent.PersistentTasksE... pending no, 200 tests passing locally
[BUG] Failures with org.opensearch.smoketest.Smo... pending yes
[BUG] DedicatedClusterSnapshotRestoreIT.testInde... assigned xuezhou yes, failed 3/100 tests Xue wrote original test
[BUG] Deterministic failure of AggregationsTests... pending yes
[BUG] flaky test index/80_geo_point/Single point... Closed MixedClusterClientYamlTestSuiteIT
[BUG] Fix flaky test org.opensearch.index.ShardI... assigned shardIndexing rrpasham yes
[CI] flaky test failure - o.o.indices.stats.Inde... pending yes, failed 3/100 tests off-by-one error
[CI] Test Failure org.opensearch.cluster.allocat... pending @imRishN worked on the original PR and merged a fix (#3646); still seeing failures after that
[BUG] org.opensearch.gateway.QuorumGatewayIT > t... pending no, passing 100 tests
[BUG] org.opensearch.repositories.s3.RepositoryS... untriaged RepositoryS3ClientYamlTestSuiteIT
[BUG] Intermittent test failure - Snapshot and R... untriaged RepositoryS3ClientYamlTestSuiteIT
[BUG] OperationRoutingTests.testWeightedOperatio... pending yes There's currently one PR out for a fix (#4980); not sure if it resolves the issue
[BUG] org.opensearch.search.aggregations.metrics... pending yes There's one PR out for this (#4850)
[BUG] Fix flaky org.opensearch.search.PitMultiNo... pending PitMultiNode yes, failed 1/100
[CI] o.o.aliases.IndexAliasesIT.testSameAlias fa... pending AcknowledgedResponse failed no
[CI] o.o.gateway.RecoveryFromGatewayIT.testReuse... untriaged No occurrences since April, can it be closed out?
[BUG] Fix new flaky test org.opensearch.search.D... pending PitMultiNode yes, failed 1/100 times
[CI] o.o.cluster.remote.test.RemoteClustersIT.te... untriaged No occurrences since June, can it be closed out?
[TEST] Failures in IndexingMemoryControllerTests... untriaged no Not seen since Jan, can it be closed out?
[BUG] org.opensearch.discovery.DiscoveryDisrupti... untriaged Only 1 occurrence in May
[BUG] org.opensearch.action.admin.cluster.tasks.... untriaged timeout issue
[BUG] :test:logger-usage:test failure flakey tes... untriaged
[BUG] o.o.search.SearchCancellationIT.testCancel... pending SearchCancellationIT no, passed 100 tests
[BUG] node drop on o.o.cluster.routing.allocatio... pending no, passed 100 tests
[CI] o.o.blocks.SimpleBlocksIT.testAddBlockWhile... pending no, passed 100 tests also documented in #2442
[CI] o.o.versioning.ConcurrentSeqNoVersioningIT.... pending no, passed 100 tests
[CI] flaky test faiure - o.o.indices.recovery.In... pending no, passed 100 tests
[CI] o.o.discovery.SnapshotDisruptionIT.testDisr... pending SnapshotDisruptionIT no, passed 100 times
[BUG] testCancellationDuringQueryPhaseUsingReque... pending SearchCancellationIT no, passed 150 times
[BUG] cluster.routing.PrimaryAllocationIT.testPr... pending no, passed 100 times
[BUG] org.opensearch.search.SearchCancellationIT... pending SearchCancellationIT no, passed 100 times
[BUG] StableMasterDisruptionIT.testStaleMasterNo... pending no, passed 100 times
[CI] flaky test faiure - o.o.upgrades.IndexingIT... untriaged
[BUG] Flaky test failure - v1.2.5#mixedClusterTe... untriaged MixedClusterClientYamlTestSuiteIT
[BUG] Master bootstrap takes time causing interm... pending no, passed 100 tests renamed test
[BUG] ClusterRerouteIT.testDelayWithALargeAmount... untriaged AcknowledgedResponse failed no, passed 100 times
[BUG] Flaky test failure - org.opensearch.blocks... closed Same as #33 -#2472
[BUG] org.opensearch.snapshots.ConcurrentSnapsho... pending no, passed 100 times
[BUG] testRestartIndexCreationAfterFullClusterRe... pending no, passed 100 times
[BUG] org.opensearch.cluster.routing.allocation.... untriaged
[BUG] org.opensearch.discovery.SnapshotDisruptio... untriaged SnapshotDisruptionIT
[CI] Test failure in "org.opensearch.cluster.coo... untriaged
[BUG] Upgrade cli test failure while detecting e... untriaged
[CI] oldClusterTest fails intermittently untriaged
[BUG] Netty Transport test failing with large re... pending No
[BUG] InstallPluginCommandTests.testOfficialPlug... pending No
[BUG] :distribution:packages:rpm:checkExtraction... pending No
[BUG] Transport NIO test intermittently failing ... pending No
[BUG] :rest-api-spec:yamlRestTest org.opensearch... pending No
[BUG] MinimumMasterNodesIT.testThreeNodesNoMaste... pending No Test doesn't exist? Renamed to MinimumClusterManagerNodesIT
[BUG] SharedClusterSnapshotRestoreIT.testSnapsho... pending No

@andrross
Member

andrross commented Dec 3, 2022

I wrote a script to crawl the Jenkins output for unstable builds: https://gist.github.com/andrross/ee07a8a05beb63f1173bcb98523918b9

Below are the results for the last 1000 builds. There is a long tail of tests with a few failures, but the top 4 failures have issues already (#5219, #4212, #5157, #3603).

41 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Snapshot and Restore with repository-s3 using permanent credentials} (6561,6561,6561,6577,6587,6591,6591,6598,6645,6709,6711,6711,6717,6750,6751,6766,6778,6778,6779,6779,6779,6782,6879,6879,6880,6880,6952,6953,6953,7074,7074,7074,7080,7082,7082,7177,7200,7201,7224,7277,7310)
23 org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testReplicaThreadedUpdateToShardLimitsAndRejections (6585,6681,6962,7046,7090,7095,7149,7149,7149,7158,7188,7206,7206,7253,7253,7253,7274,7274,7274,7327,7463,7483,7492)
22 org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testCoordinatingPrimaryThreadedUpdateToShardLimitsAndRejections (6607,6616,6628,6700,6700,6720,6759,6759,6762,6828,6887,6971,6971,6975,7027,7112,7115,7168,7168,7202,7315,7315)
17 org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness (6562,6601,6627,6717,6741,6908,6921,6925,7036,7047,7112,7149,7422,7447,7495,7517,7555)
11 org.opensearch.clustermanager.ClusterManagerTaskThrottlingIT.testTimeoutWhileThrottling (6556,6593,6594,6594,6598,6599,6601,6602,6602,6602,6742)
9 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testIndexDeletionDuringSnapshotCreationInQueue (6790,6828,6965,7220,7256,7315,7361,7447,7543)
8 org.opensearch.cluster.service.MasterServiceTests.classMethod (6894,6894,6894,6894,7074,7074,7177,7177)
8 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Try to create repository with broken endpoint override and named client} (6589,6709,6952,6952,6953,6953,7200,7277)
7 org.opensearch.index.IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex (6769,7062,7077,7207,7453,7464,7517)
7 org.opensearch.indices.stats.IndexStatsIT.testFilterCacheStats (6585,7154,7183,7255,7292,7300,7551)
4 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testNodesRemovedAfterZoneDecommission_ClusterManagerNotInToBeDecommissionedZone (6599,6602,6731,6771)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a repository with a non existing bucket} (6952,6953,7077,7320)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a repository with a non existing client} (6711,6711,6711,6952)
4 org.opensearch.action.bulk.BulkIntegrationIT.testDeleteIndexWhileIndexing (6624,6635,6723,6979)
4 org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=pit/10_basic/Delete all} (7185,7212,7231,7342)
4 org.opensearch.cluster.service.MasterServiceTests.testThrottlingForMultipleTaskTypes (6894,6894,7074,7177)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a read-only repository with a non existing client} (6591,6591,6952,7201)
4 org.opensearch.clustermanager.ClusterManagerTaskThrottlingIT.testThrottlingForSingleNode (6593,6615,6664,6682)
3 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/teardown} (6766,6953,6956)
3 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Restore a non existing snapshot} (6782,6952,7309)
3 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testNodesRemovedAfterZoneDecommission_ClusterManagerInToBeDecommissionedZone (6606,6709,6895)
3 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testNRTReplicaPromotedAsPrimary (6894,7091,7144)
3 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testInvariantsAndLogsOnDecommissionedNodes (6738,6792,6825)
2 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testShrinkIndexPrimaryTerm (6685,7406)
2 org.opensearch.gateway.QuorumGatewayIT.testQuorumRecovery (6562,7201)
2 org.opensearch.action.bulk.BulkIntegrationIT.testBulkWithWriteIndexAndRouting (6723,6979)
2 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testCreateShrinkIndexToN (6685,7406)
2 org.opensearch.action.bulk.BulkIntegrationIT.testBulkWithGlobalDefaults (6723,6979)
2 org.opensearch.action.bulk.BulkIntegrationIT.testExternallySetAutoGeneratedTimestamp (6723,6979)
2 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a read-only repository with a non existing bucket} (6766,7076)
2 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testCreateShrinkIndex (6685,7406)
2 org.opensearch.http.SearchRestCancellationIT.testAutomaticCancellationDuringFetchPhase (7167,7463)
2 org.opensearch.action.admin.cluster.node.tasks.ResourceAwareTasksTests.testTaskResourceTrackingDuringTaskCancellation (6893,7166)
2 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testCreateShrinkIndexFails (6685,7406)
1 org.opensearch.action.admin.indices.create.CreateIndexIT.classMethod (7464)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testSnapshotWithLargeSegmentFiles (6589)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testDeleteBlobs (6589)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testList (6589)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testMultipleSnapshotAndRollback (6589)
1 org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnHungIOBeyondHealthyTimeout (6606)
1 org.opensearch.action.admin.cluster.tasks.PendingTasksBlocksIT.testPendingTasksWithClusterNotRecoveredBlock (6653)
1 org.opensearch.index.ShardIndexingPressureIT.testShardIndexingPressureTrackingDuringBulkWrites (6667)
1 org.opensearch.action.bulk.BulkIntegrationIT.testBulkIndexCreatesMapping (6723)
1 org.opensearch.cluster.decommission.DecommissionControllerTests.testTimesOut (6747)
1 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Delete a non existing snapshot} (6758)
1 org.opensearch.persistent.PersistentTasksExecutorFullRestartIT.testFullClusterRestart (6764)
1 org.opensearch.client.PitIT.testDeleteAllAndListAllPits (6781)
1 org.opensearch.client.PitIT.testCreateAndDeletePit (6781)
1 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testReplicaReceivesGenIncrease (6824)
1 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Get a non existing snapshot} (6953)
1 org.opensearch.client.ReindexIT.testReindexTask (6962)
1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=search.aggregation/20_terms/string profiler via global ordinals} (6970)
1 org.opensearch.cluster.routing.allocation.decider.ConcurrentRecoveriesAllocationDeciderTests.testClusterConcurrentRecoveries (7022)
1 org.opensearch.search.aggregations.metrics.TDigestPercentilesIT.testMultiValuedFieldWithValueScriptReverse (7208)
1 org.opensearch.cluster.ClusterHealthIT.testHealthOnClusterManagerFailover (7272)
1 org.opensearch.search.SearchCancellationIT.testCancellationDuringFetchPhaseUsingRequestParameter (7318)
1 org.opensearch.indices.state.CloseWhileRelocatingShardsIT.testCloseWhileRelocatingShards (7345)
1 org.opensearch.action.admin.indices.create.SplitIndexIT.testCreateSplitIndex (7415)
1 org.opensearch.action.admin.indices.create.SplitIndexIT.testCreateSplitIndexToN (7415)
1 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testReadBlobWithRetries (7422)
1 org.opensearch.action.admin.indices.create.CreateIndexIT.testCreateAndDeleteIndexConcurrently (7464)
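The summary format above (failure count, test name, then the builds in which it failed) can be produced from a flat list of failures. The crawling itself is done by the script in the gist linked above, whose implementation isn't reproduced here; the sketch below covers only a final aggregation step and assumes a hypothetical input with one "<build-number> <test-name>" pair per line, in ascending build order. summarize_failures is a hypothetical name.

```shell
# Hypothetical helper: aggregate "<build> <test-name>" lines into
# "<count> <test-name> (<build>,<build>,...)", most frequent first.
# Test names may contain spaces; the build number is the first field.
summarize_failures() {
  awk '{
    build = $1
    sub(/^[^ ]+ +/, "")            # the rest of the line is the test name
    if (count[$0]++) builds[$0] = builds[$0] "," build
    else builds[$0] = build
  }
  END {
    for (t in count) printf "%d %s (%s)\n", count[t], t, builds[t]
  }' "$@" | sort -k1,1nr
}

# Usage: summarize_failures failures.txt
```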

@dblock
Member

dblock commented Dec 6, 2022

@andrross I swear I wrote very similar code to produce #1715 (comment), but where did I put it? :) Thank you!

@dblock
Member

dblock commented Dec 6, 2022

Found it! https://github.com/dblock/gradle-checks

@Rishikesh1159
Member

Rishikesh1159 commented Dec 6, 2022

Thanks @andrross for the script. I ran it to get all flaky tests from the past 2 months (Sep 30, 2022 to Dec 5, 2022). Here is the list of 104 flaky tests found:

Will crawl builds from 3600 to 7680
------------------
130 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Snapshot and Restore with repository-s3 using permanent credentials} (3619,3619,3695,3719,3719,3720,3720,3743,3744,3744,3902,3902,4173,4279,4382,4602,4602,4602,4751,4751,4752,4752,4793,4793,4793,4946,4946,4946,5122,5123,5123,5298,5298,5341,5341,5354,5354,5396,5396,5396,5399,5489,5533,5533,5533,5556,5557,5557,5557,5572,5954,5955,5955,6060,6061,6061,6061,6132,6132,6133,6151,6155,6156,6172,6188,6218,6218,6221,6221,6233,6234,6234,6254,6254,6389,6389,6391,6436,6469,6469,6470,6470,6475,6476,6476,6476,6547,6547,6548,6561,6561,6561,6577,6587,6591,6591,6598,6645,6709,6711,6711,6717,6750,6751,6766,6778,6778,6779,6779,6779,6782,6879,6879,6880,6880,6952,6953,6953,7074,7074,7074,7080,7082,7082,7177,7200,7201,7224,7277,7310)
38 org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness (3666,3679,4180,4207,4679,4691,4866,4953,5343,5395,5396,5437,5577,5733,5897,5923,6096,6175,6205,6562,6601,6627,6717,6741,6908,6921,6925,7036,7047,7112,7149,7422,7447,7495,7517,7555,7563,7612)
38 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testIndexDeletionDuringSnapshotCreationInQueue (3858,3914,3961,4292,4293,4332,4382,4514,4539,4603,4858,4897,5426,5467,5489,5525,5530,5552,5788,5973,6081,6130,6132,6199,6234,6343,6376,6546,6790,6828,6965,7220,7256,7315,7361,7447,7543,7644)
37 org.opensearch.clustermanager.ClusterManagerTaskThrottlingIT.testTimeoutWhileThrottling (6028,6199,6350,6350,6351,6359,6359,6365,6365,6365,6371,6399,6399,6411,6413,6413,6415,6436,6436,6436,6458,6458,6468,6547,6547,6554,6556,6593,6594,6594,6598,6599,6601,6602,6602,6602,6742)
35 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Try to create repository with broken endpoint override and named client} (3619,3719,4382,4638,4792,5122,5294,5354,5395,5531,5556,5878,6060,6128,6133,6151,6152,6152,6156,6156,6218,6254,6390,6436,6436,6475,6548,6589,6709,6952,6952,6953,6953,7200,7277)
29 org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testCoordinatingPrimaryThreadedUpdateToShardLimitsAndRejections (6474,6481,6607,6616,6628,6700,6700,6720,6759,6759,6762,6828,6887,6971,6971,6975,7027,7112,7115,7168,7168,7202,7315,7315,7596,7596,7611,7617,7617)
25 org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testReplicaThreadedUpdateToShardLimitsAndRejections (6585,6681,6962,7046,7090,7095,7149,7149,7149,7158,7188,7206,7206,7253,7253,7253,7274,7274,7274,7327,7463,7483,7492,7651,7651)
17 org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testRestoreSnapshotAllocationDoesNotExceedWatermark (3635,3641,3798,3920,3928,4137,4189,4240,4279,4447,4511,4536,4787,4793,4818,4818,5134)
14 org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=pit/10_basic/Delete all} (3658,3695,4453,4599,5142,5347,5740,5858,5894,6183,7185,7212,7231,7342)
14 org.opensearch.indices.stats.IndexStatsIT.testFilterCacheStats (4100,4514,5829,6238,6332,6336,6337,6585,7154,7183,7255,7292,7300,7551)
12 org.opensearch.index.fielddata.SortedSetDVStringFieldDataTests.testSortMissingLast (3964,4234,4268,4272,4446,4826,4879,4891,4975,4975,5114,5121)
12 org.opensearch.cluster.service.MasterServiceTests.classMethod (6894,6894,6894,6894,7074,7074,7177,7177,7634,7634,7634,7634)
9 org.opensearch.action.bulk.BulkIntegrationIT.testDeleteIndexWhileIndexing (3607,3757,3789,3839,4952,6624,6635,6723,6979)
8 org.opensearch.action.admin.indices.create.CreateIndexIT.testCreateAndDeleteIndexConcurrently (3608,3957,4100,4200,5853,6126,6220,7464)
8 org.opensearch.action.admin.indices.create.CreateIndexIT.classMethod (3608,3957,4100,4200,5853,6126,6220,7464)
8 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a repository with a non existing bucket} (4638,5556,6151,6156,6952,6953,7077,7320)
8 org.opensearch.index.IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex (6172,6769,7062,7077,7207,7453,7464,7517)
7 org.opensearch.persistent.PersistentTasksExecutorFullRestartIT.testFullClusterRestart (3616,4279,4700,4802,5396,6554,6764)
7 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a repository with a non existing client} (4450,6156,6390,6711,6711,6711,6952)
7 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a read-only repository with a non existing client} (5341,5341,6233,6591,6591,6952,7201)
6 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testNRTReplicaPromotedAsPrimary (3700,3852,6371,6894,7091,7144)
6 org.opensearch.client.PitIT.testDeleteAllAndListAllPits (3715,4173,4293,5557,6259,6781)
6 org.opensearch.index.fielddata.SortedSetDVStringFieldDataTests.testSortMissingLastReverse (4271,4329,4533,5011,5114,5114)
6 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testDecommissionStatusUpdatePublishedToAllNodes (5165,5379,5530,5612,5642,5677)
6 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testNodesRemovedAfterZoneDecommission_ClusterManagerNotInToBeDecommissionedZone (6356,6359,6599,6602,6731,6771)
6 org.opensearch.cluster.service.MasterServiceTests.testThrottlingForMultipleTaskTypes (6894,6894,7074,7177,7634,7634)
5 org.opensearch.upgrades.RecoveryIT.testRelocationWithConcurrentIndexing (4124,4131,4131,4142,4142)
5 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Register a read-only repository with a non existing bucket} (4450,4792,6389,6766,7076)
5 org.opensearch.clustermanager.ClusterManagerTaskThrottlingIT.testThrottlingForSingleNode (6463,6593,6615,6664,6682)
4 org.opensearch.action.bulk.BulkIntegrationIT.testBulkIndexCreatesMapping (3607,3789,4952,6723)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Delete a non existing snapshot} (3619,4042,4281,6758)
4 org.opensearch.cluster.decommission.DecommissionControllerTests.testTimesOut (3651,3805,6468,6747)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Get a non existing snapshot} (3695,6390,6476,6953)
4 org.opensearch.search.PitMultiNodeTests.testCreatePitWhileNodeDropWithAllowPartialCreationFalse (3755,4539,5576,6073)
4 org.opensearch.action.bulk.BulkIntegrationIT.testBulkWithGlobalDefaults (3789,4952,6723,6979)
4 org.opensearch.action.bulk.BulkIntegrationIT.testExternallySetAutoGeneratedTimestamp (3789,4952,6723,6979)
4 org.opensearch.action.bulk.BulkIntegrationIT.testBulkWithWriteIndexAndRouting (3789,4952,6723,6979)
4 org.opensearch.index.ShardIndexingPressureIT.testShardIndexingPressureTrackingDuringBulkWrites (3932,4946,6391,6667)
4 org.opensearch.index.fielddata.SortedSetDVStringFieldDataTests.testSortMissingFirstReverse (4279,4294,4420,4714)
4 org.opensearch.action.admin.cluster.node.tasks.ResourceAwareTasksTests.testTaskResourceTrackingDuringTaskCancellation (4320,5358,6893,7166)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/Restore a non existing snapshot} (4751,6782,6952,7309)
4 org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT.test {yaml=repository_s3/20_repository_permanent_credentials/teardown} (5363,6766,6953,6956)
4 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testInvariantsAndLogsOnDecommissionedNodes (5908,6738,6792,6825)
3 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testSegmentReplication_Index_Update_Delete (3739,4867,6401)
3 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testReplicaRestarts (4420,4889,6401)
3 org.opensearch.indices.state.CloseWhileRelocatingShardsIT.testCloseWhileRelocatingShards (4894,6393,7345)
3 org.opensearch.index.shard.IndexShardIT.testIndexCanChangeCustomDataPath (4953,4953,4953)
3 org.opensearch.gateway.QuorumGatewayIT.testQuorumRecovery (5165,6562,7201)
3 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testCreateShrinkIndex (6241,6685,7406)
3 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testCreateShrinkIndexToN (6241,6685,7406)
3 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testNodesRemovedAfterZoneDecommission_ClusterManagerInToBeDecommissionedZone (6606,6709,6895)
2 org.opensearch.http.nio.NioHttpServerTransportTests.testLargeCompressedResponse (3618,7628)
2 org.opensearch.monitor.fs.FsHealthServiceTests.testFailsHealthOnHungIOBeyondHealthyTimeout (3648,6606)
2 org.opensearch.client.BulkProcessorRetryIT.testBulkRejectionLoadWithBackoff (3802,3821)
2 org.opensearch.search.basic.SearchWithRandomIOExceptionsIT.testRandomDirectoryIOExceptions (3814,5399)
2 org.opensearch.search.basic.SearchWithRandomIOExceptionsIT.classMethod (3814,5399)
2 org.opensearch.action.admin.indices.create.SplitIndexIT.testCreateSplitIndex (4178,7415)
2 org.opensearch.index.fielddata.SortedSetDVStringFieldDataTests.testSortMissingFirst (4925,4975)
2 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=search.aggregation/20_terms/string profiler via global ordinals} (5302,6970)
2 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testWriteBlobWithRetries (5361,5600)
2 org.opensearch.client.ReindexIT.testReindexTask (6007,6962)
2 org.opensearch.action.admin.cluster.tasks.PendingTasksBlocksIT.testPendingTasksWithClusterNotRecoveredBlock (6170,6653)
2 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testCreateShrinkIndexFails (6685,7406)
2 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testShrinkIndexPrimaryTerm (6685,7406)
2 org.opensearch.http.SearchRestCancellationIT.testAutomaticCancellationDuringFetchPhase (7167,7463)
1 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testReadRangeBlobWithRetries (3778)
1 org.opensearch.client.BulkProcessorRetryIT.testBulkRejectionLoadWithoutBackoff (3821)
1 org.opensearch.gateway.RecoveryFromGatewayIT.testReuseInFileBasedPeerRecovery (3837)
1 org.opensearch.action.admin.indices.create.SplitIndexIT.testSplitFromOneToN (4178)
1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=indices.split/30_copy_settings/Copy settings during split index} (4236)
1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=indices.shrink/30_copy_settings/Copy settings during shrink index} (4236)
1 org.opensearch.index.shard.SegmentReplicationIndexShardTests.classMethod (4420)
1 org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testCoordinatingPrimaryThreadedUpdateToShardLimits (4758)
1 org.opensearch.http.SearchRestCancellationIT.testAutomaticCancellationMultiSearchDuringQueryPhase (4926)
1 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testReplicaReceivesLowerGeneration (5234)
1 org.opensearch.cluster.routing.allocation.RemoteShardsMoveShardsTests.testIndexLevelExclusions (5484)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testSnapshotWithLargeSegmentFiles (5620)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testIndicesDeletedFromRepository (5620)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testDeleteBlobs (5620)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testWriteRead (5620)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testRequestStats (5620)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testSnapshotAndRestore (5620)
1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testMultipleSnapshotAndRollback (5620)
1 org.opensearch.search.SearchCancellationIT.testCancellationDuringQueryPhaseUsingRequestParameter (5760)
1 org.opensearch.discovery.StableClusterManagerDisruptionIT.testStaleClusterManagerNotHijackingMajority (5915)
1 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testShrinkCommitsMergeOnIdle (6241)
1 org.opensearch.action.admin.indices.create.ShrinkIndexIT.testShrinkThenSplitWithFailedNode (6241)
1 org.opensearch.gradle.BuildPluginIT.testInsecureMavenRepository (6406)
1 org.opensearch.http.SearchRestCancellationIT.testAutomaticCancellationDuringQueryPhase (6430)
1 org.opensearch.search.aggregations.bucket.terms.StringTermsIT.classMethod (6465)
1 org.opensearch.upgrade.DetectEsInstallationTaskTests.testTaskExecution (6537)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testSnapshotWithLargeSegmentFiles (6589)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testDeleteBlobs (6589)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testList (6589)
1 org.opensearch.repositories.s3.S3BlobStoreRepositoryTests.testMultipleSnapshotAndRollback (6589)
1 org.opensearch.client.PitIT.testCreateAndDeletePit (6781)
1 org.opensearch.index.shard.SegmentReplicationIndexShardTests.testReplicaReceivesGenIncrease (6824)
1 org.opensearch.cluster.routing.allocation.decider.ConcurrentRecoveriesAllocationDeciderTests.testClusterConcurrentRecoveries (7022)
1 org.opensearch.search.aggregations.metrics.TDigestPercentilesIT.testMultiValuedFieldWithValueScriptReverse (7208)
1 org.opensearch.cluster.ClusterHealthIT.testHealthOnClusterManagerFailover (7272)
1 org.opensearch.search.SearchCancellationIT.testCancellationDuringFetchPhaseUsingRequestParameter (7318)
1 org.opensearch.action.admin.indices.create.SplitIndexIT.testCreateSplitIndexToN (7415)
1 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testReadBlobWithRetries (7422)
1 org.opensearch.test.rest.ClientYamlTestSuiteIT.test {p0=search.aggregation/20_terms/string profiler via global ordinals} (7668)
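The scripts used to build these lists are not reproduced in this thread. As a rough, hypothetical sketch of only the final aggregation step, assume the failures have already been collected one per line as `<build-number><TAB><test-name>` (both the format and the `summarize_flaky` name are assumptions, not part of the original scripts). Standard `sort` and `awk` can then produce a list in the same "count test (builds)" shape:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: summarize raw failure records into "count test (builds)" lines.
# Assumed input format: one failure per line, "<build-number><TAB><test-name>".
summarize_flaky() {
  # Group records by test name, with each test's builds in ascending numeric order.
  sort -t$'\t' -k2,2 -k1,1n "$1" |
    awk -F'\t' '
      {
        # The first failure starts the test list; later ones are appended.
        # Repeated build numbers are kept, since a test can fail more than once per build.
        if (count[$2]++) builds[$2] = builds[$2] "," $1
        else builds[$2] = $1
      }
      END { for (t in count) printf "%d\t%s (%s)\n", count[t], t, builds[t] }
    ' |
    # Most frequent failures first; ties broken alphabetically by test name.
    sort -t$'\t' -k1,1nr -k2,2 |
    tr '\t' ' '
}
```

Run it as `summarize_flaky failures.tsv`. Ties in the count are broken alphabetically here, which may not match the ordering of the list above.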

@dbwiddis
Member

dbwiddis commented Aug 8, 2023

How flaky is acceptable? I closed #6739 after calculating the expected failure rate, from a collision between random alpha strings of length 5, at 1 in 19,164. It failed once, on run 12,467. It'll probably fail again in a few years. Is that OK?
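The comment does not state the alphabet size (26 for lowercase only, 52 for mixed case) or how many strings each run draws, so the following is only a minimal sketch of the standard birthday-problem calculation that such an estimate rests on. Those values are inputs here, and the `collision_probability` helper is hypothetical:

```shell
#!/usr/bin/env bash
# Birthday-problem sketch: probability that at least two of n random strings collide,
# when each string is drawn uniformly from alphabet_size^length possible values.
collision_probability() {
  local alphabet_size=$1 length=$2 n=$3
  awk -v a="$alphabet_size" -v len="$length" -v n="$n" 'BEGIN {
    space = a ^ len        # number of distinct possible strings
    p_distinct = 1         # probability that all n strings are different
    for (i = 1; i < n; i++) p_distinct *= (space - i) / space
    printf "%.6g\n", 1 - p_distinct
  }'
}
```

For example, `collision_probability 26 5 2` is the chance that two lowercase 5-letter strings match, 1 in 26^5 = 11,881,376. Separately, at a per-run failure rate of 1 in 19,164, the probability of at least one failure within k runs is 1 - (1 - 1/19,164)^k, which passes 50% at roughly 13,300 runs, so a first failure on run 12,467 is consistent with the estimate.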

@anasalkouz
Member Author

Closing this campaign.
