
Fix flaky integ tests #487

Closed
wants to merge 2 commits

Conversation

tanqiuliu

Description

  • Fixed the flaky integration tests in [BUG] Flaky integ test testBooleanQuery_withNeuralAndBM25Queries, testBasicQuery #384. The test failures were due to the model_state not yet having changed to UNDEPLOYED after the undeploy invocation. Added a poller that waits for the state change before moving forward (sketched below).
  • Updated BaseNeuralSearchIT.findDeployedModels() to ensure model_state = DEPLOYED. This avoids the following exception when performing a neural search in integ tests.
org.opensearch.client.ResponseException: method [POST], host [http://[::1]:56962], URI [/test-neural-basic-index/_search?size=1], status line [HTTP/1.1 400 Bad Request]
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Model not ready yet. Please run this first: POST /_plugins/_ml/models/7oMmqIsBDNjwFOltGQ_N/_deploy"}],"type":"illegal_argument_exception","reason":"Model not ready yet. Please run this first: POST /_plugins/_ml/models/7oMmqIsBDNjwFOltGQ_N/_deploy"},"status":400}

Issues Resolved

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Tanqiu Liu <liutanqiu@gmail.com>
@navneet1v
Collaborator

navneet1v commented Nov 7, 2023

@tanqiuliu can you add an entry in the changelog?

Also, can you run this command in neural-search and paste the output, as the tests failed with this seed.

gradlew ':integTest' --tests "org.opensearch.neuralsearch.query.NeuralQueryIT.testBasicQuery" -Dtests.seed=DFDA155AC6A13FB4 -Dtests.security.manager=false -Dtests.locale=ga -Dtests.timezone=America/La_Paz

navneet1v added the bug (Something isn't working), Maintenance (Add support for new versions of OpenSearch/Dashboards from upstream), and integ-test-failure (Integration test failures) labels on Nov 7, 2023

codecov bot commented Nov 7, 2023

Codecov Report

Merging #487 (371a9dd) into main (cda2f82) will not change coverage.
Report is 1 commit behind head on main.
The diff coverage is n/a.

@@            Coverage Diff            @@
##               main     #487   +/-   ##
=========================================
  Coverage     84.37%   84.37%           
  Complexity      498      498           
=========================================
  Files            40       40           
  Lines          1491     1491           
  Branches        228      228           
=========================================
  Hits           1258     1258           
  Misses          133      133           
  Partials        100      100           

@martin-gaievski
Member

Some workflows in CI are failing. Is this the same as the initial problem, or is it related to or caused by this change?

https://github.com/opensearch-project/neural-search/actions/runs/6781161374/job/18449994428?pr=487#step:4:80

REPRODUCE WITH: gradlew ':integTest' --tests "org.opensearch.neuralsearch.query.NeuralSparseQueryIT.testRescoreQuery" -Dtests.seed=1F7B3ED0649C5765 -Dtests.security.manager=false -Dtests.locale=et -Dtests.timezone=Europe/Sofia -Druntime.java=17
org.opensearch.neuralsearch.query.NeuralSparseQueryIT > testRescoreQuery FAILED
    java.lang.AssertionError: expected:<1> but was:<0>
        at __randomizedtesting.SeedInfo.seed([1F7B3ED0649C5765:34DD958B2EADE4C7]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.junit.Assert.assertEquals(Assert.java:633)
        at org.opensearch.neuralsearch.common.BaseNeuralSearchIT.getDeployedModelId(BaseNeuralSearchIT.java:781)
        at org.opensearch.neuralsearch.query.NeuralSparseQueryIT.testRescoreQuery(NeuralSparseQueryIT.java:135)

@tanqiuliu
Author

@navneet1v I executed that command multiple times trying to reproduce the issue in #384, but it always succeeds. It is possible that the 3-second wait time in the previous implementation is sometimes not long enough.

tanqiuliu@Tanqius-MacBook-Pro neural-search % ./gradlew ':integTest' --tests "org.opensearch.neuralsearch.query.NeuralQueryIT.testBasicQuery" -Dtests.seed=DFDA155AC6A13FB4 -Dtests.security.manager=false -Dtests.locale=ga -Dtests.timezone=America/La_Paz
Starting a Gradle Daemon (subsequent builds will be faster)
=======================================
OpenSearch Build Hamster says Hello!
  Gradle Version        : 8.1.1
  OS Info               : Mac OS X 14.1 (x86_64)
  JDK Version           : 17 (Amazon Corretto JDK)
  JAVA_HOME             : /Library/Java/JavaVirtualMachines/amazon-corretto-17.jdk/Contents/Home
  Random Testing Seed   : DFDA155AC6A13FB4
  In FIPS 140 mode      : false
=======================================

BUILD SUCCESSFUL in 1m 23s
12 actionable tasks: 7 executed, 5 up-to-date

@navneet1v
Collaborator

@navneet1v I executed that command multiple times trying to reproduce the issue in #384, but it always succeeds. It is possible that the 3-second wait time in the previous implementation is sometimes not long enough.


Got it. As part of reproducing the bug, running the same command that failed the tests is the best way to reproduce the issue. But I am glad that it's not a problem with the src, and that it's mainly around the setup.

Apart from this, can you add an entry for this in the CHANGELOG.md file so that the GH workflow can succeed?

@@ -623,6 +624,16 @@ protected void deleteModel(String modelId) {
);
}

@SneakyThrows
Collaborator

You can remove @SneakyThrows from here and add the exception to the method signature.
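
A sketch of the suggested change (bodies elided; only the signature differs):

// Before: Lombok hides the checked exception thrown by Thread.sleep.
@SneakyThrows
protected void pollForModelState(String modelId, MLModelState expectedModelState, int intervalMs, int maxAttempts) {
    // ... loop as in the diff above ...
}

// After: declare the exception so callers can see that the poll may be interrupted.
protected void pollForModelState(String modelId, MLModelState expectedModelState, int intervalMs, int maxAttempts)
    throws InterruptedException {
    // ... loop as in the diff above ...
}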

Author

Done.

Comment on lines 628 to 634
protected void pollForModelState(String modelId, MLModelState expectedModelState, int intervalMs, int maxAttempts) {
    for (int i = 0; i < maxAttempts; i++) {
        Thread.sleep(intervalMs);
        if (expectedModelState.equals(getModelState(modelId))) {
            return;
        }
    }
Collaborator

To improve visibility into why tests are failing, I would say that at the end of the for loop, if the model is not in the expected state, we should fail the test with a proper failure message.
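
Roughly what that could look like, as a sketch; fail() here is JUnit's Assert.fail, which these tests already use per the stack trace above:

protected void pollForModelState(String modelId, MLModelState expectedModelState, int intervalMs, int maxAttempts)
    throws InterruptedException {
    for (int i = 0; i < maxAttempts; i++) {
        Thread.sleep(intervalMs);
        if (getModelState(modelId) == expectedModelState) {
            return;
        }
    }
    // Fail loudly with context instead of returning silently when the state never flips.
    fail(String.format("Model %s did not reach state %s after %d attempts (%d ms apart)",
        modelId, expectedModelState, maxAttempts, intervalMs));
}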

Author

Addressed

.filter(
    hitsMap -> !Objects.isNull(hitsMap)
        && hitsMap.containsKey("model_id")
        && MLModelState.DEPLOYED.equals(getModelState(hitsMap.get("model_id").toString()))
Collaborator

I think MLModelState is an enum, so you can use == to compare two enum values, which adds a compile-time check on the types being compared.
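
A tiny, self-contained illustration of the point, using a stand-in enum rather than the real MLModelState class:

public class EnumCompareExample {
    enum State { DEPLOYED, UNDEPLOYED }

    public static void main(String[] args) {
        State current = State.DEPLOYED;
        // '==' only compiles when both operands are of the same enum type, and is null-safe.
        System.out.println(current == State.DEPLOYED);       // true
        // equals(Object) also works here, but accepts any argument type without a compile error.
        System.out.println(State.DEPLOYED.equals(current));  // true
    }
}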

Author

Done

Comment on lines 747 to 750
.filter(
    hitsMap -> !Objects.isNull(hitsMap)
        && hitsMap.containsKey("model_id")
        && MLModelState.DEPLOYED.equals(getModelState(hitsMap.get("model_id").toString()))
Collaborator

If no deployed models are found, what is the expectation from the tests?

Author

If no deployed model was found, the test will fail in the assertion here: https://github.com/opensearch-project/neural-search/blob/main/src/test/java/org/opensearch/neuralsearch/common/BaseNeuralSearchIT.java#L748
As I mentioned below, there seems to be a latency between the task completing and model_state updating to DEPLOYED, so I added a poll for the model state there as well to avoid the issue.

@tanqiuliu
Author

tanqiuliu commented Nov 8, 2023

@martin-gaievski This is another flaky issue. It previously appeared as "Model not ready yet.", as I mentioned in the PR description, and changed to this assertion failure once I added a filter on the model state. The root cause seems to be that during the initialization of an IT, loadModel() is invoked to deploy the model and poll for the task to complete. However, when the task is updated to COMPLETED, there can be a latency before model_state shows as DEPLOYED, and before that the model is not usable.
I added a poll for model_state = DEPLOYED in loadModel() to avoid running into such cases.
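
A hedged sketch of the loadModel() flow being described; the deploy and task-polling helpers below are placeholders, and the timing values are assumptions, while pollForModelState mirrors the helper added in this PR:

protected void loadModel(String modelId) throws Exception {
    triggerModelDeployment(modelId);   // placeholder for the existing deploy request
    waitForTaskToComplete(modelId);    // placeholder for the existing task polling (task -> COMPLETED)
    // New: the task can be COMPLETED slightly before the model document reports DEPLOYED,
    // so also poll the model state before any test issues a neural query.
    pollForModelState(modelId, Set.of(MLModelState.DEPLOYED), 1000, 10);
}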

Signed-off-by: Tanqiu Liu <liutanqiu@gmail.com>
// after model undeploy returns, the max interval to update model status is 3s in ml-commons CronJob.
Thread.sleep(3000);
// wait for model undeploy to complete
pollForModelState(modelId, Set.of(MLModelState.UNDEPLOYED, MLModelState.DEPLOY_FAILED), 3000, 5);
Author

Sometimes the undeploy action results in a DEPLOY_FAILED state, but this does not block the model from being deleted, so I set both UNDEPLOYED and DEPLOY_FAILED as exit states.

@tanqiuliu
Author

@navneet1v Hi Navneet, I've addressed your comments. Will you be able to take a look?

@tanqiuliu
Author

All checks have passed. Will I be able to get a review here?

@@ -623,6 +626,28 @@ protected void deleteModel(String modelId) {
);
}

protected void pollForModelState(String modelId, Set<MLModelState> exitModelStates, int intervalMs, int maxAttempts)
Member

Can we make interval and maxAttempts class-level constants, like DEFAULT_<item_name>, and remove them from the method signature? They can be added back later if needed, but most of the time I think the defaults will just work.
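
A sketch of that shape; the constant values are assumptions, and fail() is JUnit's Assert.fail:

// Sketch only: polling knobs as class-level defaults instead of method parameters.
private static final int DEFAULT_POLL_INTERVAL_MS = 3000;
private static final int DEFAULT_MAX_POLL_ATTEMPTS = 5;

protected void pollForModelState(String modelId, Set<MLModelState> exitModelStates) throws InterruptedException {
    for (int i = 0; i < DEFAULT_MAX_POLL_ATTEMPTS; i++) {
        Thread.sleep(DEFAULT_POLL_INTERVAL_MS);
        if (exitModelStates.contains(getModelState(modelId))) {
            return;
        }
    }
    fail("Model " + modelId + " did not reach any of " + exitModelStates + " within the default poll window");
}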

@@ -623,6 +626,28 @@ protected void deleteModel(String modelId) {
);
}

protected void pollForModelState(String modelId, Set<MLModelState> exitModelStates, int intervalMs, int maxAttempts)
throws InterruptedException {
MLModelState currentState = null;
Member

Can we move this initialization into the loop? It's not required to have it at the top-level scope.
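
For example (sketch only, body fragment of the method in the diff above):

for (int i = 0; i < maxAttempts; i++) {
    Thread.sleep(intervalMs);
    // Declared inside the loop: the value is only needed within a single iteration.
    MLModelState currentState = getModelState(modelId);
    if (exitModelStates.contains(currentState)) {
        return;
    }
}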

@@ -733,11 +758,33 @@ protected Set<String> findDeployedModels() {
List<Map<String, Object>> innerHitsMap = (List<Map<String, Object>>) hits.get("hits");
return innerHitsMap.stream()
.map(hit -> (Map<String, Object>) hit.get("_source"))
.filter(hitsMap -> !Objects.isNull(hitsMap) && hitsMap.containsKey("model_id"))
.filter(
hitsMap -> !Objects.isNull(hitsMap)
Member

Objects.notNull()
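
Presumably this means java.util.Objects::nonNull as a method reference. A sketch of the filter chain; the trailing map/collect and imports are assumptions, since findDeployedModels() returns a Set<String>:

return innerHitsMap.stream()
    .map(hit -> (Map<String, Object>) hit.get("_source"))
    .filter(Objects::nonNull)
    .filter(hitsMap -> hitsMap.containsKey("model_id")
        && getModelState(hitsMap.get("model_id").toString()) == MLModelState.DEPLOYED)
    .map(hitsMap -> hitsMap.get("model_id").toString())
    .collect(Collectors.toSet());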

EntityUtils.toString(getModelResponse.getEntity()),
false
);
return MLModelState.valueOf((String) getModelResponseJson.get("model_state"));
Member

Can we also check whether the "model_state" key is present in the response? I can imagine that in case of a service or network error it could be missing; we can assert on this.
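
A sketch of the suggested guard, assuming the JUnit assertions already used in this test class:

// Assert the key is present before converting, so a service or network hiccup yields a
// clear test failure instead of an exception from MLModelState.valueOf(null).
assertTrue(
    "get-model response is missing 'model_state': " + getModelResponseJson,
    getModelResponseJson.containsKey("model_state")
);
return MLModelState.valueOf((String) getModelResponseJson.get("model_state"));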

@@ -7,6 +7,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
### Features
### Enhancements
### Bug Fixes
- Fixed flaky integration tests caused by model_state transition latency.
Member

Please include a link to the PR.

@navneet1v
Collaborator

@tanqiuliu are you still working on the PR? We have to fix these flaky tests for 2.12. Please respond if you are still working on the PR.

@tanqiuliu
Author

@tanqiuliu are you still working on the PR? We have to fix these flaky tests for 2.12. Please respond if you are still working on the PR.

I can address the comments and update the PR when I have time, probably this weekend.

@navneet1v
Collaborator

Closing this PR in favor of #559. Please refer to GH issue #384 for more details.

@navneet1v navneet1v closed this Jan 30, 2024