
Conversation

@pxsalehi
Member

@pxsalehi pxsalehi commented Nov 18, 2022

Enable more logging to verify whether assertAllCancellableTasksAreCancelled is able to always see the cancellable tasks. Also, ensure the task to be cancelled is already on the master to mitigate the cases where a quick cancellation cleans up tasks before the assertion is able to verify their existence.

Closes #91248

@pxsalehi pxsalehi added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. labels Nov 18, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Nov 18, 2022
@pxsalehi pxsalehi added v7.17.8 v8.6.1 Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.7.0 and removed Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.7.0 labels Nov 18, 2022
@pxsalehi pxsalehi requested a review from fcofdez November 21, 2022 09:46
@pxsalehi pxsalehi changed the title Better trace logging for RestClusterInfoActionCancellationIT Mitigate failures and add more logging for testGetMappingsCancellation Nov 21, 2022
cancellable.cancel();
// Call the cancellation in a separate thread, which is likely to reduce the rare cases where the cancellation
// finishes and cleans up the tasks, resulting in no task being found for the action at all.
new Thread(cancellable::cancel).start();
Contributor


I don't think this will solve the issue. What's happening is that we're hitting the default master timeout (30s), and the task gets removed before we remove the cluster block. We should increase the master_timeout if we think this thread takes that long to make progress after we've cancelled the task. Happy to jump on a call to clarify this. 👍

Member Author

@pxsalehi pxsalehi Nov 22, 2022


We discussed this with @fcofdez. The issue seems to be related to the weak preconditions that awaitTaskWithPrefix enforces. In particular, for MasterNodeActions, if the task to be cancelled hits a non-master node first and is then sent to the master, a cancellation could unregister the task on the master, leading to the assertion not seeing any cancellable tasks. Making sure we wait until the task is visible on the master mitigates these cases. It is still possible that the task gets unregistered before actually running on the master, but that is much less likely.

assertThat(future.isDone(), equalTo(false));
awaitTaskWithPrefix(actionName);
awaitTaskWithPrefixOnMaster(actionName);
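A minimal standalone sketch of the idea behind `awaitTaskWithPrefixOnMaster` (the real helper lives in the Elasticsearch test framework and queries the master's TaskManager; the names `awaitCondition` and `taskRegisteredOnMaster` below are illustrative, not actual Elasticsearch APIs):

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

public class Main {
    // Poll the condition every 10 ms until it holds or timeoutMillis elapses,
    // similar in spirit to a busy-wait assertion helper.
    static boolean awaitCondition(BooleanSupplier condition, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(10);
        }
        return condition.getAsBoolean();
    }

    public static void main(String[] args) throws Exception {
        // Simulates the task becoming visible on the master after a short delay;
        // only once this flips to true is it safe to cancel without racing the
        // "cancellable tasks exist" assertion.
        AtomicBoolean taskRegisteredOnMaster = new AtomicBoolean(false);
        Thread master = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException ignored) {
            }
            taskRegisteredOnMaster.set(true);
        });
        master.start();
        boolean seen = awaitCondition(taskRegisteredOnMaster::get, 5000);
        master.join();
        System.out.println("task visible on master: " + seen);
    }
}
```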

Member Author


I tried to verify that the task had been attempted by checking the MANAGEMENT pool's queue size or completed-task count. The former is mostly empty and doesn't pass, and the latter is just some large arbitrary number, so asserting >0 or non-empty doesn't really help here.

Member Author


@fcofdez As discussed, let's go with this mitigation and leave the logs in place; if this doesn't help, I guess we need to rewrite the test somehow. I believe this would still be very helpful, since the test was already failing only rarely.

@pxsalehi pxsalehi requested a review from fcofdez November 22, 2022 16:57
@pxsalehi pxsalehi changed the title Mitigate failures and add more logging for testGetMappingsCancellation Wait for task on master in testGetMappingsCancellation Nov 22, 2022
Contributor

@fcofdez fcofdez left a comment


LGTM 👍. We should keep an eye on this and remove the extra logging once we're confident that this is not an issue anymore.

Comment on lines +81 to +92
awaitTaskWithPrefixOnMaster(actionName);
// To ensure that the task is executing on master, we wait until the first blocked execution of the task registers its cluster state
// observer for further retries. This ensures that a task is not cancelled before we have started its execution, which could result
// in the task being unregistered and the test not being able to find any cancelled tasks.
assertBusy(
() -> assertThat(
internalCluster().getCurrentMasterNodeInstance(ClusterService.class)
.getClusterApplierService()
.getTimeoutClusterStateListenersSize(),
Matchers.greaterThan(0)
)
);
Member Author


I'll look into where else we might need this, on the master or on a non-master node receiving the task, and move/merge these asserts if necessary in a follow-up.

Contributor

@fcofdez fcofdez left a comment


LGTM.


@ESIntegTestCase.ClusterScope(scope = ESIntegTestCase.Scope.TEST, numDataNodes = 0, numClientNodes = 0)
@TestLogging(value = "org.elasticsearch.tasks.TaskManager:TRACE,org.elasticsearch.test.TaskAssertions:TRACE", reason = "debugging")
@TestLogging(value = "org.elasticsearch.tasks:TRACE,org.elasticsearch.test.TaskAssertions:TRACE", reason = "debugging")
Contributor


I guess we could remove the logging here?

…sterInfoActionCancellationIT-testGetMappingsCancellation
@pxsalehi
Member Author

Thanks Francisco!

@pxsalehi pxsalehi merged commit 454b3e6 into elastic:main Nov 23, 2022
pxsalehi added a commit to pxsalehi/elasticsearch that referenced this pull request Nov 24, 2022
pxsalehi added a commit to pxsalehi/elasticsearch that referenced this pull request Nov 24, 2022
elasticsearchmachine pushed a commit that referenced this pull request Nov 25, 2022
#91916) (#91926)

* Wait for task on master in testGetMappingsCancellation (#91709) (#91916)

* replace List.of usage

Labels

:Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test Issues or PRs that are addressing/adding tests v7.17.8 v8.6.1 v8.7.0


Development

Successfully merging this pull request may close these issues.

[CI] RestClusterInfoActionCancellationIT testGetMappingsCancellation failing

3 participants