
Conversation

@pxsalehi
Member

@pxsalehi pxsalehi commented Nov 18, 2022

Enable more logging to verify whether assertAllCancellableTasksAreCancelled is able to always see the cancellable tasks. Also, ensure the task to be cancelled is already on the master to mitigate the cases where a quick cancellation cleans up tasks before the assertion is able to verify their existence.

Closes #91248

@pxsalehi pxsalehi added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. labels Nov 18, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Nov 18, 2022
@pxsalehi pxsalehi added v7.17.8 v8.6.1 Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.7.0 and removed Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.7.0 labels Nov 18, 2022
@pxsalehi pxsalehi requested a review from fcofdez November 21, 2022 09:46
@pxsalehi pxsalehi changed the title Better trace logging for RestClusterInfoActionCancellationIT Mitigate failures and add more logging for testGetMappingsCancellation Nov 21, 2022
cancellable.cancel();
// Call the cancellation in a separate thread, which is likely to reduce the rare cases where the cancellation
// finishes and cleans up the tasks, resulting in no task being found for the action at all.
new Thread(cancellable::cancel).start();
Contributor


I don't think this will solve the issue. What's happening is that we're hitting the default master timeout (30s), and the task gets removed before we remove the cluster block. We should increase the master_timeout if we think this thread takes that long to make progress after we've cancelled the task. Happy to jump on a call to clarify this. 👍

Member Author

@pxsalehi pxsalehi Nov 22, 2022


We discussed this with @fcofdez. The issue seems to be related to the weak preconditions that awaitTaskWithPrefix enforces. In particular, for MasterNodeActions, if the task to be cancelled hits a non-master node first and is then sent to the master, a cancellation could unregister the task on the master, leading to the assertion not seeing any cancellable tasks. Making sure we wait until the task is visible on the master mitigates these cases. It is still possible that the task gets unregistered before actually running on the master, but that is much less likely.

assertThat(future.isDone(), equalTo(false));
awaitTaskWithPrefix(actionName);
awaitTaskWithPrefixOnMaster(actionName);
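A minimal standalone sketch of the idea behind `awaitTaskWithPrefixOnMaster` (the real helper lives in the Elasticsearch test framework and queries the master's TaskManager; the names `awaitCondition` and `taskRegisteredOnMaster` below are illustrative, not actual Elasticsearch APIs):

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

public class Main {
    // Poll the condition every 10 ms until it holds or timeoutMillis elapses,
    // similar in spirit to a busy-wait assertion helper.
    static boolean awaitCondition(BooleanSupplier condition, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(10);
        }
        return condition.getAsBoolean();
    }

    public static void main(String[] args) throws Exception {
        // Simulates the task becoming visible on the master after a short delay;
        // only once this flips to true is it safe to cancel without racing the
        // "cancellable tasks exist" assertion.
        AtomicBoolean taskRegisteredOnMaster = new AtomicBoolean(false);
        Thread master = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException ignored) {
            }
            taskRegisteredOnMaster.set(true);
        });
        master.start();
        boolean seen = awaitCondition(taskRegisteredOnMaster::get, 5000);
        master.join();
        System.out.println("task visible on master: " + seen);
    }
}
```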

Member Author


I tried to verify that the task had been attempted by checking the MANAGEMENT pool's queue size or completed-task count. The former is mostly empty and doesn't pass, and the latter is just some large arbitrary number, so asserting >0 or non-empty doesn't really help here.

Member Author


@fcofdez As discussed, let's go with this mitigation and leave the logs in place; if this doesn't help, I guess we need to rewrite the test somehow. I believe this would still be very helpful, since the test was already failing only rarely.

@pxsalehi pxsalehi requested a review from fcofdez November 22, 2022 16:57
@pxsalehi pxsalehi changed the title Mitigate failures and add more logging for testGetMappingsCancellation Wait for task on master in testGetMappingsCancellation Nov 22, 2022
Contributor

@fcofdez fcofdez left a comment


LGTM 👍. We should keep an eye on this and remove the extra logging once we're confident that this is not an issue anymore.

Comment on lines +81 to +92
awaitTaskWithPrefixOnMaster(actionName);
// To ensure that the task is executing on master, we wait until the first blocked execution of the task registers its cluster state
// observer for further retries. This ensures that a task is not cancelled before we have started its execution, which could result
// in the task being unregistered and the test not being able to find any cancelled tasks.
assertBusy(
() -> assertThat(
internalCluster().getCurrentMasterNodeInstance(ClusterService.class)
.getClusterApplierService()
.getTimeoutClusterStateListenersSize(),
Matchers.greaterThan(0)
)
);
Member Author


I'll look into where else we might need this, on the master or on a non-master node receiving the task, and move/merge these asserts if necessary in a follow-up.

Contributor

@fcofdez fcofdez left a comment


LGTM.


@ESIntegTestCase.ClusterScope(scope = ESIntegTestCase.Scope.TEST, numDataNodes = 0, numClientNodes = 0)
@TestLogging(value = "org.elasticsearch.tasks.TaskManager:TRACE,org.elasticsearch.test.TaskAssertions:TRACE", reason = "debugging")
@TestLogging(value = "org.elasticsearch.tasks:TRACE,org.elasticsearch.test.TaskAssertions:TRACE", reason = "debugging")
Contributor


I guess we could remove the logging here?

…sterInfoActionCancellationIT-testGetMappingsCancellation
@pxsalehi
Member Author

Thanks Francisco!

@pxsalehi pxsalehi merged commit 454b3e6 into elastic:main Nov 23, 2022
pxsalehi added a commit to pxsalehi/elasticsearch that referenced this pull request Nov 24, 2022
pxsalehi added a commit to pxsalehi/elasticsearch that referenced this pull request Nov 24, 2022
elasticsearchmachine pushed a commit that referenced this pull request Nov 25, 2022
#91916) (#91926)

* Wait for task on master in testGetMappingsCancellation (#91709) (#91916)

* replace List.of usage

Labels

:Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test Issues or PRs that are addressing/adding tests v7.17.8 v8.6.1 v8.7.0


Development

Successfully merging this pull request may close these issues.

[CI] RestClusterInfoActionCancellationIT testGetMappingsCancellation failing

3 participants