Wait for task on master in testGetMappingsCancellation #91709
Conversation
Pinging @elastic/es-distributed (Team:Distributed)
```diff
- cancellable.cancel();
+ // Call the cancellation in a separate thread which is likely to reduce the rare cases where the cancellation
+ // could finish and clean up the tasks, and result in finding no task for the action at all.
+ new Thread(cancellable::cancel).start();
```
I don't think this will solve the issue. What's happening here is that we're hitting the default master timeout (30s), so the task gets removed before we remove the cluster block. I think we should increase the master_timeout if we expect this thread to take that long to make progress after we've cancelled the task. Happy to jump on a call to clarify this. 👍
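For reference, a minimal sketch of what raising the master timeout could look like, assuming an ESIntegTestCase-style test and the `GetMappingsRequest` API of the 8.x line; the actual test in this PR builds its request differently, so treat this as illustrative only:

```java
import org.elasticsearch.action.admin.indices.mapping.get.GetMappingsRequest;
import org.elasticsearch.core.TimeValue;

// Illustrative sketch (not the PR's code): raising master_timeout on the request means a
// slow test thread no longer hits the default 30s master timeout before it can cancel.
static GetMappingsRequest getMappingsWithLongerMasterTimeout(String index) {
    return new GetMappingsRequest().indices(index) // index name is a placeholder
        .masterNodeTimeout(TimeValue.timeValueMinutes(2)); // raised from the 30s default
}
```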
We discussed this with @fcofdez. The issue seems to be related to the weak preconditions that awaitTaskWithPrefix enforces. Specifically, for MasterNodeActions, if the task to be cancelled hits a non-master node first and then gets sent to the master, a cancellation could unregister the task on the master, leading to the assertion not seeing any cancellable tasks. Making sure we wait until we see the task on the master mitigates these cases. Although it is still possible for the task to be unregistered before actually running on the master, it is much less likely.
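For context, a minimal sketch of what waiting for the task on the elected master can look like, assuming an ESIntegTestCase context; this is not the verbatim helper added by the PR:

```java
import org.elasticsearch.transport.TransportService;

// Hypothetical helper: poll the elected master's TaskManager until a task whose
// action starts with the given prefix shows up there.
private static void awaitTaskWithPrefixOnMaster(String actionPrefix) throws Exception {
    assertBusy(() -> {
        TransportService transportService = internalCluster().getInstance(
            TransportService.class,
            internalCluster().getMasterName()
        );
        assertTrue(
            "found no task with prefix [" + actionPrefix + "] on master",
            transportService.getTaskManager()
                .getTasks()
                .values()
                .stream()
                .anyMatch(task -> task.getAction().startsWith(actionPrefix))
        );
    });
}
```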
```diff
  assertThat(future.isDone(), equalTo(false));
- awaitTaskWithPrefix(actionName);
+ awaitTaskWithPrefixOnMaster(actionName);
```
I tried to verify that the task has been attempted by checking the MANAGEMENT pool's queue size or completed-task count. The former is mostly empty and doesn't pass, and the latter is just some large arbitrary number. So asserting > 0 or not-empty doesn't really help here.
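For illustration, a sketch of that rejected approach, assuming an ESIntegTestCase context and the `ThreadPoolStats` accessors of the 8.x line at the time of this PR. As noted above, neither number gives a usable signal, which is why the PR waits for the task on the master instead:

```java
import org.elasticsearch.threadpool.ThreadPool;
import org.elasticsearch.threadpool.ThreadPoolStats;

// Hypothetical sketch of the rejected verification: inspect the master's MANAGEMENT pool.
private void logManagementPoolStats() {
    ThreadPool threadPool = internalCluster().getInstance(ThreadPool.class, internalCluster().getMasterName());
    for (ThreadPoolStats.Stats stats : threadPool.stats()) {
        if (ThreadPool.Names.MANAGEMENT.equals(stats.getName())) {
            int queued = stats.getQueue();         // mostly 0: tasks drain too quickly to observe
            long completed = stats.getCompleted(); // cumulative count: just a large arbitrary number
            logger.info("management pool: queued={}, completed={}", queued, completed);
        }
    }
}
```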
@fcofdez As discussed, let's go with this mitigation and leave the logs in; if this doesn't help, I guess we need to somehow rewrite the test. I believe this will still be very helpful, since the test was already failing only rarely.
LGTM 👍. We should keep an eye on this and remove the extra logging once we're confident that this is not an issue anymore.
```diff
  awaitTaskWithPrefixOnMaster(actionName);
+ // To ensure that the task is executing on master, we wait until the first blocked execution of the task registers its cluster state
+ // observer for further retries. This ensures that a task is not cancelled before we have started its execution, which could result
+ // in the task being unregistered and the test not being able to find any cancelled tasks.
+ assertBusy(
+     () -> assertThat(
+         internalCluster().getCurrentMasterNodeInstance(ClusterService.class)
+             .getClusterApplierService()
+             .getTimeoutClusterStateListenersSize(),
+         Matchers.greaterThan(0)
+     )
+ );
```
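For background, a hypothetical illustration of the retry path this assertion hooks into: when a master-node action hits a retryable cluster block, it waits on a `ClusterStateObserver`, whose timeout listener is registered with the ClusterApplierService; that listener count is what `getTimeoutClusterStateListenersSize()` exposes above. The names below mirror the production `ClusterStateObserver` API, but the wiring is simplified:

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.ClusterStateObserver;
import org.elasticsearch.cluster.service.ClusterService;
import org.elasticsearch.common.util.concurrent.ThreadContext;
import org.elasticsearch.core.TimeValue;

// Simplified sketch of how a blocked master-node action waits for the next cluster state.
final class BlockedActionRetrySketch {

    private static final Logger logger = LogManager.getLogger(BlockedActionRetrySketch.class);

    static void retryWhenStateChanges(ClusterService clusterService, ThreadContext threadContext, Runnable retry) {
        ClusterStateObserver observer = new ClusterStateObserver(
            clusterService,
            TimeValue.timeValueSeconds(30), // the default master timeout discussed earlier
            logger,
            threadContext
        );
        observer.waitForNextChange(new ClusterStateObserver.Listener() {
            @Override
            public void onNewClusterState(ClusterState state) {
                retry.run(); // the block may have been lifted: retry against the new state
            }

            @Override
            public void onClusterServiceClose() {
                // node is shutting down: the real action fails its listener here
            }

            @Override
            public void onTimeout(TimeValue timeout) {
                // master timeout elapsed: the real action fails with a timeout exception,
                // which is the 30s failure mode discussed in this thread
            }
        });
    }
}
```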
I'll look into where else we might need this (on the master, or on a non-master node receiving the task), and move/merge these asserts if necessary in a follow-up.
fcofdez left a comment:
LGTM.
```diff
  @ESIntegTestCase.ClusterScope(scope = ESIntegTestCase.Scope.TEST, numDataNodes = 0, numClientNodes = 0)
- @TestLogging(value = "org.elasticsearch.tasks.TaskManager:TRACE,org.elasticsearch.test.TaskAssertions:TRACE", reason = "debugging")
+ @TestLogging(value = "org.elasticsearch.tasks:TRACE,org.elasticsearch.test.TaskAssertions:TRACE", reason = "debugging")
```
I guess we could remove the logging here?
Thanks Francisco!
Enable more logging to verify whether assertAllCancellableTasksAreCancelled is able to always see the cancellable tasks. Also, ensure the task to be cancelled is already on the master to mitigate the cases where a quick cancellation cleans up tasks before the assertion is able to verify their existence.

Closes #91248
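A hedged sketch of what the assertion referenced above checks, assuming an ESIntegTestCase context; the real helper lives in the test framework's TaskAssertions, so this is an approximation, not the verbatim implementation:

```java
import org.elasticsearch.tasks.CancellableTask;
import org.elasticsearch.transport.TransportService;

// Hypothetical helper: at least one cancellable task for the action must be visible
// across the nodes' TaskManagers, and every matching task must be flagged as cancelled.
private static void assertAllCancellableTasksAreCancelled(String actionName) throws Exception {
    assertBusy(() -> {
        boolean foundTask = false;
        for (TransportService transportService : internalCluster().getInstances(TransportService.class)) {
            for (CancellableTask task : transportService.getTaskManager().getCancellableTasks().values()) {
                if (task.getAction().startsWith(actionName)) {
                    foundTask = true;
                    assertTrue("task " + task.getId() + " was not cancelled", task.isCancelled());
                }
            }
        }
        assertTrue("found no cancellable tasks for [" + actionName + "]", foundTask);
    });
}
```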