Description
Elasticsearch version: (7.6.2)
Plugins installed: [analysis-ik]
JVM version (openjdk 13.0.2):
OS version (Linux n23-161-209 4.9.0-8-amd64 #1 SMP Debian 4.9.110-3+deb9u6 (2018-10-08) x86_64 GNU/Linux):
Description of the problem including expected versus actual behavior:
We expect the long-running aggregation search task will be cancelled immediately when the channel closed(#43332), but the actual behavior is that when we cancel the client request on the query phase it can cancel the task as we expected and throw the exception '...Caused by: java.lang.IllegalStateException: Task cancelled before it started: by user request' and the search task is paused. But if we cancel the client request during reduce phase, the log will be 'Received ban for the parent [N78L0bWWQEuSLPvitGDBxw:10714] on the node [N78L0bWWQEuSLPvitGDBxw], reason: [by user request]' but the search task is still running.
Steps to reproduce:
Please include a minimal but complete recreation of the problem,
including (e.g.) index creation, mappings, settings, query etc. The easier
you make for us to reproduce it, the more likely that somebody will take the
time to look at it.
- the query body is as follows:
{
"query": {
"bool": {
"filter": [
{
"terms": {
"insurance_id": [
111,
222,
333,
444,
555,
666,
777
...(150 more)
]
}
},
{
"range": {
"start_time": {
"lt": "2020-12-31 00:00:00",
"gte": "2020-12-01 00:00:00",
"time_zone": "+00:00"
}
}
}
]
}
},
"aggs": {
"start_time": {
"date_histogram": {
"field": "start_time",
"interval": "day",
"extended_bounds": {
"max": "2020-12-31 00:00:00",
"min": "2020-12-01 00:00:00"
},
"min_doc_count": 0,
"format": "yyyy-MM-dd HH:mm:ss"
},
"aggs": {
"address_id": {
"terms": {
"field": "address_id",
"size": 100000
},
"aggs": {
"age": {
"sum": {
"field": "age"
}
},
"salary": {
"sum": {
"field": "salary"
}
},
"height": {
"sum": {
"field": "height"
}
},
"weight": {
"sum": {
"field": "weight"
}
},
"english_score": {
"sum": {
"field": "english_score"
}
},
"math_score": {
"sum": {
"field": "math_score"
}
}
}
}
}
}
},
"size": 0
}
- request params:
allow_partial_search_results=false,
timeout=1000ms
- index shards: 8 primary + 8 replica
Provide logs (if relevant): I take notes about the logs on coordinating node.
(1) when I cancel the client request on query phase, the log is:
org.elasticsearch.action.search.SearchPhaseExecutionException: Partial shards failure
at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:550) [elasticsearch-7.6.2-SNAPSHOT.jar:7.6.2-SNAPSHOT]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:327) [elasticsearch-7.6.2-SNAPSHOT.jar:7.6.2-SNAPSHOT]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:579) [elasticsearch-7.6.2-SNAPSHOT.jar:7.6.2-SNAPSHOT]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.successfulShardExecution(AbstractSearchAsyncAction.java:502) [elasticsearch-7.6.2-SNAPSHOT.jar:7.6.2-SNAPSHOT]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardResult(AbstractSearchAsyncAction.java:490) [elasticsearch-7.6.2-SNAPSHOT.jar:7.6.2-SNAPSHOT]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.access$000(AbstractSearchAsyncAction.java:67) [elasticsearch-7.6.2-SNAPSHOT.jar:7.6.2-SNAPSHOT]
...
Caused by: org.elasticsearch.ElasticsearchException$1: Task cancelled before it started: by user request
at org.elasticsearch.ElasticsearchException.guessRootCauses(ElasticsearchException.java:644) ~[elasticsearch-7.6.2-SNAPSHOT.jar:7.6.2-SNAPSHOT]
at org.elasticsearch.action.search.SearchPhaseExecutionException.guessRootCauses(SearchPhaseExecutionException.java:167) ~[elasticsearch-7.6.2-SNAPSHOT.jar:7.6.2-SNAPSHOT]
at org.elasticsearch.action.search.SearchPhaseExecutionException.getCause(SearchPhaseExecutionException.java:112) ~[elasticsearch-7.6.2-SNAPSHOT.jar:7.6.2-SNAPSHOT]
at org.apache.logging.log4j.core.impl.ThrowableProxy.<init>(ThrowableProxy.java:139) [log4j-core-2.11.1.jar:2.11.1]
at org.apache.logging.log4j.core.impl.ThrowableProxy.<init>(ThrowableProxy.java:122) [log4j-core-2.11.1.jar:2.11.1]
at org.apache.logging.log4j.core.impl.MutableLogEvent.getThrownProxy(MutableLogEvent.java:351) [log4j-core-2.11.1.jar:2.11.1]
...
the search task is cancelled while I cancel the client request, so I don't take notes of time.
(2) when I cancel the client request on reduce phase, the log is:
[2021-03-12T17:17:30,201][DEBUG][o.e.a.s.TransportSearchAction] AbstractSearchAsyncAction.run begin, current time is [48568493537953821]
[2021-03-12T17:17:42,463][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] Received ban for the parent [N78L0bWWQEuSLPvitGDBxw:10714] on the node [N78L0bWWQEuSLPvitGDBxw], reason: [by user request]
[2021-03-12T17:18:05,123][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] Sending remove ban for tasks with the parent [N78L0bWWQEuSLPvitGDBxw:10714] to the node [3gtuxzK2SY-zuO-uwXsA9A]
[2021-03-12T17:18:05,122][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] Sending remove ban for tasks with the parent [N78L0bWWQEuSLPvitGDBxw:10714] to the node [qRGABA5GQayYidPQuW1Uow]
[2021-03-12T17:18:05,122][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] Sending remove ban for tasks with the parent [N78L0bWWQEuSLPvitGDBxw:10714] to the node [CTb_1tG5RUSSAUp1Kzlxuw]
[2021-03-12T17:18:05,122][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] Sending remove ban for tasks with the parent [N78L0bWWQEuSLPvitGDBxw:10714] to the node [3ruMY6IJQs6yKQtPGLF7kw]
...
the process lasts nearly 35 seconds.
(3) If I don't cancel the client request, the response is:
"took" : 36027,
"timed_out" : false,
"_shards" : {
"total" : 240,
"successful" : 240,
"skipped" : 0,
"failed" : 0
},
the process lasts nearly 36 seconds. So you can see when I cancel the client request on reduce phase, it still cost such a long time as I don't cancel the client request.