Skip to content

Automatic task cancellation mechanism is invalid on coordinating node's reduce phase #70347

Closed
@lqbilbo

Description

@lqbilbo

Elasticsearch version: (7.6.2)

Plugins installed: [analysis-ik]

JVM version (openjdk 13.0.2):

OS version (Linux n23-161-209 4.9.0-8-amd64 #1 SMP Debian 4.9.110-3+deb9u6 (2018-10-08) x86_64 GNU/Linux):

Description of the problem including expected versus actual behavior:

We expect the long-running aggregation search task will be cancelled immediately when the channel closed(#43332), but the actual behavior is that when we cancel the client request on the query phase it can cancel the task as we expected and throw the exception '...Caused by: java.lang.IllegalStateException: Task cancelled before it started: by user request' and the search task is paused. But if we cancel the client request during reduce phase, the log will be 'Received ban for the parent [N78L0bWWQEuSLPvitGDBxw:10714] on the node [N78L0bWWQEuSLPvitGDBxw], reason: [by user request]' but the search task is still running.

Steps to reproduce:

Please include a minimal but complete recreation of the problem,
including (e.g.) index creation, mappings, settings, query etc. The easier
you make for us to reproduce it, the more likely that somebody will take the
time to look at it.

  1. the query body is as follows:
{
   "query": {
       "bool": {
           "filter": [
               {
                   "terms": {
                       "insurance_id": [
                           111,
                           222,
                           333,
                           444,
                           555,
                           666,
                           777
                           ...(150 more)
                       ]
                   }
               },
               {
                   "range": {
                       "start_time": {
                           "lt": "2020-12-31 00:00:00",
                           "gte": "2020-12-01 00:00:00",
                           "time_zone": "+00:00"
                       }
                   }
               }
           ]
       }
   },
   "aggs": {
       "start_time": {
           "date_histogram": {
               "field": "start_time",
               "interval": "day",
               "extended_bounds": {
                   "max": "2020-12-31 00:00:00",
                   "min": "2020-12-01 00:00:00"
               },
               "min_doc_count": 0,
               "format": "yyyy-MM-dd HH:mm:ss"
           },
           "aggs": {
               "address_id": {
                   "terms": {
                       "field": "address_id",
                       "size": 100000
                   },
                   "aggs": {
                       "age": {
                           "sum": {
                               "field": "age"
                           }
                       },
                       "salary": {
                           "sum": {
                               "field": "salary"
                           }
                       },
                       "height": {
                           "sum": {
                               "field": "height"
                           }
                       },
                       "weight": {
                           "sum": {
                               "field": "weight"
                           }
                       },
                       "english_score": {
                           "sum": {
                               "field": "english_score"
                           }
                       },
                       "math_score": {
                           "sum": {
                               "field": "math_score"
                           }
                       }
                   }
               }
           }
       }
   },
   "size": 0
}
  1. request params:
allow_partial_search_results=false,
timeout=1000ms
  1. index shards: 8 primary + 8 replica

Provide logs (if relevant): I take notes about the logs on coordinating node.
(1) when I cancel the client request on query phase, the log is:

org.elasticsearch.action.search.SearchPhaseExecutionException: Partial shards failure
   at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:550) [elasticsearch-7.6.2-SNAPSHOT.jar:7.6.2-SNAPSHOT]
   at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:327) [elasticsearch-7.6.2-SNAPSHOT.jar:7.6.2-SNAPSHOT]
   at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:579) [elasticsearch-7.6.2-SNAPSHOT.jar:7.6.2-SNAPSHOT]
   at org.elasticsearch.action.search.AbstractSearchAsyncAction.successfulShardExecution(AbstractSearchAsyncAction.java:502) [elasticsearch-7.6.2-SNAPSHOT.jar:7.6.2-SNAPSHOT]
   at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardResult(AbstractSearchAsyncAction.java:490) [elasticsearch-7.6.2-SNAPSHOT.jar:7.6.2-SNAPSHOT]
   at org.elasticsearch.action.search.AbstractSearchAsyncAction.access$000(AbstractSearchAsyncAction.java:67) [elasticsearch-7.6.2-SNAPSHOT.jar:7.6.2-SNAPSHOT]
   ...

Caused by: org.elasticsearch.ElasticsearchException$1: Task cancelled before it started: by user request
   at org.elasticsearch.ElasticsearchException.guessRootCauses(ElasticsearchException.java:644) ~[elasticsearch-7.6.2-SNAPSHOT.jar:7.6.2-SNAPSHOT]
   at org.elasticsearch.action.search.SearchPhaseExecutionException.guessRootCauses(SearchPhaseExecutionException.java:167) ~[elasticsearch-7.6.2-SNAPSHOT.jar:7.6.2-SNAPSHOT]
   at org.elasticsearch.action.search.SearchPhaseExecutionException.getCause(SearchPhaseExecutionException.java:112) ~[elasticsearch-7.6.2-SNAPSHOT.jar:7.6.2-SNAPSHOT]
   at org.apache.logging.log4j.core.impl.ThrowableProxy.<init>(ThrowableProxy.java:139) [log4j-core-2.11.1.jar:2.11.1]
   at org.apache.logging.log4j.core.impl.ThrowableProxy.<init>(ThrowableProxy.java:122) [log4j-core-2.11.1.jar:2.11.1]
   at org.apache.logging.log4j.core.impl.MutableLogEvent.getThrownProxy(MutableLogEvent.java:351) [log4j-core-2.11.1.jar:2.11.1]
   ...

the search task is cancelled while I cancel the client request, so I don't take notes of time.

(2) when I cancel the client request on reduce phase, the log is:

[2021-03-12T17:17:30,201][DEBUG][o.e.a.s.TransportSearchAction] AbstractSearchAsyncAction.run begin, current time is [48568493537953821]
[2021-03-12T17:17:42,463][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] Received ban for the parent [N78L0bWWQEuSLPvitGDBxw:10714] on the node [N78L0bWWQEuSLPvitGDBxw], reason: [by user request]
[2021-03-12T17:18:05,123][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] Sending remove ban for tasks with the parent [N78L0bWWQEuSLPvitGDBxw:10714] to the node [3gtuxzK2SY-zuO-uwXsA9A]
[2021-03-12T17:18:05,122][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] Sending remove ban for tasks with the parent [N78L0bWWQEuSLPvitGDBxw:10714] to the node [qRGABA5GQayYidPQuW1Uow]
[2021-03-12T17:18:05,122][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] Sending remove ban for tasks with the parent [N78L0bWWQEuSLPvitGDBxw:10714] to the node [CTb_1tG5RUSSAUp1Kzlxuw]
[2021-03-12T17:18:05,122][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] Sending remove ban for tasks with the parent [N78L0bWWQEuSLPvitGDBxw:10714] to the node [3ruMY6IJQs6yKQtPGLF7kw]
...

the process lasts nearly 35 seconds.

(3) If I don't cancel the client request, the response is:

"took" : 36027,
  "timed_out" : false,
  "_shards" : {
    "total" : 240,
    "successful" : 240,
    "skipped" : 0,
    "failed" : 0
  },

the process lasts nearly 36 seconds. So you can see when I cancel the client request on reduce phase, it still cost such a long time as I don't cancel the client request.

Metadata

Metadata

Assignees

Labels

:Distributed Coordination/Task ManagementIssues for anything around the Tasks API - both persistent and node level.:Search/SearchSearch-related issues that do not fall into other categoriesTeam:Distributed (Obsolete)Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.Team:SearchMeta label for search team

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions