
[RFC] Detecting search queries which have been cancelled but continues to run for long. #6953

@sgup432

Description

Co-Author: @jainankitk

Problem Statement

As of now in OpenSearch, we have a best-effort mechanism to cancel long-running search queries via the task cancel API POST _tasks/<task_id>/_cancel. We can refer to this as soft cancellation, since the mechanism works by setting a cancellation flag to true and then relying on the search thread to check it at regular intervals. As we have seen in the past, this mechanism doesn't always work: there are cases where a search query continues to run in tight Lucene loops and causes cluster instability.
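To make the cooperative nature of soft cancellation concrete, here is a minimal Java sketch of the pattern. It is illustrative only; the class and method names are assumptions, not the actual OpenSearch CancellableTask/SearchService code.

// Minimal sketch of the soft-cancellation pattern described above.
// Illustrative only: these are not the real OpenSearch classes.
import java.util.concurrent.atomic.AtomicBoolean;

class SoftCancellableSearch {
    private final AtomicBoolean cancelled = new AtomicBoolean(false);

    // Reached via POST _tasks/<task_id>/_cancel: it only flips a flag.
    void cancel() {
        cancelled.set(true);
    }

    // The search thread must cooperate by checking the flag at safe points.
    void run() {
        for (int segment = 0; segment < numSegments(); segment++) {
            if (cancelled.get()) {
                return; // cooperative exit
            }
            // If this call enters a tight Lucene loop, the flag is never
            // re-checked and the query keeps running despite cancellation.
            collectSegment(segment);
        }
    }

    private int numSegments() { return 100; }

    private void collectSegment(int segment) { /* ... */ }
}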

While we eventually want to solve this problem with a "hard" cancellation, we first want to add a mechanism to detect queries which were cancelled and still continue to run for a while. With the introduction of search backpressure, there will be many such cases, and we need a mechanism to track them. This will also help us test the effectiveness of the existing search cancellation, identify tight loops within OpenSearch/Lucene, and eventually build a case for hard cancelling rogue queries.

Note: This problem deals with detecting cancelled queries at the shard level, not at the coordinator level.

Solution

There are two parts to the solution.

  1. The first part involves detecting such long-running cancelled queries.
  2. The second part involves dumping such queries along with their stack traces for later analysis.

We will only focus on Part 1 for now.

Introduction of a new field in the Task API to track time elapsed since a query was cancelled

We will introduce a new variable, cancellationTime, in CancellableTask. Every time a task is cancelled, we will capture this timestamp, which will later be used to calculate the time elapsed after the task was cancelled. We will also expose this information in the Tasks API by introducing a new field, running_time_since_cancellation_nanos.

This will give us additional information to surface such tasks, especially for search. A sample Tasks API response is shown below, followed by a minimal sketch of the change.

"tasks" : {
    "oTUltX4IQMOUUVeiohTt8A:464" : {
    "node" : "oTUltX4IQMOUUVeiohTt8A",
    "id" : 464,
    "type" : "transport",
    "action" : "indices:data/read/search",
    "description" : "indices[test], types[test], search_type[QUERY_THEN_FETCH], source[{\"query\":...}]",
    "start_time_in_millis" : 1483478610008,
    "running_time_in_nanos" : 139913830000,
    "cancellable" : true,
    "cancelled" : true,
    "cancelled_at_millis": 1483479610008
    }
}
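Below is a hedged sketch of the proposed CancellableTask change. The field and accessor names (cancellationTimeNanos, getRunningTimeSinceCancellationNanos) are illustrative assumptions, not the final OpenSearch API.

// Sketch only: captures a cancellation timestamp and derives the elapsed time
// that would back running_time_since_cancellation_nanos in the Tasks API.
class CancellableTaskSketch {
    private volatile long cancellationTimeNanos = -1L; // -1 => not cancelled yet

    public void cancel(String reason) {
        if (cancellationTimeNanos == -1L) {
            cancellationTimeNanos = System.nanoTime(); // proposed: record when the task was cancelled
        }
        // ... existing cancellation logic (set the flag, notify listeners) ...
    }

    public boolean isCancelled() {
        return cancellationTimeNanos != -1L;
    }

    // Elapsed time since cancellation; 0 if the task was never cancelled.
    public long getRunningTimeSinceCancellationNanos() {
        return isCancelled() ? System.nanoTime() - cancellationTimeNanos : 0L;
    }
}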

Separate monitoring thread to detect such queries

As part of this approach, we will have a monitoring thread (scheduled on the generic threadpool) within OpenSearch that runs every Y seconds and extracts all search tasks which have kept running for more than X seconds after cancellation. Here Y < X.

In every such run, this monitoring thread will fetch all the currently running search shard tasks using TaskResourceTrackingService::getResourceAwareTasks. We will then filter out tasks which have been cancelled and have been running for more than X seconds since cancellation.

X ~ 10s - Chosen to be long enough to detect tight loops within OpenSearch/Lucene and short enough to avoid missing long-running queries after cancellation. Configurable.
Y ~ 5s - With X at 10s, checking every 5s avoids polling too frequently while leaving enough buffer to absorb the execution time of the detection run itself. Configurable.
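A minimal sketch of the scheduling and filtering logic follows, assuming a hypothetical TaskView record and supplier in place of the real SearchShardTask and TaskResourceTrackingService types.

// Sketch only: run every Y seconds and count cancelled search shard tasks that
// have been running for more than X seconds since cancellation.
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

class SearchCancellationMonitorSketch {
    // Minimal view of a task for this sketch; the real service would read
    // SearchShardTask instances from TaskResourceTrackingService::getResourceAwareTasks.
    record TaskView(boolean cancelled, long runningTimeSinceCancellationNanos) {}

    private static final long X_NANOS = TimeUnit.SECONDS.toNanos(10); // threshold X, configurable
    private static final long Y_SECONDS = 5;                          // interval Y, configurable

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final AtomicLong currentLongRunningCancelledCount = new AtomicLong();

    void start(Supplier<List<TaskView>> runningSearchShardTasks) {
        scheduler.scheduleAtFixedRate(() -> {
            long count = runningSearchShardTasks.get().stream()
                .filter(TaskView::cancelled)
                .filter(t -> t.runningTimeSinceCancellationNanos() > X_NANOS)
                .count();
            currentLongRunningCancelledCount.set(count); // later surfaced via node stats
        }, Y_SECONDS, Y_SECONDS, TimeUnit.SECONDS);
    }
}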

Below are the components that will be newly introduced:

SearchCancellationMonitoringService: Runs every Y seconds, fetches all currently running SearchShardTasks, then filters the ones which have been cancelled and have kept running for more than X seconds. The count of such tasks is stored in a local variable.
SearchCancellationMonitoringStats: Maintains all stats related to the search cancellation monitoring above. A sample response for _nodes/stats/search_cancellation is shown below, followed by a sketch of the underlying counters.

{
  "_nodes": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "cluster_name": "runTask",
  "nodes": {
    "T7aqO6zaQX-lt8XBWBYLsA": {
      "timestamp": 1667409521070,
      "name": "runTask-0",
      "transport_address": "127.0.0.1:9300",
      "host": "127.0.0.1",
      "ip": "127.0.0.1:9300",
      "roles": [],
      "attributes": {
        "testattr": "test",
        "shard_indexing_pressure_enabled": "true"
      },
      "search_cancellation": {
        "search_shard_task": {
          "post_cancellation_stats": {
             "current_query_count": 12,
             "total_query_count": 100
          }
        }
      }
    }
  }
}
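Below is a sketch of how the two counters in post_cancellation_stats could be maintained. The class and method names here are hypothetical, not the final implementation.

// Sketch only: counters backing search_shard_task.post_cancellation_stats.
import java.util.concurrent.atomic.AtomicLong;

class PostCancellationStatsSketch {
    // Cancelled search shard tasks still running past the threshold right now.
    private final AtomicLong currentQueryCount = new AtomicLong();
    // Cumulative count of all such tasks observed on this node.
    private final AtomicLong totalQueryCount = new AtomicLong();

    // Called by the monitoring service after each run (every Y seconds).
    void onMonitoringRun(long newlyDetected, long currentlyRunning) {
        totalQueryCount.addAndGet(newlyDetected);
        currentQueryCount.set(currentlyRunning);
    }

    // Serialized into _nodes/stats/search_cancellation as
    // post_cancellation_stats.{current_query_count,total_query_count}.
    long getCurrentQueryCount() { return currentQueryCount.get(); }
    long getTotalQueryCount() { return totalQueryCount.get(); }
}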

Using the above approach, we will gain some insight into such queries. OpenSearch clients can also set up a custom cron job that calls the node stats API to monitor such queries if required.

Future plan

We can eventually add logic to dump such queries into a log as well (Part 2). The long-term plan is to hard cancel such queries, but that requires a deeper dive, as there seems to be no easy way to do it today.

Describe alternatives you've considered

  • The only other option is to let OpenSearch clients build a custom cron solution on top of the Task API changes mentioned above and track such APIs themselves. But if we want to extract/dump such rogue queries, that won't be possible with this approach.
  • Using the Search Backpressure Service to detect such queries. The search backpressure service runs every 1 second to identify tasks which are resource intensive and need to be cancelled, so we could potentially have used it to identify long-running cancelled tasks. But this couples our logic with search backpressure, and we would have to reuse the same backpressure stats, which doesn't seem right.

