
[RFC] Detecting search queries which have been cancelled but continues to run for long. #6953

@sgup432

Description

Co-Author: @jainankitk

Problem Statement

As of now in OpenSearch, we have a best-effort mechanism to cancel long-running search queries via the task cancel API POST _tasks/<task_id>/_cancel. We can refer to this as soft cancellation, since the mechanism works by setting a cancellation flag to true and then relying on the search thread to check it at regular intervals. As we have seen in the past, this mechanism doesn't always work: there are cases where a search query continues to run in tight Lucene loops and causes cluster instability.
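To make the cooperative nature of soft cancellation concrete, here is a minimal Java sketch of the pattern. It is illustrative only; the class and method names are assumptions, not the actual OpenSearch CancellableTask/SearchService code.

// Minimal sketch of the soft-cancellation pattern described above.
// Illustrative only: these are not the real OpenSearch classes.
import java.util.concurrent.atomic.AtomicBoolean;

class SoftCancellableSearch {
    private final AtomicBoolean cancelled = new AtomicBoolean(false);

    // Reached via POST _tasks/<task_id>/_cancel: it only flips a flag.
    void cancel() {
        cancelled.set(true);
    }

    // The search thread must cooperate by checking the flag at safe points.
    void run() {
        for (int segment = 0; segment < numSegments(); segment++) {
            if (cancelled.get()) {
                return; // cooperative exit
            }
            // If this call enters a tight Lucene loop, the flag is never
            // re-checked and the query keeps running despite cancellation.
            collectSegment(segment);
        }
    }

    private int numSegments() { return 100; }

    private void collectSegment(int segment) { /* ... */ }
}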

While we eventually want to solve this problem with a "hard" cancellation, we first want to add a mechanism to detect queries which were cancelled and still continue to run for a while. With the introduction of search backpressure, there will be many such cases, and we need a mechanism to track them. This will also help us test the effectiveness of the existing search cancellation, identify tight loops within OpenSearch/Lucene, and eventually build a case for hard cancelling rogue queries.

Note: This problem deals with detecting cancelled queries at the shard level, not at the coordinator level.

Solution

There are two parts to the solution.

  1. The first part involves detecting such long-running cancelled queries.
  2. The second part involves dumping such queries along with their stack traces for later analysis.

We will only focus on Part 1 for now.

Introduction of a new field in the Task API to track time elapsed since a query was cancelled

We will introduce a new variable, cancellationTime, in CancellableTask. Every time a task is cancelled, we will capture this timestamp, which will later be used to calculate the time elapsed after the task was cancelled. We will also expose this information in the Tasks API by introducing a new field, running_time_since_cancellation_nanos.

This will give us additional information to surface such tasks, especially for search. A sample Tasks API response is shown below, followed by a minimal sketch of the change.

"tasks" : {
    "oTUltX4IQMOUUVeiohTt8A:464" : {
    "node" : "oTUltX4IQMOUUVeiohTt8A",
    "id" : 464,
    "type" : "transport",
    "action" : "indices:data/read/search",
    "description" : "indices[test], types[test], search_type[QUERY_THEN_FETCH], source[{\"query\":...}]",
    "start_time_in_millis" : 1483478610008,
    "running_time_in_nanos" : 139913830000,
    "cancellable" : true,
    "cancelled" : true,
    "cancelled_at_millis": 1483479610008
    }
}
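Below is a hedged sketch of the proposed CancellableTask change. The field and accessor names (cancellationTimeNanos, getRunningTimeSinceCancellationNanos) are illustrative assumptions, not the final OpenSearch API.

// Sketch only: captures a cancellation timestamp and derives the elapsed time
// that would back running_time_since_cancellation_nanos in the Tasks API.
class CancellableTaskSketch {
    private volatile long cancellationTimeNanos = -1L; // -1 => not cancelled yet

    public void cancel(String reason) {
        if (cancellationTimeNanos == -1L) {
            cancellationTimeNanos = System.nanoTime(); // proposed: record when the task was cancelled
        }
        // ... existing cancellation logic (set the flag, notify listeners) ...
    }

    public boolean isCancelled() {
        return cancellationTimeNanos != -1L;
    }

    // Elapsed time since cancellation; 0 if the task was never cancelled.
    public long getRunningTimeSinceCancellationNanos() {
        return isCancelled() ? System.nanoTime() - cancellationTimeNanos : 0L;
    }
}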

Separate monitoring thread to detect such queries

As part of this approach, we will have a monitoring thread (scheduled on the generic threadpool) within OpenSearch that runs every Y seconds and extracts all search tasks which have kept running for more than X seconds after cancellation. Here Y < X.

In every such run, this monitoring thread will fetch all the currently running search shard tasks using TaskResourceTrackingService::getResourceAwareTasks. We will then filter out tasks which have been cancelled and have been running for more than X seconds since cancellation.

X ~ 10s - Chosen to be long enough to detect tight loops within OpenSearch/Lucene and short enough to avoid missing long-running queries after cancellation. Configurable.
Y ~ 5s - With X at 10s, checking every 5s avoids polling too frequently while leaving enough buffer to absorb the execution time of the detection run itself. Configurable.
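A minimal sketch of the scheduling and filtering logic follows, assuming a hypothetical TaskView record and supplier in place of the real SearchShardTask and TaskResourceTrackingService types.

// Sketch only: run every Y seconds and count cancelled search shard tasks that
// have been running for more than X seconds since cancellation.
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

class SearchCancellationMonitorSketch {
    // Minimal view of a task for this sketch; the real service would read
    // SearchShardTask instances from TaskResourceTrackingService::getResourceAwareTasks.
    record TaskView(boolean cancelled, long runningTimeSinceCancellationNanos) {}

    private static final long X_NANOS = TimeUnit.SECONDS.toNanos(10); // threshold X, configurable
    private static final long Y_SECONDS = 5;                          // interval Y, configurable

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final AtomicLong currentLongRunningCancelledCount = new AtomicLong();

    void start(Supplier<List<TaskView>> runningSearchShardTasks) {
        scheduler.scheduleAtFixedRate(() -> {
            long count = runningSearchShardTasks.get().stream()
                .filter(TaskView::cancelled)
                .filter(t -> t.runningTimeSinceCancellationNanos() > X_NANOS)
                .count();
            currentLongRunningCancelledCount.set(count); // later surfaced via node stats
        }, Y_SECONDS, Y_SECONDS, TimeUnit.SECONDS);
    }
}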

Below are the components that will be newly introduced:

SearchCancellationMonitoringService: Runs every Y seconds, fetches all currently running SearchShardTasks, then filters the ones which have been cancelled and have kept running for more than X seconds. The count of such tasks is stored in a local variable.
SearchCancellationMonitoringStats: Maintains all stats related to the search cancellation monitoring above. A sample response for _nodes/stats/search_cancellation is shown below, followed by a sketch of the underlying counters.

{
  "_nodes": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "cluster_name": "runTask",
  "nodes": {
    "T7aqO6zaQX-lt8XBWBYLsA": {
      "timestamp": 1667409521070,
      "name": "runTask-0",
      "transport_address": "127.0.0.1:9300",
      "host": "127.0.0.1",
      "ip": "127.0.0.1:9300",
      "roles": [],
      "attributes": {
        "testattr": "test",
        "shard_indexing_pressure_enabled": "true"
      },
      "search_cancellation": {
        "search_shard_task": {
          "post_cancellation_stats": {
             "current_query_count": 12,
             "total_query_count": 100
          }
        }
      }
    }
  }
}
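Below is a sketch of how the two counters in post_cancellation_stats could be maintained. The class and method names here are hypothetical, not the final implementation.

// Sketch only: counters backing search_shard_task.post_cancellation_stats.
import java.util.concurrent.atomic.AtomicLong;

class PostCancellationStatsSketch {
    // Cancelled search shard tasks still running past the threshold right now.
    private final AtomicLong currentQueryCount = new AtomicLong();
    // Cumulative count of all such tasks observed on this node.
    private final AtomicLong totalQueryCount = new AtomicLong();

    // Called by the monitoring service after each run (every Y seconds).
    void onMonitoringRun(long newlyDetected, long currentlyRunning) {
        totalQueryCount.addAndGet(newlyDetected);
        currentQueryCount.set(currentlyRunning);
    }

    // Serialized into _nodes/stats/search_cancellation as
    // post_cancellation_stats.{current_query_count,total_query_count}.
    long getCurrentQueryCount() { return currentQueryCount.get(); }
    long getTotalQueryCount() { return totalQueryCount.get(); }
}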

Using the above approach, we will gain some insight into such queries. OpenSearch clients can also set up a custom cron job that calls the node stats API to monitor such queries if required.

Future plan

We can eventually add logic to dump such queries into a log as well (Part 2). The long-term plan is to hard cancel such queries, but that requires a deeper dive, as there seems to be no easy way to do it today.

Describe alternatives you've considered

  • The only other option is to let OpenSearch clients build a custom cron solution on top of the Task API changes mentioned above and track such APIs themselves. But if we want to extract/dump such rogue queries, that won't be possible with this approach.
  • Using the Search Backpressure Service to detect such queries. The search backpressure service runs every 1 second to identify tasks which are resource intensive and need to be cancelled, so we could potentially have used it to identify long-running cancelled tasks. But this couples our logic with search backpressure, and we would have to reuse the same backpressure stats, which doesn't seem right.

