[Concurrent Segment Search] Explore different metrics/stats which will be useful with concurrent segment search #7359

Closed
sohami opened this issue May 2, 2023 · 7 comments · Fixed by #9622
Assignees
Labels
enhancement Enhancement or improvement to existing feature or request

Comments

@sohami
Collaborator

sohami commented May 2, 2023

Placeholder task to explore and add metrics that will be useful for the concurrent segment search execution model. These metrics can show: i) the performance of shard-level requests (min/max/avg latencies across requests at the index/node level), ii) how many requests used the concurrent search path vs. the sequential path, iii) the concurrency used across requests at the index/node level, etc.

@sohami sohami added enhancement Enhancement or improvement to existing feature or request untriaged labels May 2, 2023
@mch2 mch2 added the Search Search query, autocomplete ...etc label May 9, 2023
@macohen macohen removed the Search Search query, autocomplete ...etc label May 15, 2023
@jed326
Collaborator

jed326 commented Aug 23, 2023

Existing metrics:

@jed326
Collaborator

jed326 commented Aug 28, 2023

New Metrics:

| Metric | Description |
| --- | --- |
| search.concurrent_query_total | The total number of query operations using concurrent segment search. |
| search.concurrent_query_time_in_millis | The total time for all query operations using concurrent segment search, in milliseconds. |
| search.concurrent_query_current | The number of query operations using concurrent segment search that are currently running. |
| search.query_avg_concurrency | The average concurrency of query operations using concurrent segment search. |
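As a rough illustration of how the counters above could be combined on the client side, here is a minimal sketch. It assumes that `concurrent_query_total` counts a subset of the existing `query_total`, which the thread does not confirm; the class name, method names, and sample values are all made up for the example.

```java
// Hypothetical helper: these names are illustrative, not OpenSearch APIs.
final class ConcurrentSearchRatios {

    /** Mean latency of concurrent-path queries, or 0.0 if none ran. */
    static double avgConcurrentQueryMillis(long concurrentQueryTimeInMillis, long concurrentQueryTotal) {
        return concurrentQueryTotal == 0 ? 0.0 : (double) concurrentQueryTimeInMillis / concurrentQueryTotal;
    }

    /** Share of all queries that took the concurrent path (assumes concurrent <= total). */
    static double concurrentShare(long concurrentQueryTotal, long queryTotal) {
        return queryTotal == 0 ? 0.0 : (double) concurrentQueryTotal / queryTotal;
    }

    public static void main(String[] args) {
        // Made-up sample values, not taken from a real cluster.
        System.out.println(avgConcurrentQueryMillis(1200, 300)); // prints 4.0
        System.out.println(concurrentShare(300, 400));           // prints 0.75
    }
}
```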

Sample requests for reference (without new metrics):

curl -X GET "localhost:9200/my-index-000001/_stats/search?pretty"
{
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_all" : {
    "primaries" : {
      "search" : {
        "open_contexts" : 0,
        "query_total" : 0,
        "query_time_in_millis" : 0,
        "query_current" : 0,
        "fetch_total" : 0,
        "fetch_time_in_millis" : 0,
        "fetch_current" : 0,
        "scroll_total" : 0,
        "scroll_time_in_millis" : 0,
        "scroll_current" : 0,
        "point_in_time_total" : 0,
        "point_in_time_time_in_millis" : 0,
        "point_in_time_current" : 0,
        "suggest_total" : 0,
        "suggest_time_in_millis" : 0,
        "suggest_current" : 0
      }
    },
    "total" : {
      "search" : {
        "open_contexts" : 0,
        "query_total" : 0,
        "query_time_in_millis" : 0,
        "query_current" : 0,
        "fetch_total" : 0,
        "fetch_time_in_millis" : 0,
        "fetch_current" : 0,
        "scroll_total" : 0,
        "scroll_time_in_millis" : 0,
        "scroll_current" : 0,
        "point_in_time_total" : 0,
        "point_in_time_time_in_millis" : 0,
        "point_in_time_current" : 0,
        "suggest_total" : 0,
        "suggest_time_in_millis" : 0,
        "suggest_current" : 0
      }
    }
  },
  "indices" : {
    "my-index-000001" : {
      "uuid" : "FQKBJoW9T-KdlI8KHLCThA",
      "primaries" : {
        "search" : {
          "open_contexts" : 0,
          "query_total" : 0,
          "query_time_in_millis" : 0,
          "query_current" : 0,
          "fetch_total" : 0,
          "fetch_time_in_millis" : 0,
          "fetch_current" : 0,
          "scroll_total" : 0,
          "scroll_time_in_millis" : 0,
          "scroll_current" : 0,
          "point_in_time_total" : 0,
          "point_in_time_time_in_millis" : 0,
          "point_in_time_current" : 0,
          "suggest_total" : 0,
          "suggest_time_in_millis" : 0,
          "suggest_current" : 0
        }
      },
      "total" : {
        "search" : {
          "open_contexts" : 0,
          "query_total" : 0,
          "query_time_in_millis" : 0,
          "query_current" : 0,
          "fetch_total" : 0,
          "fetch_time_in_millis" : 0,
          "fetch_current" : 0,
          "scroll_total" : 0,
          "scroll_time_in_millis" : 0,
          "scroll_current" : 0,
          "point_in_time_total" : 0,
          "point_in_time_time_in_millis" : 0,
          "point_in_time_current" : 0,
          "suggest_total" : 0,
          "suggest_time_in_millis" : 0,
          "suggest_current" : 0
        }
      }
    }
  }
}
curl "localhost:9200/_nodes/stats/indices/search?pretty"
{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "runTask",
  "nodes" : {
    "5zw1q4MxTFyoobBDhUESEQ" : {
      "timestamp" : 1693249140022,
      "name" : "runTask-0",
      "transport_address" : "127.0.0.1:9300",
      "host" : "127.0.0.1",
      "ip" : "127.0.0.1:9300",
      "roles" : [
        "cluster_manager",
        "data",
        "ingest",
        "remote_cluster_client"
      ],
      "attributes" : {
        "testattr" : "test",
        "shard_indexing_pressure_enabled" : "true"
      },
      "indices" : {
        "search" : {
          "open_contexts" : 0,
          "query_total" : 0,
          "query_time_in_millis" : 0,
          "query_current" : 0,
          "fetch_total" : 0,
          "fetch_time_in_millis" : 0,
          "fetch_current" : 0,
          "scroll_total" : 0,
          "scroll_time_in_millis" : 0,
          "scroll_current" : 0,
          "point_in_time_total" : 0,
          "point_in_time_time_in_millis" : 0,
          "point_in_time_current" : 0,
          "suggest_total" : 0,
          "suggest_time_in_millis" : 0,
          "suggest_current" : 0
        }
      }
    }
  }
}

Reference PRs for PIT changes:

@reta
Collaborator

reta commented Aug 29, 2023

Thanks @sohami, two more to suggest (the naming could be better expressed):

| Metric | Description |
| --- | --- |
| search.concurrent_pool_queue_size | The queue size of the index searcher pool |
| search.concurrent_pool_wait_time | The amount of time the index searcher tasks spend in queue vs being scheduled right away |

I think these metrics should help with sizing the index searcher thread pool properly.

@jed326
Collaborator

jed326 commented Aug 29, 2023

@reta thanks for the suggestion! It seems like these metrics should go under thread_pool metrics instead of under search metrics. I do agree that they would both be useful, though. What do you think?

@sohami
Collaborator Author

sohami commented Aug 29, 2023

> It seems like these metrics should go under thread_pool metrics instead of under search metrics

Thread pool queue size stats are already available for all thread pools via the _cat/thread_pool API. For pool_wait_time, I like the idea of adding it to the thread_pool metrics so that it is available for all pools and not just for search.

@jed326
Collaborator

jed326 commented Aug 30, 2023

Neither search.query_avg_concurrency nor thread_pool.pool_wait_time is straightforward to implement.

search.query_avg_concurrency

The MeanMetric class gives us an easy way to compute the mean within a shard, but the stats get summed up across shards in the total result:

public CommonStats getTotal() {
    if (total != null) {
        return total;
    }
    CommonStats stats = new CommonStats();
    for (ShardStats shard : shards) {
        stats.add(shard.getStats());
    }
    total = stats;
    return stats;
}

This makes it difficult to compute the average concurrency across all of the shards, for two reasons. First, we only want to consider shards with an average concurrency > 0, because average concurrency only counts requests that used concurrent search. Second, we need some way to track how many shards in the overall response have a value > 0 and factor that count into the result.
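One possible way around both problems, sketched here as an alternative rather than as what was decided in this thread, is to keep two additive counters per shard (the number of concurrent queries and the sum of their concurrency) and derive the average only at read time. Summing across shards then stays correct, and shards with no concurrent queries contribute nothing. Note that this yields a query-weighted average, not an average of per-shard averages. All names below are hypothetical and do not exist in OpenSearch.

```java
// Hypothetical sketch: ConcurrencyStats is an illustrative name, not an OpenSearch class.
final class ConcurrencyStats {
    private long concurrentQueryCount; // queries that took the concurrent path
    private long concurrencySum;       // sum of the concurrency used by those queries

    /** Record one concurrent-search query that ran with the given concurrency. */
    void record(int concurrency) {
        concurrentQueryCount++;
        concurrencySum += concurrency;
    }

    /** Merge another shard's counters; both fields are plain sums, so this is additive. */
    void add(ConcurrencyStats other) {
        concurrentQueryCount += other.concurrentQueryCount;
        concurrencySum += other.concurrencySum;
    }

    /** Average derived at read time; 0.0 when no concurrent queries were recorded. */
    double average() {
        return concurrentQueryCount == 0 ? 0.0 : (double) concurrencySum / concurrentQueryCount;
    }
}

final class ConcurrencyStatsDemo {
    public static void main(String[] args) {
        ConcurrencyStats shard0 = new ConcurrencyStats();
        shard0.record(2);
        shard0.record(4);
        ConcurrencyStats shard1 = new ConcurrencyStats(); // no concurrent queries on this shard
        ConcurrencyStats shard2 = new ConcurrencyStats();
        shard2.record(6);

        // Mimics the summing loop in getTotal(): add every shard into one total.
        ConcurrencyStats total = new ConcurrencyStats();
        total.add(shard0);
        total.add(shard1);
        total.add(shard2);
        System.out.println(total.average()); // (2 + 4 + 6) / 3 queries, prints 4.0
    }
}
```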

thread_pool.pool_wait_time

The existing thread_pool metrics come from the ThreadPoolExecutor class in java.util.concurrent. See https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/concurrent/ThreadPoolExecutor.html for more details.

if (holder.executor() instanceof ThreadPoolExecutor) {
    ThreadPoolExecutor threadPoolExecutor = (ThreadPoolExecutor) holder.executor();
    threads = threadPoolExecutor.getPoolSize();
    queue = threadPoolExecutor.getQueue().size();
    active = threadPoolExecutor.getActiveCount();
    largest = threadPoolExecutor.getLargestPoolSize();
    completed = threadPoolExecutor.getCompletedTaskCount();
    RejectedExecutionHandler rejectedExecutionHandler = threadPoolExecutor.getRejectedExecutionHandler();
    if (rejectedExecutionHandler instanceof XRejectedExecutionHandler) {
        rejected = ((XRejectedExecutionHandler) rejectedExecutionHandler).rejected();
    }
}
stats.add(new ThreadPoolStats.Stats(name, threads, queue, active, rejected, largest, completed));

The executor class does not provide wait time, so we would need to implement our own wait time calculation. This is a fairly involved change that affects all thread pools, so I will create a separate issue to track it; I do believe wait time is a valuable metric to have.
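To make the idea of a custom wait time calculation concrete, here is a minimal sketch using only `java.util.concurrent`. It assumes wait time means the interval between a task being handed to `execute` and a worker thread picking it up; the class names are hypothetical, and this is not how OpenSearch implements it. Each task is wrapped with its submission timestamp, and the `beforeExecute` hook of `ThreadPoolExecutor` accumulates the elapsed time.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: class and method names are illustrative, not OpenSearch APIs.
final class WaitTimeTrackingExecutor extends ThreadPoolExecutor {

    /** Wraps a task together with the time it was handed to the executor. */
    private static final class TimedTask implements Runnable {
        final Runnable delegate;
        final long submittedAtNanos;

        TimedTask(Runnable delegate, long submittedAtNanos) {
            this.delegate = delegate;
            this.submittedAtNanos = submittedAtNanos;
        }

        @Override
        public void run() {
            delegate.run();
        }
    }

    private final AtomicLong totalWaitNanos = new AtomicLong();

    WaitTimeTrackingExecutor(int threads) {
        super(threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    }

    @Override
    public void execute(Runnable task) {
        // Stamp the task on entry; submit() also funnels through execute().
        super.execute(new TimedTask(task, System.nanoTime()));
    }

    @Override
    protected void beforeExecute(Thread thread, Runnable task) {
        // Runs on the worker thread just before the task starts.
        if (task instanceof TimedTask) {
            totalWaitNanos.addAndGet(System.nanoTime() - ((TimedTask) task).submittedAtNanos);
        }
        super.beforeExecute(thread, task);
    }

    long totalWaitMillis() {
        return TimeUnit.NANOSECONDS.toMillis(totalWaitNanos.get());
    }
}

final class WaitTimeDemo {
    /** Runs two 100 ms tasks on one thread, so the second one has to queue. */
    static long measure() {
        WaitTimeTrackingExecutor executor = new WaitTimeTrackingExecutor(1);
        Runnable sleeper = () -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        executor.execute(sleeper);
        executor.execute(sleeper);
        executor.shutdown();
        try {
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return executor.totalWaitMillis();
    }

    public static void main(String[] args) {
        System.out.println("total wait ms: " + measure());
    }
}
```

Because tasks are wrapped, calls such as `remove(Runnable)` would no longer match the original task object, which is one of the reasons a change like this needs care when applied to every pool.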

@jed326
Collaborator

jed326 commented Aug 30, 2023
