
[RFC] High Level Vision for Core Search in OpenSearch #8879

@jainankitk

Description

Introduction

This issue proposes a high level vision for core search within OpenSearch. While the initial common themes are identified from customer feedback for Amazon OpenSearch Service, any comments about the overall direction are welcome to make it applicable to more diverse use cases.

Background

The search component forms the core of applications built on OpenSearch. The ability to search data efficiently is critical enough to influence the data storage format during indexing, keeping search first. Compared to traditional search engines, OpenSearch supports insightful aggregations in addition to filtering, allowing customers to do much more and visualize their data better.

Common Themes

  • Resiliency - Search resiliency is one of the key pain points for customers; a single query or a handful of queries can bring down availability of the whole cluster
  • Performance - The performance aspect can be categorized into single query execution latency and overall throughput of the cluster (ability to handle more load with fewer resources)
  • Visibility - Customers have limited to no visibility into query execution today. Root causing issues related to query latency is very difficult and time consuming
  • Control & Management - It is hard to manage resources for search alone as it is strongly coupled with the indexing, merge, and snapshot components. These competing components can impact each other adversely, limiting our ability to make better backpressure / admission control decisions
  • New Features - While continuing to improve the performance / resiliency of the existing product, we need to offer new features like Query Prioritization & Scheduling

Proposal

I am capturing below some of the projects that can be prioritized or are already being worked on in other RFCs. Feel free to link any other ideas / projects already in flight for Core Search.

Resiliency - The cluster can go down even with a single query today, which should not be allowed. Recovery from node failures in OpenSearch is an expensive process and can lead to cascading failures. Hence, it makes sense to aggressively cancel/kill problematic queries to ensure the node stays up. Search backpressure is a good first step in the direction of making search more robust. The key component is lightweight resource tracking that can be leveraged to do the following (a settings sketch for today's search backpressure follows this list):

  • Query Sandboxing - We have been discussing sandboxing for restricting the resource footprint in the context of a single query, but sandboxing makes more sense at the bucket level. We can define per-bucket thresholds, configurable as a cluster setting, to preemptively kill a query from the bucket for which the threshold is hit. Every query belongs to the default bucket and can additionally belong to one of the customer-defined custom buckets. This will be particularly helpful for multi-tenant/multi-user workloads where rogue queries from one customer can impact other customers. The work is being tracked as part of RFC - [RFC] Query Sandboxing for search requests  #11061
  • Hard query cancellation - The current implementation of search backpressure relies on acknowledgement of the cancellation signal by the problematic query. However, in many cases that does not happen. Hence, we need a better query cancellation mechanism. The plan is to first track how many such cases there are, using search backpressure as a reference, and identify the common code paths where query cancellation gets stuck. The work is being tracked as part of RFC - [RFC] Detecting search queries which have been cancelled but continues to run for long. #6953
  • Query Admission control - We have basic admission control today in the form of the per-node search queue, which operates at the shard level with a fixed size, and cluster-level request rejection when a node is under duress. We need to improve this mechanism, as a single query can hit 10 / 100 / 1000 shards. The admission control mechanism needs to be adaptive, accounting for query cost
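
Below is a minimal sketch of the existing search backpressure knob that the ideas above would build on, using the opensearch-py client against an assumed local cluster; the setting name reflects recent OpenSearch versions and the exact thresholds available may vary by release.

```python
# Hedged sketch: assumes opensearch-py is installed and a cluster is reachable
# at localhost:9200. Search backpressure ships in recent OpenSearch versions;
# "monitor_only" is the default mode.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Move search backpressure from monitoring to actively cancelling the most
# resource-heavy search shard tasks when a node is under duress.
client.cluster.put_settings(body={
    "persistent": {
        "search_backpressure.mode": "enforced"
    }
})
```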

Performance - We need to look at ways of inherently improving query performance. This becomes even more critical with the decoupling of storage and compute, as remote storage performance is nowhere close to its hot counterpart.

  • Disk based caching - OpenSearch has multiple caches (query cache, fielddata cache) readily available for quick access. But these caches reside in the JVM heap, so space is limited and eviction kicks in. Instead, we can overflow the cache to disk, which is really useful for time-series workloads as old data seldom changes. Work is being tracked as part of [RFC] Proposal for a Disk-based Tiered Caching Mechanism in OpenSearch #9001
  • Query rewriting - Query performance can be significantly improved by providing query execution hints to prefer some data structures over others. For example, fetching a few fields from doc values instead of stored fields or _source is much more efficient, reading less data from disk and avoiding expensive data decompression; or using the map execution_hint instead of global ordinals for a terms aggregation (a request sketch follows this list). RFC - [RFC] Query Planning and Rewriting #12390
  • Concurrent segment search - Every shard can have multiple segments, and they are searched sequentially by default. Lucene recently added support for searching those segments concurrently. That should give a significant boost to individual queries, and work is being tracked as part of Concurrent Segment Search Overview and Aggs HLD #6798
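
To make the query rewriting and concurrent segment search bullets concrete, here is a hand-written sketch of the hints a rewriter could apply automatically, using opensearch-py against an assumed local cluster; the index pattern logs-*, the status / host / @timestamp fields, and the concurrent segment search setting name are assumptions for illustration and may differ across versions.

```python
# Hedged sketch: assumes opensearch-py, a cluster at localhost:9200, and an
# illustrative logs-* index with keyword fields "status" and "host".
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Hand-written version of the hints a query rewriter could apply on its own:
# skip _source and stored fields, read the needed columns from doc values,
# and use the "map" execution hint instead of global ordinals for a terms
# aggregation over a small filtered subset.
response = client.search(
    index="logs-*",
    body={
        "_source": False,
        "docvalue_fields": ["@timestamp", "status"],
        "query": {"term": {"status": "error"}},
        "aggs": {
            "by_host": {
                "terms": {"field": "host", "execution_hint": "map"}
            }
        },
    },
)
print(response["took"], response["hits"]["total"])

# Concurrent segment search is exposed as a dynamic cluster setting in recent
# 2.x releases (the exact setting name may vary by version).
client.cluster.put_settings(body={
    "persistent": {"search.concurrent_segment_search.enabled": True}
})
```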

Visibility - Customers have very limited to no visibility into query execution today. Root causing issues related to query latency is very difficult and time consuming.

  • Coordinator slow logs - Slow logs currently operate at the shard level, which does not necessarily (mostly) translate to end-customer request latency. They do not differentiate at all between a single query hitting 10 shards vs 1000 shards. That makes it really hard for customers to come up with the right thresholds for enabling and using slow logs (a sketch of today's shard-level settings follows this list). Ideally, they should be able to directly correlate the thresholds with their SLAs and figure out the offenders. This feature, along with query tracing, can be very insightful for our customers. The work is being tracked as part of [RFC] Tracking Search latency and stats at Coordinator node level #7334
  • Query Tracing - One of the most common challenges for customers is identifying why only a few queries ran with higher latencies. The aggregated node-level metrics do not reveal much when only a fraction of queries show higher latencies. Customers should be able to define varying degrees of tracing, to see if there are latency issues within specific clauses of query execution, and apply optimizations at the query level
  • Latency breakdown metrics - The latency metrics exposed at the node level today are just indexing and search. An increase in search latency hardly gives any clue about what the culprit might be (prefilter, query, fetch or aggregation). It also fails to give any clue about the kind of workload it is. Hence, the proposal is to break down the single monolithic took_time into the individual time taken across the multiple phases. The work is being tracked as part of [RFC] Tracking Search latency and stats at Coordinator node level #7334
  • Query profiling - Query profiling is a powerful tool giving useful insights into the execution plan of a query and the breakdown of took time across the different clauses of the query. However, it only works well for the query phase of execution and provides very limited insights into the fetch and aggregation phases. Enhancing this framework allows users to get deeper insights into problematic query execution and the associated costs. Add fetch phase to search profile #1764
  • Query Planning - The planner provides a peek into the query execution plan along with estimated costs. The cost can include both time and resources like CPU/Memory/IO. Hence, the preferred plan after semantic analysis of the query can be optimized for time or even CPU/Memory/IO
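
For context on the shard-level tooling that exists today (and why coordinator-level visibility is needed), a minimal sketch using opensearch-py: per-index slow log thresholds and the profile API, which mostly covers the query phase. The cluster endpoint and the logs-2023.07 index name are assumptions for illustration.

```python
# Hedged sketch: assumes opensearch-py, a cluster at localhost:9200, and an
# illustrative index named logs-2023.07.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Today's slow logs are configured per index and fire per shard, which is why
# the thresholds are hard to map to an end-to-end request latency SLA.
client.indices.put_settings(
    index="logs-2023.07",
    body={
        "index.search.slowlog.threshold.query.warn": "2s",
        "index.search.slowlog.threshold.fetch.warn": "500ms",
    },
)

# The profile API gives a per-shard breakdown, but mostly for the query phase;
# the fetch and aggregation phases get limited coverage today.
response = client.search(
    index="logs-2023.07",
    body={
        "profile": True,
        "query": {"match": {"message": "timeout"}},
    },
)
for shard in response["profile"]["shards"]:
    print(shard["id"])
```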

New Features

  • Query Prioritization - Different search queries have different priorities and implications. A search query triggered from a customer-facing interface is much more critical than one triggered by an operator sitting at the backend. Currently the queries are processed FIFO, which is not ideal when the cluster is under load and queues are backing up. OpenSearch has a notion of priority for cluster state update tasks. We can use something similar at the coordinator and node level to treat a query as high/medium/low priority and process it accordingly. Customers can set a cluster default and override the default for a specific set of queries. Query prioritisation support #1017

  • Query scheduling - Not all queries are the same. Some queries are expensive but don't need results immediately. Customers might want to schedule some of their expensive queries during low-traffic periods and look at the results later. We need to build some logic for scheduling the queries, and we can probably leverage a mechanism similar to asynchronous search for saving the results (a sketch of the existing asynchronous search API follows below)
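
As a reference point for saving results and retrieving them later, here is a minimal sketch of the existing asynchronous search plugin API; it assumes the plugin is installed and uses a raw transport call since opensearch-py has no dedicated helper for it. The endpoint and parameters follow the plugin's documented API; the index pattern and fields are illustrative.

```python
# Hedged sketch: assumes opensearch-py, a cluster at localhost:9200, and the
# asynchronous search plugin installed on the cluster.
import json

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Submit an expensive query, return immediately, and keep the results so they
# can be retrieved later by id.
submit = client.transport.perform_request(
    "POST",
    "/_plugins/_asynchronous_search",
    params={
        "index": "logs-*",                     # illustrative; omit to search all indexes
        "wait_for_completion_timeout": "1ms",  # do not block on the response
        "keep_on_completion": "true",
        "keep_alive": "12h",
    },
    body={
        "query": {"match_all": {}},
        "aggs": {"by_host": {"terms": {"field": "host"}}},
    },
)
search_id = submit["id"]

# Later (e.g. after the low-traffic window), fetch the stored results.
result = client.transport.perform_request(
    "GET", "/_plugins/_asynchronous_search/" + search_id
)
print(result.get("state"), json.dumps(result.get("response", {}))[:200])
```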

Next Steps
We will incorporate the feedback from this RFC into a more detailed proposal / high level design for the different themes. We will then create meta issues to go into more depth on the detailed design.
