Description
The current implementation of cross-cluster search treats all remote shards as if they were local: the coordinating node first figures out where all the shards are (remote plus local), then fans out to all of them. From then on the search operation works the same as if all shards were local (i.e. query then fetch). When a significant number of shards is involved in the request, this can cause problems, especially if the connection between the coordinating node and the remote clusters is not fast enough. In fact, the coordinating node sends as many shard-level requests as there are shards involved in the search request, which causes a significant number of round-trips between the coordinating node and the remote clusters.
This meta issue outlines the steps needed to implement an alternate cross-cluster search execution mode that reduces the required round-trips, which should improve performance and scalability in certain scenarios. This task is currently at POC stage, hence we don't know if/when such changes will be released. The idea is to allow for multiple coordination steps: the CCS coordinating node sends out only one request per remote cluster, each remote cluster's own coordinating node returns enough information for the last coordination step to be performed, which consists of reducing the results coming from the different clusters into a single result set.
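As a back-of-the-envelope illustration of the difference in fan-out (cluster names and shard counts below are made up):

```python
def requests_sent(shards_per_cluster, minimize_roundtrips):
    """Number of requests the CCS coordinating node sends over the wire."""
    if minimize_roundtrips:
        # Alternate mode: one search request per remote cluster; each
        # cluster's own coordinating node fans out to its shards locally.
        return len(shards_per_cluster)
    # Ordinary mode: one shard-level request per involved shard.
    return sum(shards_per_cluster.values())

clusters = {"cluster_a": 50, "cluster_b": 120, "cluster_c": 30}
print(requests_sent(clusters, minimize_roundtrips=False))  # 200
print(requests_sent(clusters, minimize_roundtrips=True))   # 3
```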
Trade-off: as we reduce the number of round-trips, the single response returned from each remote cluster needs to contain more info than a normal search response. Pagination requires `from + size` results to be requested from each cluster, terms aggregations need a mechanism similar to `shard_size`, per cluster, in order not to lose accuracy, and hits will be fetched (at least initially) in each remote cluster, meaning that most of the documents fetched will not be returned to the end user.
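The pagination trade-off can be sketched as follows (a simplified model where hits are plain dicts sorted by descending score, not the actual Elasticsearch types):

```python
import heapq

def page_across_clusters(cluster_hits, from_, size):
    # Each cluster must return its own top (from_ + size) hits, because any
    # of them could land in the requested page after the global merge.
    per_cluster_top = [hits[: from_ + size] for hits in cluster_hits]
    # Merge the per-cluster results, highest score first.
    merged = heapq.merge(*per_cluster_top, key=lambda h: -h["score"])
    return list(merged)[from_ : from_ + size]

cluster_a = [{"score": 9.0}, {"score": 7.0}, {"score": 5.0}, {"score": 3.0}]
cluster_b = [{"score": 8.0}, {"score": 6.0}, {"score": 4.0}, {"score": 2.0}]
print(page_across_clusters([cluster_a, cluster_b], from_=2, size=2))
# [{'score': 7.0}, {'score': 6.0}]
```

Note that each cluster returned four hits but only two reach the end user, which is exactly the fetch overhead described above.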
The first phase of the POC focuses on splitting the CCS request into one search request per cluster, and being able to merge the resulting search responses into one to be returned to the end user. The alternate execution mode will use the transport layer like CCS already does. This allows us to optionally serialize additional information as part of a search response without affecting the REST response of the search API. The following are the identified steps needed so far:
- Benchmark the preliminary alternate execution mode and identify which scenarios are positively impacted and which are better off using the ordinary execution mode
- Add missing info to the `SearchResponse` returned by each cluster: no change is needed for sort by score, as the response already holds all the needed info. When sorting by field, we need to reconstruct the `SortField[]` (to rebuild the `TopFieldDocs`), which depends on mappings in the remote clusters. We also need the collapse field and collapse values in case field collapsing is used; these could be deduced by iterating all the hits and their fields, but it's better to have them summarized in the `SearchHits` like `CollapseTopFieldDocs` does. Also, for each hit we need the raw sort values, from before they went through `DocValueFormat#format`. These bits of information need to be added to the `SearchResponse`. One option could be to optionally serialize `TopDocs` entirely from each cluster. Given though that most of the needed information is already available in other places, the size of `TopDocs`, and the fact that it most likely cannot be used as-is in the coordinating node but needs some modifications, we could instead serialize only the missing info as part of the `SearchResponse` (Add sort and collapse info to SearchHits transport serialization #36555 & Add raw sort values to SearchSortValues transport serialization #36617)
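A minimal sketch of why the raw sort values matter when merging field-sorted hits (dict-based stand-ins for the real classes):

```python
import heapq

def merge_field_sorted_hits(per_cluster_hits, size, reverse=False):
    # Compare the raw (pre-DocValueFormat#format) sort values; comparing the
    # formatted string representations could yield a different, wrong order
    # (e.g. dates rendered in different patterns per cluster).
    merged = heapq.merge(*per_cluster_hits,
                         key=lambda hit: hit["raw_sort_values"],
                         reverse=reverse)
    return list(merged)[:size]

a = [{"_id": "a1", "raw_sort_values": (1546300800000,)},
     {"_id": "a2", "raw_sort_values": (1546387200000,)}]
b = [{"_id": "b1", "raw_sort_values": (1546344600000,)}]
print([h["_id"] for h in merge_field_sorted_hits([a, b], size=3)])
# ['a1', 'b1', 'a2']
```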
- Make sure that all clusters involved use the same current time value when executing time-based queries. This should be defined on the coordinating node and serialized to the remote clusters (Add support for providing absolute start time to SearchRequest #37142)
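The idea can be sketched as follows (the `absolute_start_millis` field name here is hypothetical; see the linked PR for the actual change):

```python
import time

def build_cluster_requests(query, cluster_aliases, now_millis=None):
    # "now" is resolved exactly once, on the CCS coordinating node, and
    # shipped with every per-cluster request.
    if now_millis is None:
        now_millis = int(time.time() * 1000)
    return {alias: {"query": query, "absolute_start_millis": now_millis}
            for alias in cluster_aliases}

reqs = build_cluster_requests({"range": {"ts": {"gte": "now-1h"}}},
                              ["cluster_a", "cluster_b"])
# Every cluster resolves date math against the same instant, even if the
# per-cluster requests execute at slightly different wall-clock times.
assert (reqs["cluster_a"]["absolute_start_millis"]
        == reqs["cluster_b"]["absolute_start_millis"])
```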
- Make it possible to disable the single-shard optimization when determining `shard_size` for certain types of aggregations (see https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/search/aggregations/bucket/BucketUtils.java#L42). Although we may be executing against a single shard per remote cluster, we still need to merge results coming from multiple clusters, hence we must make sure that we always collect enough info from each shard and never optimize for the single-shard scenario (Remove single shard optimization when suggesting shard_size #37041)
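Roughly, the heuristic looks like this (a sketch approximating `BucketUtils`; treat the exact constants as an assumption, not the authoritative formula):

```python
def suggest_shard_size(size, single_shard, allow_single_shard_optimization=True):
    if single_shard and allow_single_shard_optimization:
        # A lone shard sees all the data: its top `size` buckets are exact.
        return size
    # Otherwise over-request per shard so the merge does not lose buckets.
    return int(size * 1.5 + 10)

print(suggest_shard_size(10, single_shard=True))   # 10 (optimized)
print(suggest_shard_size(10, single_shard=True,
                         allow_single_shard_optimization=False))  # 25
```

Under remote reduction the optimization must be disabled even for a single-shard cluster, because that cluster's results will still be merged with results from other clusters.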
- Add a way to provide an index prefix to each search request, so that the `_index` field contains the local index name prefixed with the cluster alias. Note that this info is not known locally, hence it needs to be provided by the CCS coordinating node. This is needed for scenarios where users aggregate on the `_index` field, or when errors from each cluster are returned and the index metadata in the exception must contain the cluster alias to indicate where the error comes from (Add support for local cluster alias to SearchRequest #36997)
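For illustration, following the usual `cluster:index` CCS naming convention:

```python
def qualify_index(cluster_alias, index_name):
    # The CCS coordinating node supplies the alias; a remote cluster does
    # not know under which alias the caller registered it.
    return f"{cluster_alias}:{index_name}" if cluster_alias else index_name

print(qualify_index("cluster_a", "logs-2019-01"))  # cluster_a:logs-2019-01
print(qualify_index("", "logs-2019-01"))           # logs-2019-01 (local cluster)
```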
- Add support for disabling the final reduction phase for aggregations: terms aggregations should not be truncated, and pipeline aggs should not be executed locally in each cluster. The final reduction phase needs to be executed later by the CCS coordinating node once it has received all the results coming from the remote clusters involved (Skip final reduction if SearchRequest holds a cluster alias #37000)
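A toy model of why the per-cluster reduction must not prune (plain term-to-count dicts stand in for the real aggregation objects):

```python
from collections import Counter

def reduce_terms(bucket_lists, size, final):
    merged = Counter()
    for buckets in bucket_lists:
        merged.update(buckets)
    if not final:
        # Non-final (per-cluster) reduction: keep every bucket and run no
        # pipeline aggregations; pruning here could drop a globally-top term.
        return dict(merged)
    # Final reduction on the CCS node: prune to the requested size.
    return dict(merged.most_common(size))

cluster_1 = [{"a": 5, "b": 4}, {"c": 3}]  # two shards in cluster 1
cluster_2 = [{"c": 4, "d": 6}]            # one shard in cluster 2
partials = [reduce_terms(cluster_1, size=2, final=False),
            reduce_terms(cluster_2, size=2, final=False)]
print(reduce_terms(partials, size=2, final=True))  # {'c': 7, 'd': 6}
```

Had cluster 1 pruned its partial result to its local top two terms (`a` and `b`), term `c` would have lost counts and the final top two would come out wrong.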
- Allow skipping the max number of buckets check in non-final reduction phases, as the limit is expected to be reached much more easily when aggregations are reduced but not pruned (Enforce max_buckets limit only in the final reduction phase #36152)
- Add support for merging multiple `SearchResponse` objects, which contain all the needed info (added in the steps above), into one (Add support for merging multiple search responses into one #37566)
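A sketch of the response-level part of that merge (the response shape below is a simplified stand-in for `SearchResponse`, not its real structure):

```python
def merge_responses(responses):
    # Additive fields are summed; extremes are taken where appropriate.
    return {
        "total_hits": sum(r["total_hits"] for r in responses),
        "max_score": max(r["max_score"] for r in responses),
        "timed_out": any(r["timed_out"] for r in responses),
        "_shards": {
            "total": sum(r["_shards"]["total"] for r in responses),
            "successful": sum(r["_shards"]["successful"] for r in responses),
            "failed": sum(r["_shards"]["failed"] for r in responses),
        },
    }

merged = merge_responses([
    {"total_hits": 120, "max_score": 1.2, "timed_out": False,
     "_shards": {"total": 5, "successful": 5, "failed": 0}},
    {"total_hits": 80, "max_score": 2.0, "timed_out": False,
     "_shards": {"total": 3, "successful": 2, "failed": 1}},
])
print(merged["total_hits"], merged["_shards"]["successful"])  # 200 7
```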
- Add support for the alternate execution mode to `TransportSearchAction`, with one request per cluster, in which each cluster performs remote reduction. The coordinating node makes one search call per cluster, merges the obtained search responses into one, then returns it (Streamline skip_unavailable handling #37672 & Introduce ability to minimize round-trips in CCS #37828)
- Enable remote reduction automatically, and add an option to opt out in the search request (Introduce ability to minimize round-trips in CCS #37828)
- Introduce tests that compare results obtained from the two CCS execution modes and make sure that the output is exactly the same (Add integration tests to verify CCS output #40038)
Limitations (in all of these cases we automatically fall back to ordinary CCS):
- scroll is not supported
- field collapsing is supported, but inner_hits cannot be properly retrieved. Inner hits are currently fetched by each cluster's coordinating node as part of the fetch phase, meaning that each remote cluster only sees its own hits. We may want to disable fetching in the remote clusters and postpone it to the last coordination step, but that's a bigger change which will be evaluated later. Note that if fetching were done by the CCS node, it would no longer be true that a single request is sent to each cluster, as fetching documents would require additional requests, which may end up defeating the whole point of this improvement: sending one single request per cluster
- dfs_query_then_fetch outputs different scores: the dfs phase is run and its results are used to make scoring in the query phase more accurate. If each cluster runs its own dfs phase based on its local data, and its own query phase based on those results, scoring differs across clusters.
- more_like_this query is not supported when like/unlike items need to be fetched. The more_like_this query needs to be refactored so that it gets rewritten properly on the coordinating node, which would mean that like/unlike items would be fetched from the coordinating node rather than on the shards. The ordinary execution mode is subject to the exact same limitation, though.
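To see why per-cluster dfs does not restore globally comparable scores, consider a BM25-style smoothed inverse document frequency computed from local versus combined statistics (a sketch of the formula Lucene's similarity uses; the numbers are made up):

```python
import math

def idf(doc_count, docs_with_term):
    # BM25-style smoothed inverse document frequency.
    return math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))

# The same term is rare in cluster A but common in cluster B:
local_a = idf(1000, 10)      # high IDF: the term looks very selective
local_b = idf(1000, 500)     # low IDF: the term looks almost like a stopword
global_idf = idf(2000, 510)  # what a true cross-cluster dfs would compute

assert local_a > global_idf > local_b  # three genuinely different weights
```

With remote reduction, each cluster's dfs phase only aggregates statistics from its own shards, so clusters A and B weigh the same term with `local_a` and `local_b` respectively instead of `global_idf`, and the merged hits are not scored on a comparable scale the way a single-cluster dfs_query_then_fetch would ensure.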