Discussion on combination of result sets from different query types #4557

Closed
@jmazanec15

Description

Recently, in k-NN, we have received interest in combining results from text-matching queries with k-NN queries (ref). I wanted to start a discussion about result combination and the problems with doing it when scores are computed at the shard level.

From my understanding, the only way to combine scores from different queries today is through a boolean query. In a boolean query, multiple query clauses are provided in the should or must section and their scores are combined via addition. On top of this, function scoring makes it possible to manipulate each clause's score further before the scores are added.

That being said, I see a few limitations with this approach.

First, it is difficult to combine result sets whose scores are produced via different scoring methods. To combine results effectively, the different queries' scores would need to be put on the same scale. By this, I mean that a score would need to meet 2 requirements: (1) indicate a document's relevance relative to the other documents scored in the same query and (2) be comparable with the relative relevance of results from other queries. For example, for k-NN, the score range may be 0-1 while BM25 scores would range from 0 to Float.MAX_VALUE (I think). With this, it would be difficult to figure out an effective way to weight each result appropriately. One way to do this would be to normalize the scores before combining them. Normalization might be possible through rescoring, but this would happen at the shard level, which could cause problems when results are combined. For instance, if one shard has better results than another, per-shard normalization may skew the scores so that the top results from the weaker shard look as good as, or better than, the results from the stronger shard.
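To make the shard-level skew concrete, here is a minimal sketch in Python. It assumes min-max normalization (one possible choice, not something the current search interface prescribes), and the shard contents and the `min_max` helper are made up for illustration.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; hypothetical helper, not an OpenSearch API."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal: treat every hit as equally relevant.
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

# Made-up BM25 scores: shard A holds strong matches, shard B weak ones.
shard_a = [12.0, 9.0, 6.0]
shard_b = [3.0, 2.0, 1.0]

# Normalizing on each shard separately hides the gap between the shards:
# the best hit of the weak shard ends up tied with the best hit overall.
per_shard = min_max(shard_a) + min_max(shard_b)

# Normalizing once over all hits (what a coordinator could do) keeps it.
global_norm = min_max(shard_a + shard_b)

print(per_shard)
print(global_norm)
```

With per-shard normalization, the weak shard's best hit (raw score 3.0) gets the same normalized score as the strong shard's best hit (raw score 12.0). With a single global normalization it stays below every hit from the strong shard.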

Second, it is not possible to consider global hits for re-ranking. Because scores are assigned at the shard level, any rescoring has to be done at the shard level. I see a problem with this in two cases. First, for score combination, if an index has a significant number of shards, a user may want only the global top results to be combined, instead of combining the results on every shard in the index and then aggregating at the coordinator. Second, in the future, a user may want to re-rank results with a model that is expensive to run, and they don't want to run it on each shard in the index.
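As a rough illustration of the cost difference, the sketch below counts how often a re-ranking model would be invoked when it runs on every shard versus once at the coordinator. Everything here is hypothetical: `expensive_model` is a stand-in that simply returns the original score, and the shard contents, the shard count of 5, and the window of 10 are made up.

```python
import heapq

def expensive_model(doc_id, score):
    """Hypothetical stand-in for a costly re-ranking model; counts its calls."""
    expensive_model.calls += 1
    return score

expensive_model.calls = 0

def top_k(hits, k):
    """Return the k highest-scoring (doc_id, score) pairs."""
    return heapq.nlargest(k, hits, key=lambda h: h[1])

def rerank(hits):
    """Re-score every hit with the model and sort by the new score."""
    rescored = [(doc_id, expensive_model(doc_id, score)) for doc_id, score in hits]
    return sorted(rescored, key=lambda h: h[1], reverse=True)

# Made-up index: 5 shards with 20 hits each.
shards = [[(f"s{i}-d{j}", float(i * 10 + j)) for j in range(20)] for i in range(5)]
window = 10

# Shard-level re-ranking: the model runs over each shard's top window.
expensive_model.calls = 0
per_shard = [rerank(top_k(hits, window)) for hits in shards]
shard_level_calls = expensive_model.calls

# Coordinator-level re-ranking: gather the shard tops, keep the global
# top window, and run the model only on those.
expensive_model.calls = 0
candidates = [h for hits in shards for h in top_k(hits, window)]
global_top = rerank(top_k(candidates, window))
coordinator_calls = expensive_model.calls

print(shard_level_calls, coordinator_calls)
```

In this toy setup the shard-level variant calls the model once per shard per window entry, while the coordinator-level variant calls it only for the global window, so the gap grows with the number of shards.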

That being said, my preliminary thought for a solution would be to create some kind of search-and-merge API that might look like:

GET /{index_name}/_search_and_merge
{
    "queries": [
      {
        "query": {
          "knn": {
            "my_vector2": {
              "vector": [2, 3, 5, 6],
              "k": 2
            }
          }
        }
      },
      {
        "query": {
          "match_all": {}
        }
      }
    ],
    "merge": {
      "norm": true,
      "disjunctive": false,
      "strategy": "linear_combo",
      "params": {
        "weight_1": 10.2,
        "weight_2": 0.25
      }
    },
    "size": 100
}

Here the queries would be a list of queries to be executed independently and merge would contain the logic to combine the results at the coordinator level.
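A minimal sketch of what the coordinator-side merge could do for the `linear_combo` strategy, written in Python for illustration only. The function name `search_and_merge`, the choice of min-max normalization for `norm`, and the sample hits are all assumptions. The `disjunctive` flag is left out because its semantics are not spelled out here, so documents missing from one result set simply contribute nothing for that query.

```python
from collections import defaultdict

def min_max(scores):
    """Rescale scores to [0, 1]; hypothetical helper."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def search_and_merge(result_sets, weights, norm=True, size=10):
    """Combine per-query hits ({doc_id: score}) as a weighted sum.

    Assumes each result set already holds the global hits of one query,
    i.e. this runs at the coordinator after the shard results are gathered.
    """
    combined = defaultdict(float)
    for hits, weight in zip(result_sets, weights):
        if not hits:
            continue
        doc_ids = list(hits)
        scores = [hits[d] for d in doc_ids]
        if norm:
            # Put this query's scores on a 0-1 scale before weighting.
            scores = min_max(scores)
        for doc_id, score in zip(doc_ids, scores):
            combined[doc_id] += weight * score
    # Highest combined score first; ties broken by doc id for stability.
    ranked = sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:size]

# Made-up global hits for the two queries in the request above.
knn_hits = {"doc1": 0.9, "doc2": 0.5, "doc3": 0.1}
bm25_hits = {"doc2": 14.0, "doc3": 7.0, "doc4": 0.0}

merged = search_and_merge([knn_hits, bm25_hits], weights=[10.2, 0.25], size=100)
for doc_id, score in merged:
    print(doc_id, round(score, 3))
```

The weights are passed as a list in query order, mirroring `weight_1` and `weight_2` in the request body. Because normalization happens after all hits for a query are collected, it does not suffer from the per-shard skew described earlier.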

All that being said, I want to see what people think of doing this. Is there a way to accomplish this with the current search interface?
