[FEATURE] Provide way of defining methods for score normalization and combination in scope of Hybrid search #228

Closed
1 of 2 tasks
martin-gaievski opened this issue Jul 19, 2023 · 4 comments
Labels
Enhancements Increases software capabilities beyond original client specifications Features Introduces a new unit of functionality that satisfies a requirement

Comments

@martin-gaievski
Member

martin-gaievski commented Jul 19, 2023

Description

For the Normalization and Score Combination feature, we need a processing unit that handles the scores collected during the Query phase of Hybrid search. We also need a way to define different techniques for score normalization and combination.

Solution

The solution we are proposing is to create a new implementation of a Search phase results processor. This processor will be set up as part of a search pipeline and called between the Query and Fetch phases. More details on such processors can be found in the corresponding core PR (see Reference Links below).

The processor will support a predefined set of techniques for normalization and combination. The exact techniques are defined using the search pipeline API, and the pipeline must then be referenced from the _search call. We start with min-max for normalization and arithmetic mean for combination.

A processor definition may look something like this:

{
    "description": "Post processor for hybrid search",
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {
                    "technique": "MIN_MAX"
                },
                "combination": {
                    "technique": "ARITHMETIC_MEAN",
                    "parameters": {
                        "weights": [
                            0.4, 0.7
                        ]
                    }
                }
            }
        }
    ]
}
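
The two techniques named in this definition can be illustrated with a minimal Python sketch. This is not the processor's implementation. The function names are hypothetical, and the handling of equal scores, of documents missing from a sub-query, and the division by the sum of weights are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale the scores of one sub-query into [0, 1] using min-max."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: when every score is identical, map all of them to 1.0.
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def weighted_arithmetic_mean(
    scores: Sequence[Optional[float]], weights: Sequence[float]
) -> float:
    """Combine one document's normalized scores, one entry per sub-query.

    Weights are paired with sub-queries by position. A score of None means
    the document was not returned by that sub-query; as an assumption, such
    entries are skipped rather than counted as zero.
    """
    pairs = [(s, w) for s, w in zip(scores, weights) if s is not None]
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        return 0.0
    return sum(s * w for s, w in pairs) / total_weight
```

For example, `min_max_normalize([2.0, 4.0, 6.0])` returns `[0.0, 0.5, 1.0]`, and each document's combined score is then computed from its normalized scores across sub-queries.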

Tasks

  • Implementation of a Search phase result processor
  • Testing

Reference Links

  1. [RFC] High Level Approach and Design For Normalization and Score Combination #126
  2. [RFC] Low Level Design for Normalization and Score Combination Query #174
  3. [RFC] Low Level Design for Normalization and Score Combination Query Phase Searcher #193
  4. Adding the SearchPhaseResultsProcessor interface in Search Pipeline OpenSearch#7283
@austintlee

For weights, have you considered this format:

"weights": {
    "knn": 0.4,
    "bm25": 0.6
}

@martin-gaievski
Member Author

For weights, have you considered this format:

"weights": {
    "knn": 0.4,
    "bm25": 0.6
}

@austintlee I think with such a format you need a way to map between each sub-query and its key name. For example, my query may look something like this:

    "query": {
        "hybrid": {
            "queries": [
                {
                    "neural": {}
                },
                {
                    "match": {}
                },
                {
                    "match": {}
                },
                {
                    "bool": {
                        "should": [
                            {
                                "nested": {
                                    "path": "quest",
                                    "query": {
                                        "knn": {}
                                    }
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }

We need to map each of the 4 sub-queries to its weight. For instance, the key could be the query type, but I see a few problems with that approach: which key to use for nested queries like bool [match], and what to do if we need different weights for different sub-queries of the same type.
Do you have something in mind for the mapping?

@austintlee

I didn't realize this feature aspires to implement a generic hybrid search. I was under the impression that it simply combines a BM25 search and a KNN search which is why I thought you'd always have two weights that add up to 1.0.

Don't the weights need to sum to 1? It looks like in the current implementation, you assign a weight of 1.0 to sub-queries that are not matched to the weights specified in the query. In other words, if you have 2 weights in the input and 4 sub-queries, the 3rd and 4th sub-queries seem to get a weight of 1.0?
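
The behavior described in this question can be sketched as follows. This only illustrates what the comment reports, not the project's code: the names `DEFAULT_WEIGHT` and `weights_for_sub_queries` are hypothetical, and rejecting a weight list longer than the sub-query list is an assumption.

```python
from typing import List, Sequence

# Value reported in the comment above; the constant name is hypothetical.
DEFAULT_WEIGHT = 1.0


def weights_for_sub_queries(
    weights: Sequence[float], num_sub_queries: int
) -> List[float]:
    """Pair weights with sub-queries by position, padding the missing ones."""
    if len(weights) > num_sub_queries:
        # Assumption: extra weights are treated as an error.
        raise ValueError("more weights than sub-queries")
    padding = [DEFAULT_WEIGHT] * (num_sub_queries - len(weights))
    return list(weights) + padding
```

With 2 weights and 4 sub-queries, `weights_for_sub_queries([0.4, 0.6], 4)` returns `[0.4, 0.6, 1.0, 1.0]`, so the effective weights no longer sum to 1.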

@navneet1v
Collaborator

Don't the weights need to sum to 1?

Yes, the weights need to sum to 1. We didn't add this check at the start; it needs to be added.

@austintlee The query clause that we are building is not specific to k-NN or BM25. The new query clause is intended to be used for any number n of queries (where n <= 5) that produce scores on different scales.

Also, if you look closely you will see that a k-NN query can be created from different query clauses, such as neural or any other clause in the future. So, at least in the code, there is no way to tell what is k-NN and what is BM25. So this also helps solve that problem. :)
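
A minimal sketch of the missing check might look like the following. The function name, the tolerance, and the rejection of negative weights are assumptions made for illustration; only the sum-to-1 rule and the limit of 5 come from the comment above.

```python
import math
from typing import Sequence

MAX_SUB_QUERIES = 5  # limit mentioned in the comment above


def validate_weights(weights: Sequence[float]) -> None:
    """Raise ValueError if the weights break the rules discussed above."""
    if len(weights) > MAX_SUB_QUERIES:
        raise ValueError(f"at most {MAX_SUB_QUERIES} weights are supported")
    if any(w < 0 for w in weights):
        # Assumption: negative weights are not meaningful here.
        raise ValueError("weights must be non-negative")
    # Compare with a tolerance (assumed value) to absorb rounding error.
    if not math.isclose(sum(weights), 1.0, abs_tol=1e-6):
        raise ValueError("weights must sum to 1.0")
```

Under this check, the weights 0.4 and 0.6 pass, while 0.4 and 0.7 are rejected because they sum to 1.1.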
