[RFC] Low Level Design for Normalization and Score Combination Query #174

martin-gaievski · 2023-05-19T20:14:59Z

Introduction

This issue describes the low-level design (LLD) of the Query clause in scope of the Score Normalization and Combination feature. It is one of multiple LLDs in scope of that feature. Reading the high-level design, [RFC] High Level Approach and Design For Normalization and Score Combination, beforehand is highly recommended. We expect the API design that precedes this one to be published soon; for now we reference its draft version.

Background

As per the HLD and the API LLD, this feature needs a new Query clause in OpenSearch. The new Query will fetch results at the shard level for different sub-queries during the Query phase of request handling. Query results will be processed in a later phase by a new processor on the coordinator node. The proposed name for the new Query is "Hybrid".

Execution at each shard will be independent of other shards. The focus of the change is to get all top scores for each sub-query; all packing and reducing will happen in later stages.

The new Query will be added as part of the Neural Search plugin, and most of the code changes will be done in the plugin repo.

Requirements

Different sub-queries should be abstracted and not limited to particular query types such as k-NN or text match based on terms or keywords.
The new query should keep added latency (for functions like query parsing) to a minimum and should not degrade latency or resource utilization compared to a similar query that does combination at the shard level.

Scope

In this document we propose a solution for the questions below:

How do we handle sub-queries of the main query?

Out of Document Scope

The following items will be covered in other design documents (e.g. the API design):

Query clause name.
Structure of the search query, including Normalization and Score Combination techniques and parameters.
How sub-query results are collected and transferred to coordinator node for normalization and combination.

Solution Overview

The new Hybrid query will be registered in the Neural Search plugin using a new QueryBuilder class. The builder class will create a new Query class implementation that has the logic to execute each sub-query and combine weights per sub-query at the shard level. The new query needs a doc collector that processes the results of each sub-query and gets the top x results (top docs) at the shard level. This information should clearly identify the sub-query to which each result belongs. Metrics like max and min score per query can be added if needed.
These results will be used by the Query Phase Searcher to pack and send shard results to the coordinator node for normalization and score combination.
The overall class structure is very similar to the DisjunctionMax query that is part of Lucene. See also: Defining _search API Input(#1): Score Combination and Normalization for Semantics Search [HLD].
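The per-sub-query top docs collection described above could look roughly like the sketch below. It is an illustration only, not the plugin's collector: the class name TopDocsPerSubQuerySketch, the Hit type, and the convention that a score of 0 means "this sub-query did not match" are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch; none of these names come from the plugin.
final class TopDocsPerSubQuerySketch {

    // One hit: a doc id plus the score that a single sub-query gave it.
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    private final int topN;
    // One bounded min-heap per sub-query; the head is the weakest hit kept so far.
    private final List<PriorityQueue<Hit>> heaps = new ArrayList<>();

    TopDocsPerSubQuerySketch(int numSubQueries, int topN) {
        this.topN = topN;
        for (int i = 0; i < numSubQueries; i++) {
            heaps.add(new PriorityQueue<Hit>(Comparator.comparingDouble((Hit h) -> h.score)));
        }
    }

    // scores[i] is the score of sub-query i for this doc (assumption: 0 means no match).
    void collect(int docId, float[] scores) {
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] <= 0f) {
                continue;
            }
            PriorityQueue<Hit> heap = heaps.get(i);
            heap.add(new Hit(docId, scores[i]));
            if (heap.size() > topN) {
                heap.poll(); // evict the lowest score so only topN hits remain
            }
        }
    }

    // Top hits of one sub-query, best score first.
    List<Hit> topDocs(int subQueryIndex) {
        List<Hit> hits = new ArrayList<>(heaps.get(subQueryIndex));
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return hits;
    }
}
```

Keeping one heap per sub-query, indexed by the sub-query position, is what lets a later stage tell which sub-query produced each result.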

The feature will be available to users of the Neural Search plugin, which is experimental. Once a user enables the plugin, the Normalization and Score Combination feature becomes available automatically.

Risks / Known limitations

In this design we will not create a new DTO object to store the scores of individual sub-queries. Such a DTO is required by the doc collector and will be added later together with the custom QueryPhaseSearcher implementation. For this implementation we will use existing core DTOs, which allow collecting and returning scores from the first sub-query only.

In the initial implementation phase we are skipping pagination for query results. This is mainly due to the complexity of the implementation and the foreseen performance overhead for some query types. For instance, a k-NN query (the base query for semantic search in the neural-search plugin) must collect not only the last “page” of results but also all previous pages (e.g. to get results 60-80, k-NN will select the first 80 results and ignore results 0-60). Such an approach is very inefficient and breaks the functional requirement for minimal added latency.
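The cost of the 60-80 example can be worked out with a few lines of arithmetic. The helper names below are hypothetical and exist only for this illustration.

```java
// Hypothetical helper for illustration; not part of the plugin.
final class PaginationCostSketch {

    // Number of hits that must be collected to serve results [from, from + size).
    static int hitsToCollect(int from, int size) {
        return from + size;
    }

    // Fraction of the collected hits that is thrown away after paging.
    static double discardedFraction(int from, int size) {
        return (double) from / (from + size);
    }
}
```

For from = 60 and size = 20, hitsToCollect returns 80 and discardedFraction returns 0.75, so three quarters of the collected hits are dropped, and the waste grows with page depth.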

In the initial implementation phase we are skipping query explain. The feature will be released in experimental mode, and we want to make it stable before providing details on how query results are formed.

In the initial implementation phase we’ll be using the default sequential execution of sub-queries.

Future extensions

pagination for query results
execute sub-queries in parallel
filters at the HybridQuery level (single filter for every sub-query result)
explain for query execution

Solution Details

We’re going to use the existing plugin class NeuralSearch as an entry point to register the new HybridQueryBuilder. The builder class creates an instance of HybridQuery that encapsulates the logic for getting results for each of the sub-queries. Part of the responsibility of the Query is the creation of the Weight and Scorer, which both provide scores for the results of each sub-query at the shard level.

Figure 1: Class diagram for Hybrid Query implementation

All the classes and logic above are agnostic to the type of sub-query, which addresses the functional requirement of being query-type agnostic.

Below is the general data flow for getting query results for the new Hybrid Query. It represents a single shard; the same execution happens for every shard in the index.

Figure 2: General sequence diagram for getting query results

API Interface

For multiple sub-queries, the query supports a JSON array:

GET <index-name>/_search
{
    "query": {
        "hybrid": {
            "queries": [
                { /* neural query */ }, // this is added as an example
                { /* standard text search */ } // if a user wants to boost or update
                // some scores, that needs to be done in this query clause
            ]
        }
    }
}

A single sub-query can be passed as a JSON object:

GET <index-name>/_search
{
    "query": {
        "hybrid": {
            "queries": { /* neural query */ } // this is added as an example
        }
    }
}

Query Builder

Parses the input and produces an instance of the Query class. To parse each sub-query we use the existing core logic from AbstractQueryBuilder.
The class holds a collection of query builders, one for each sub-query.
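Since "queries" may arrive either as an array or as a single object, the builder has to normalize both shapes into one collection. The sketch below shows that step in isolation, under the assumption that the field value has already been parsed into plain Java objects (a Map for a JSON object, a List for a JSON array). The real builder would work on the OpenSearch content parser and delegate each sub-query to core parsing; the class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the real builder parses with OpenSearch core classes instead.
final class HybridQueriesInputSketch {

    // Accepts the already-parsed value of the "queries" field and always returns a list.
    @SuppressWarnings("unchecked")
    static List<Map<String, Object>> toSubQueries(Object queriesField) {
        List<Map<String, Object>> subQueries = new ArrayList<>();
        if (queriesField instanceof Map) {
            // A single sub-query passed as a JSON object.
            subQueries.add((Map<String, Object>) queriesField);
        } else if (queriesField instanceof List) {
            // Multiple sub-queries passed as a JSON array; input order is preserved.
            for (Object element : (List<?>) queriesField) {
                if (!(element instanceof Map)) {
                    throw new IllegalArgumentException("each sub-query must be a JSON object");
                }
                subQueries.add((Map<String, Object>) element);
            }
        } else {
            throw new IllegalArgumentException("[queries] must be a JSON object or an array of objects");
        }
        return subQueries;
    }
}
```

Preserving input order matters because later stages map scores back to sub-queries by position.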

Query

Rewrites each of the sub-queries.
The class holds a collection of Query objects, one for each sub-query.
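One common way to implement such a rewrite is to rewrite every sub-query and build a new compound query only if at least one of them changed. The sketch below shows that pattern with toy types; SketchQuery, TermSketch, PrefixSketch and HybridSketch are hypothetical stand-ins, and the index reader or searcher argument that a real Lucene rewrite takes is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the rewrite pattern; not the plugin's HybridQuery.
final class HybridRewriteSketch {

    // Stand-in for a Lucene query; a real rewrite also takes a reader/searcher argument.
    interface SketchQuery {
        SketchQuery rewrite();
    }

    // A primitive query that rewrites to itself.
    static final class TermSketch implements SketchQuery {
        final String term;

        TermSketch(String term) {
            this.term = term;
        }

        @Override
        public SketchQuery rewrite() {
            return this;
        }
    }

    // A toy query that rewrites into a simpler form.
    static final class PrefixSketch implements SketchQuery {
        final String prefix;

        PrefixSketch(String prefix) {
            this.prefix = prefix;
        }

        @Override
        public SketchQuery rewrite() {
            return new TermSketch(prefix + "*");
        }
    }

    static final class HybridSketch implements SketchQuery {
        final List<SketchQuery> subQueries;

        HybridSketch(List<SketchQuery> subQueries) {
            this.subQueries = new ArrayList<>(subQueries);
        }

        @Override
        public SketchQuery rewrite() {
            boolean changed = false;
            List<SketchQuery> rewritten = new ArrayList<>(subQueries.size());
            for (SketchQuery subQuery : subQueries) {
                SketchQuery result = subQuery.rewrite();
                changed |= (result != subQuery);
                rewritten.add(result); // keep the original sub-query order
            }
            // Returning the same instance signals that no further rewriting is needed.
            return changed ? new HybridSketch(rewritten) : this;
        }
    }
}
```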

Weight

Constructs a weight object for each sub-query. Returns a HybridScorer object that has scorers for each sub-query.
Throws an exception if explain is called.

Scorer

Responsible for iterating over the results of each sub-query in desc order. Keeps a priority queue of doc ids. For each next doc id, gets the score from each sub-query. Sub-query scores are stored in an array in the same order as the sub-queries in the input query, which allows mapping each sub-query to its score.
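The priority-queue iteration and the positional score array can be illustrated with the toy scorer below. It is a sketch under stated assumptions, not the plugin's HybridScorer: each sub-query is modeled as a pair of arrays (matching doc ids and their scores), doc ids are assumed to be yielded in increasing order as Lucene's DocIdSetIterator does, and a score of 0 marks "no match". All class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical sketch; models sub-query scorers with plain arrays.
final class HybridScorerSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Toy stand-in for one sub-query's scorer.
    private static final class SubScorer {
        final int index;      // position of the sub-query in the input query
        final int[] docs;     // matching doc ids, in increasing order
        final float[] scores; // scores[i] belongs to docs[i]
        int pos = 0;

        SubScorer(int index, int[] docs, float[] scores) {
            this.index = index;
            this.docs = docs;
            this.scores = scores;
        }

        int docId() {
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }
    }

    // Sub-scorers ordered by the doc id they are currently positioned on.
    private final PriorityQueue<SubScorer> queue =
        new PriorityQueue<SubScorer>(Comparator.comparingInt((SubScorer s) -> s.docId()));
    private final float[] subScores;

    HybridScorerSketch(int[][] docsPerSubQuery, float[][] scoresPerSubQuery) {
        subScores = new float[docsPerSubQuery.length];
        for (int i = 0; i < docsPerSubQuery.length; i++) {
            SubScorer s = new SubScorer(i, docsPerSubQuery[i], scoresPerSubQuery[i]);
            if (s.docId() != NO_MORE_DOCS) {
                queue.add(s);
            }
        }
    }

    // Advances to the next doc matched by at least one sub-query and fills its scores.
    int nextDoc() {
        Arrays.fill(subScores, 0f);
        if (queue.isEmpty()) {
            return NO_MORE_DOCS;
        }
        int doc = queue.peek().docId();
        while (!queue.isEmpty() && queue.peek().docId() == doc) {
            SubScorer s = queue.poll();          // remove before mutating its position
            subScores[s.index] = s.scores[s.pos];
            s.pos++;
            if (s.docId() != NO_MORE_DOCS) {
                queue.add(s);                    // re-insert under its new doc id
            }
        }
        return doc;
    }

    // Scores for the current doc, indexed by sub-query position; 0 means no match.
    float[] subScores() {
        return subScores.clone();
    }
}
```

With sub-query 0 matching docs {1, 3} and sub-query 1 matching docs {3, 4}, successive nextDoc calls visit docs 1, 3 and 4; for doc 3 both array slots are filled, while for docs 1 and 4 only one is.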

Plugin

Registers the Hybrid query and returns a collection of QuerySpecs for the NeuralSearch plugin.

Main implementation details related to potential security threats:

limit the number of sub-queries in a single hybrid query: 5 sub-queries max in the first release
sub-queries will be parsed by existing logic in core
error messages will contain minimal information; mirroring user input should be avoided. Detailed information will be logged instead
integration test(s) for this feature will run as part of a periodic CI task with the security plugin enabled
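The sub-query limit and the error-message rule from the list above could be combined as in the following sketch. The class name, the constant name and the message wording are assumptions for illustration, and java.util.logging stands in for whatever logger the plugin uses.

```java
import java.util.List;
import java.util.logging.Logger;

// Hypothetical sketch; names and message wording are illustrative only.
final class SubQueryLimitSketch {
    static final int MAX_SUB_QUERIES = 5; // limit stated for the first release

    private static final Logger LOG = Logger.getLogger(SubQueryLimitSketch.class.getName());

    static void validate(List<?> subQueries) {
        if (subQueries.size() > MAX_SUB_QUERIES) {
            // Details go to the server log only.
            LOG.warning("hybrid query rejected: received " + subQueries.size()
                + " sub-queries, max is " + MAX_SUB_QUERIES);
            // The client-facing message states the rule without echoing any user input.
            throw new IllegalArgumentException("number of sub-queries exceeds the supported maximum");
        }
    }
}
```

Keeping the detailed count in the log rather than in the exception follows the "minimal information" rule: the caller learns which constraint was violated, but nothing from the request body is reflected back.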

Testability

The query is testable via the existing _search REST API and lower-level direct API calls. Main testing will be done via unit and integration tests. We don’t need backward compatibility tests as Neural-search is in experimental mode and there is no commitment to support previous versions.

Tests will be focused on overall query stability. Below are main test cases that will be covered:

build a Hybrid Query from user input that has no sub-queries, one sub-query, or multiple sub-queries
fail with the expected error message if the user tries to build a Hybrid Query from incorrect or invalid input
fail with the expected error message if the number of sub-queries exceeds the designed max number (5)
low-level query rewrite calls return expected results
serialization and deserialization of query objects work (this addresses clusters with multiple nodes)
base query and query builder functions: hashcode, equals, etc.
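The serialization and equals/hashcode cases from the list above typically reduce to a round-trip check like the one sketched here. The toy builder holds each sub-query as a string and uses java.io data streams as a stand-in for the OpenSearch stream classes; every name in the block is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical toy builder; java.io streams stand in for the OpenSearch stream classes.
final class HybridBuilderSketch {
    final List<String> subQueries; // each sub-query kept as a string for simplicity

    HybridBuilderSketch(List<String> subQueries) {
        this.subQueries = new ArrayList<>(subQueries);
    }

    void writeTo(DataOutputStream out) throws IOException {
        out.writeInt(subQueries.size());
        for (String subQuery : subQueries) {
            out.writeUTF(subQuery);
        }
    }

    static HybridBuilderSketch readFrom(DataInputStream in) throws IOException {
        int count = in.readInt();
        List<String> subQueries = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            subQueries.add(in.readUTF());
        }
        return new HybridBuilderSketch(subQueries);
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof HybridBuilderSketch
            && subQueries.equals(((HybridBuilderSketch) other).subQueries);
    }

    @Override
    public int hashCode() {
        return Objects.hash(subQueries);
    }

    // Serializes and deserializes a builder, as happens when a query travels between nodes.
    static HybridBuilderSketch roundTrip(HybridBuilderSketch original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                original.writeTo(out);
            }
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return readFrom(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A test then asserts that the round-tripped copy equals the original and has the same hash code, and that builders with different sub-queries are not equal.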

The tests mentioned above are part of the plugin repo CI and can also be executed on demand from a development environment.

Tests for metrics like score correctness, performance, etc. will be added in later implementations when the end-to-end solution is available.

Reference Links

Meta Issue for Feature: [META] Score Combination and Normalization for Semantics Search. Score Normalization for k-NN and BM25 #123
High-level design: [RFC] High Level Approach and Design For Normalization and Score Combination #126
Dis_max Query: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/query-dsl-dis-max-query.html


navneet1v · 2023-09-22T21:09:06Z

Resolving this GitHub issue as the changes for the 2.10 RC are finalized and merged. Please create a GitHub issue if there are any further questions.

martin-gaievski added the Enhancements, neural-search, Features, RFC, and v2.9.0 labels May 19, 2023

martin-gaievski self-assigned this May 19, 2023

github-actions bot added the untriaged label May 19, 2023

martin-gaievski removed the untriaged label May 19, 2023

This was referenced May 19, 2023

[FEATURE] New Query for Normalization and Score Combination Query #175

Closed

[META] Score Combination and Normalization for Semantics Search. Score Normalization for k-NN and BM25 #123

Closed

This was referenced Jun 3, 2023

[RFC] Low Level Design for Normalization and Score Combination Query Phase Searcher #193

Closed

[FEATURE] New Doc Collector for Normalization and Score Combination Query #194

Closed

navneet1v added the v2.10.0 label and removed the v2.9.0 label Jul 15, 2023

martin-gaievski mentioned this issue Jul 19, 2023

[FEATURE] Provide way of defining methods for score normalization and combination in scope of Hybrid search #228

Closed

2 tasks

navneet1v closed this as completed Sep 22, 2023
