[RFC] Low Level Design for Normalization and Score Combination Query #174

Closed
martin-gaievski opened this issue May 19, 2023 · 1 comment
Labels
Enhancements Increases software capabilities beyond original client specifications Features Introduces a new unit of functionality that satisfies a requirement neural-search RFC v2.10.0 Issues targeting release v2.10.0

@martin-gaievski
Member

Introduction

This issue describes the low-level design (LLD) of the query clause for the Score Normalization and Combination feature. It is one of several LLDs in scope of that feature. Reading the high-level design first, [RFC] High Level Approach and Design For Normalization and Score Combination, is highly recommended. We expect the antecedent API design to be published soon; for now we reference its draft version.

Background

As per the HLD and the API LLD, this feature needs a new Query clause in OpenSearch. The new Query will fetch results at the shard level for the different sub-queries during the Query phase of request handling. Query results will be processed in a later phase by a new processor on the coordinator node. The proposed name for the new Query is "Hybrid".

Execution at each shard is independent of other shards. The focus of this change is to get the top scores for each sub-query; all packing and reduce will happen in later stages.

New Query will be added as part of the Neural Search plugin and most of the code changes will be done in the plugin repo.

Requirements

  1. Different sub-queries should be abstracted and not be limited by particular query types like k-NN or text match based on term or keywords.

  2. New query should keep added latency (for functions like query parsing etc.) to a minimum and not degrade performance, in either latency or resource utilization, compared to a similar query that does combination at shard level.

Scope

In this document we propose a solution for the questions below:

  1. How do we handle sub-queries of the main query?

Out of Document Scope

The following items will be covered in other design documents (e.g. API Design):

  1. Query clause name.
  2. Structure of the search query, including Normalization and Score Combination techniques and parameters.
  3. How sub-query results are collected and transferred to coordinator node for normalization and combination.

Solution Overview

The new Hybrid query will be registered in the Neural Search plugin using a new QueryBuilder class. The builder creates a new Query implementation that has the logic to execute each sub-query and combine weights per sub-query at the shard level. The new query needs a doc collector that processes the results of each sub-query and gets the top x results (top docs) at the shard level. This information should clearly identify which sub-query each result belongs to. Metrics like max and min score per sub-query can be added if needed.
These results will be used by the Query Phase Searcher to pack and send shard results to the coordinator node for normalization and score combination.
The overall class structure is very similar to the DisjunctionMax query that is part of Lucene (see Defining _search API Input(#1): Score Combination and Normalization for Semantics Search [HLD]).
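
The doc collector described above could be sketched roughly as follows. This is a minimal illustration, not the plugin's implementation: the class `PerSubQueryTopDocs`, its `Hit` type and the convention that a score of 0 means "sub-query did not match" are all assumptions made for the example, and it uses plain Java collections instead of Lucene's collector API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: keeps the top-k hits separately for each sub-query on one shard. */
final class PerSubQueryTopDocs {
    /** A (docId, score) pair; a simplified stand-in for a scored document. */
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    private final int topK;
    private final List<PriorityQueue<Hit>> queues = new ArrayList<>();

    PerSubQueryTopDocs(int numSubQueries, int topK) {
        this.topK = topK;
        for (int i = 0; i < numSubQueries; i++) {
            // Min-heap on score: the weakest retained hit sits at the head.
            queues.add(new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score)));
        }
    }

    /** Records one document; scores are in sub-query order, 0 meaning "no match" (assumption). */
    void collect(int docId, float[] subQueryScores) {
        for (int i = 0; i < subQueryScores.length; i++) {
            if (subQueryScores[i] <= 0f) {
                continue;
            }
            PriorityQueue<Hit> queue = queues.get(i);
            queue.add(new Hit(docId, subQueryScores[i]));
            if (queue.size() > topK) {
                queue.poll(); // evict the lowest-scoring hit
            }
        }
    }

    /** Returns the retained hits of one sub-query, best score first. */
    List<Hit> topDocs(int subQueryIndex) {
        List<Hit> hits = new ArrayList<>(queues.get(subQueryIndex));
        hits.sort((a, b) -> Float.compare(b.score, a.score));
        return hits;
    }
}
```

Keeping one queue per sub-query is what lets a later phase tell which sub-query each result came from.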

The feature will be available to users of the Neural Search plugin, which is experimental. Once a user enables the plugin, the Normalization and Score Combination feature becomes available automatically.

Risks / Known limitations

In this design we will not create a new DTO to store the scores of individual sub-queries. Such a DTO is required by the doc collector and will be added later together with a custom QueryPhaseSearcher implementation. For this implementation we will use existing core DTOs, which allows collecting and returning scores from only the first sub-query.

In the initial implementation phase we are skipping pagination for query results, mainly because of implementation complexity and the foreseen performance overhead for some query types. For instance, a k-NN query (the base query for semantic search in the neural-search plugin) must collect not only the last “page” of results but also all previous pages (e.g. to return results 60-80, k-NN selects the first 80 results and discards results 0-60). Such an approach is very inefficient and breaks the functional requirement for minimal added latency.
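
The pagination overhead in the example above comes down to simple arithmetic, sketched here. The class and method names are hypothetical and only illustrate the from/size relationship; they are not part of the plugin.

```java
/** Hypothetical sketch of the per-shard work needed to serve one page of results. */
final class PaginationCost {
    /** Hits a shard must collect to serve the page [from, from + size). */
    static int hitsCollected(int from, int size) {
        return from + size;
    }

    /** Hits that are collected but then thrown away because they belong to earlier pages. */
    static int hitsDiscarded(int from) {
        return from;
    }
}
```

For the page covering results 60-80, `hitsCollected(60, 20)` is 80 and `hitsDiscarded(60)` is 60, so three quarters of the collected hits are wasted work.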

In initial implementation phase we are skipping query explain. Feature will be released in experimental mode and we want to make it stable before providing details on how query results are formed.

In the initial implementation phase we’ll use the default sequential execution of sub-queries.

Future extensions

  • pagination for query results
  • execute sub-queries in parallel
  • filters at the HybridQuery level (single filter for every sub-query result)
  • explain for query execution

Solution Details

We’re going to use the existing plugin class NeuralSearch as the entry point to register the new HybridQueryBuilder. The builder creates an instance of HybridQuery that encapsulates the logic for getting results for each sub-query. Part of the Query's responsibility is creating the Weight and Scorer, which both provide scores for the results of each sub-query at the shard level.

Figure 1: Class diagram for Hybrid Query implementation

All the classes and logic above are agnostic to the type of sub-query, which addresses the functional requirement of being query-type agnostic.

Below is the general data flow for getting query results for the new Hybrid Query. It represents a single shard; the same execution happens on every shard in the index.

Figure 2: General sequence diagram for getting query results

API Interface

For multiple sub-queries, the query supports a JSON array:

GET <index-name>/_search
{
    "query": {
        "hybrid": {
            "queries": [
                { /* neural query */ }, // added as an example
                { /* standard text search */ } // if a user wants to boost or update
                // the scores, it needs to be done in this query clause
            ]
        }
    }
}

A single sub-query can be passed as a JSON object:

GET <index-name>/_search
{
    "query": {
        "hybrid": {
            "queries":
                { /* neural query */ } // added as an example
        }
    }
}

Query Builder

Parses the input and produces an instance of the Query class. To parse each sub-query we use existing core logic from AbstractQueryBuilder.
The class holds a collection of query builders, one per sub-query.

Query

Rewrites each of the sub-queries.
The class holds a collection of Query objects, one per sub-query.

Weight

Constructs a Weight object for each sub-query. Returns a HybridScorer object that holds the scorers for each sub-query.
Throws an exception if explain is called.

Scorer

Responsible for iterating over the results of each sub-query in desc order. Keeps a priority queue of doc ids. For each next doc id, gets the score from each sub-query. Sub-query scores are stored in an array in the same order as the sub-queries appear in the input query, which makes it possible to map each sub-query to its score.
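
The scorer's merge step could look roughly like the sketch below. It is an illustration built on plain Java types, not on Lucene's Scorer API: `HybridScorerSketch`, `SubQueryCursor` and `DocScores` are hypothetical names, and the sketch assumes each sub-query's matches are available as arrays sorted by doc id. It visits doc ids in increasing order, which is how Lucene's DocIdSetIterator advances; treat the iteration order as an assumption of the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch of merging sub-query matches into per-document score arrays. */
final class HybridScorerSketch {
    /** Cursor over one sub-query's matches; docIds must be sorted ascending and unique. */
    static final class SubQueryCursor {
        final int subQueryIndex; // position of the sub-query in the input query
        final int[] docIds;
        final float[] scores;
        int pos = 0;

        SubQueryCursor(int subQueryIndex, int[] docIds, float[] scores) {
            this.subQueryIndex = subQueryIndex;
            this.docIds = docIds;
            this.scores = scores;
        }

        int docId() { return docIds[pos]; }
        float score() { return scores[pos]; }
        boolean advance() { return ++pos < docIds.length; }
    }

    /** One document with its scores, indexed by sub-query position (0 = no match). */
    static final class DocScores {
        final int docId;
        final float[] subQueryScores;

        DocScores(int docId, float[] subQueryScores) {
            this.docId = docId;
            this.subQueryScores = subQueryScores;
        }
    }

    static List<DocScores> scoreAll(List<SubQueryCursor> cursors) {
        // Priority queue ordered by the doc id each cursor currently points at.
        PriorityQueue<SubQueryCursor> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a.docId(), b.docId()));
        for (SubQueryCursor cursor : cursors) {
            if (cursor.docIds.length > 0) {
                queue.add(cursor);
            }
        }
        List<DocScores> result = new ArrayList<>();
        while (!queue.isEmpty()) {
            int docId = queue.peek().docId();
            float[] scores = new float[cursors.size()]; // defaults to 0 for non-matching sub-queries
            // Drain every cursor positioned on this doc id and record its score.
            while (!queue.isEmpty() && queue.peek().docId() == docId) {
                SubQueryCursor cursor = queue.poll();
                scores[cursor.subQueryIndex] = cursor.score();
                if (cursor.advance()) {
                    queue.add(cursor);
                }
            }
            result.add(new DocScores(docId, scores));
        }
        return result;
    }
}
```

Because each score is written at `subQueryIndex`, position `i` of the array always corresponds to the `i`-th sub-query of the input query, regardless of which sub-queries matched the document.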

Plugin

Registers the Hybrid query and returns a collection of QuerySpecs for the NeuralSearch plugin.

Main implementation details related to potential security threats:

  • limit the number of sub-queries in a single hybrid query, 5 sub-queries max in first release
  • sub-queries will be parsed by existing logic in core
  • error messages will contain minimal information; mirroring user input should be avoided. Detailed information will be logged instead
  • integration test(s) for this feature will run as part of a periodic CI task with the security plugin enabled
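
The sub-query limit and the "minimal error message" rule from the list above might be enforced along these lines. This is a sketch only: `SubQueryLimit`, `validate` and the exact message text are hypothetical and not taken from the plugin.

```java
import java.util.List;

/** Hypothetical sketch of the sub-query limit check for a hybrid query. */
final class SubQueryLimit {
    /** Maximum number of sub-queries in the first release, per this design. */
    static final int MAX_SUB_QUERIES = 5;

    /** Returns the list unchanged, or throws if it holds more sub-queries than allowed. */
    static <T> List<T> validate(List<T> subQueries) {
        if (subQueries.size() > MAX_SUB_QUERIES) {
            // The message states only the limit; it deliberately does not echo user input.
            throw new IllegalArgumentException(
                "Number of sub-queries exceeds the supported maximum of " + MAX_SUB_QUERIES);
        }
        return subQueries;
    }
}
```

Running the check while parsing, before any sub-query is rewritten or executed, keeps the cost of rejecting an oversized request low.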

Testability

The query is testable via the existing _search REST API and lower-level direct API calls. Main testing will be done via unit and integration tests. We don’t need backward compatibility tests, as Neural Search is in experimental mode and there is no commitment to support previous versions.

Tests will be focused on overall query stability. Below are main test cases that will be covered:

  • build Hybrid Query from user input that has no sub-queries, one sub-query, or multiple sub-queries
  • fail with expected error message if user tries to build Hybrid Query from the incorrect or invalid input
  • fail with expected error message if number of sub-queries is more than designed max number (5)
  • low level query rewrite calls return expected results
  • serialization and deserialization of query objects work (this addresses cluster with multiple nodes)
  • base query and query builder functions - hashcode, equals, etc.

The tests mentioned above are part of the plugin repo CI and can also be executed on demand from a development environment.

Tests for metrics like score correctness, performance, etc. will be added in later implementations, once the end-to-end solution is available.

Reference Links

  1. Meta Issue for Feature: [META] Score Combination and Normalization for Semantics Search. Score Normalization for k-NN and BM25 #123
  2. [RFC] High Level Approach and Design For Normalization and Score Combination #126
  3. Dis_max Query: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/query-dsl-dis-max-query.html

@navneet1v
Collaborator

Resolving this GitHub issue, as the changes for the 2.10 RC are finalized and merged. Please create a GitHub issue if there are any further questions.
