[FEATURE][RFC] Querqy Refactor with Querqy Unplugged & Search Pipelines #184

Open
macohen opened this issue Jul 26, 2023 · 5 comments
Labels
enhancement change or upgrade that increases software capabilities beyond original client specifications Search:Relevance

Comments

@macohen
Collaborator

macohen commented Jul 26, 2023

When a user enters a search, they have an intent in mind about what they want to find. This intent is typed in their own words and may not match the text in the search index. An area of search meant to assist with interpretation is query understanding. A technique in query understanding is called query rewriting. Before the index is searched, the query is examined to provide the search with more context and then the query is rewritten with this new context. This RFC suggests ways to integrate a specific library used for query rewriting and also attempts to define proposals for more generic interfaces for query rewriting in search pipelines so that builders can bring their own rewriting logic while still taking advantage of the benefits of search pipelines - logical separation,

Creating rules to refine queries in search applications is a standard practice. Users enter free text search queries with the intent to find something specific. For example, a search query on a site selling home goods could be “gas grill weber.” Through query rewriting, the engine could interpret “weber” as the brand Weber and rewrite the query to boost “gas grill” matches where the Brand field in the index is “Weber.” My assumption and experience tell me that many search application builders do this work with custom code or don’t know that they could do this type of rewrite at all. Querqy was developed as a plugin for Elasticsearch and Solr to help centralize rewriting and reduce its complexity. Later it was ported to OpenSearch. The plugin currently lives in the querqy GitHub repo and does not get upgraded with each OpenSearch release, because this is difficult to do unless the plugin is in the opensearch-project org and has access to the same CI infrastructure as other plugins.
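
The “gas grill weber” example can be made concrete with a small sketch. This is only an illustration and not Querqy's implementation: the `KNOWN_BRANDS` table, the `title` and `brand` field names, and the boost value are assumptions made up for the example. The output uses the standard OpenSearch `bool` / `match` / `term` query DSL.

```python
# Hypothetical brand dictionary; a real system would load this from managed rules.
KNOWN_BRANDS = {"weber": "Weber"}


def rewrite_with_brand_boost(user_query, text_field="title", brand_field="brand", boost=5.0):
    """Split brand tokens out of the query and turn them into a boosting clause."""
    tokens = user_query.lower().split()
    brands = [KNOWN_BRANDS[t] for t in tokens if t in KNOWN_BRANDS]
    remaining = [t for t in tokens if t not in KNOWN_BRANDS]

    # If only a brand was typed, fall back to matching everything and let the boost rank.
    if remaining:
        must = [{"match": {text_field: " ".join(remaining)}}]
    else:
        must = [{"match_all": {}}]

    query = {"bool": {"must": must}}
    if brands:
        # "should" clauses raise the score of matching documents without excluding others.
        query["bool"]["should"] = [
            {"term": {brand_field: {"value": b, "boost": boost}}} for b in brands
        ]
    return {"query": query}


rewritten = rewrite_with_brand_boost("gas grill weber")
```

Here `rewritten` requires a match on "gas grill" and adds an optional, boosted clause on the brand field, so Weber products rank higher without other brands being filtered out.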

Querqy comes with these rewriters, each of which may be usable if implemented as a SearchRequestProcessor:

(copied & pasted from https://docs.querqy.org/querqy/rewriters/common-rules.html)

  • Common Rules Rewriter: Query-dependent rules for synonyms, result boosting (up/down), filters; ‘decorate’ result with additional information.
  • Replace Rewriter: Replace query terms. Used as a query normalisation step, usually applied before the query is processed further, for example, before the Common Rules Rewriter is applied.
  • Word Break Rewriter: (De)compounds query tokens. Splits compound words or creates compounds from separate tokens.
  • Number-Unit Rewriter: Recognises numerical values and units of measurement in the query and matches them with indexed fields. Allows for range matches and boosting of the exactly matching value.
  • Shingle Rewriter: Creates shingles (compounds) from adjacent query tokens and adds them as synonyms.
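
To make the last description concrete, here is a rough sketch of what a shingle-style rewrite produces. It is a simplification written for this thread, not Querqy's code; the function name and the list-of-lists output shape are assumptions.

```python
def shingle_synonyms(query):
    """Return, for each token position, the token plus any adjacent-pair shingles."""
    tokens = query.split()
    alternatives = [[tok] for tok in tokens]
    for i in range(len(tokens) - 1):
        shingle = tokens[i] + tokens[i + 1]
        # The compound is offered as a synonym at both positions it was built from.
        alternatives[i].append(shingle)
        alternatives[i + 1].append(shingle)
    return alternatives


alternatives = shingle_synonyms("gas grill weber")
```

For this query each position keeps its original token and gains the compounds `gasgrill` and/or `grillweber` as alternatives, which helps when the index contains the compound spelling.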

I propose that OpenSearch's Search Pipelines feature (https://opensearch.org/docs/latest/search-plugins/search-pipelines/index/), in combination with Querqy's library-based implementation, Querqy Unplugged (https://github.com/querqy/querqy-unplugged), be used to integrate multiple query rewriting components as processors. This could also reveal a clearer way to bring backend functionality into OpenSearch without having to move repositories into the project itself:

  1. Refactor an existing plugin so we can separate plugin concerns (OpenSearch hooks) from the functionality of the plugin (Querqy itself). The plugin can also be a SearchProcessor and depend on several libraries.
  2. Leave the functionality in the originating repo to be maintained there. This means we can
  3. build a plugin/search processor in the opensearch-project org that uses the functionality as a library. In the spirit of Search Pipelines, we could build a single processor for each type of rewriting operation.
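
The separation described in these steps could look roughly like the sketch below. It is written in Python for brevity and uses plain dicts; the real SearchRequestProcessor is a Java interface, and the class name, the `process_request` method, and the request shape here are all illustrative assumptions.

```python
# --- "library" side: pure rewriting logic with no knowledge of the search engine ---
def replace_terms(query, replacements):
    """Swap individual query terms according to a replacement table."""
    return " ".join(replacements.get(t, t) for t in query.split())


# --- "plugin" side: a thin adapter that only knows how to read and write the request ---
class ReplaceRequestProcessor:
    def __init__(self, field, replacements):
        self.field = field
        self.replacements = replacements

    def process_request(self, request):
        """Return a copy of the request with the match text rewritten by the library."""
        match = request["query"]["match"]
        rewritten_text = replace_terms(match[self.field], self.replacements)
        return {**request, "query": {"match": {**match, self.field: rewritten_text}}}


processor = ReplaceRequestProcessor("title", {"webber": "weber"})
request = {"size": 10, "query": {"match": {"title": "gas grill webber"}}}
rewritten = processor.process_request(request)
```

The adapter is the only part that would need to change with each OpenSearch release; the library function can be versioned and tested on its own.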

Benefits

  • Querqy Unplugged is a library. We can create SearchRequestProcessors for one or more of the Querqy rewriters on the OpenSearch side against whatever Querqy Unplugged version we like; we can pin to a specific version of Querqy or upgrade as we see fit for the OpenSearch project.
  • Keep it Open: We can begin to incorporate abstracted SearchRequestProcessors that will allow non-Querqy rewriters to be incorporated; these can be good candidates long term for inclusion in Querqy or stand-alone rewriters.

Drawbacks

  • Managing the same dependency over different components could be a challenge.

Other possibilities

  • Move the plugin as is and do not take a dependency on querqy unplugged.
  • Move the plugin as is and do not integrate with search

Questions:

  • Are there other libraries besides Querqy with similar functionality that could be open-sourced or adapted to Search Pipelines? This could strengthen the use case for using Search Pipelines and could also create a path for builders on OpenSearch who have custom rewriters but don’t want to maintain this functionality. They could then just move to Querqy processors.
  • What options for query rewriter integration do we have?
    • Plugin
    • SearchRequestProcessor (note that Plugins and SearchRequestProcessors are not mutually exclusive)
    • Other?
  • Would you be more inclined to use rewriters as part of a search pipeline or as a standalone plugin?
@macohen macohen added enhancement change or upgrade that increases software capabilities beyond original client specifications untriaged labels Jul 26, 2023
@macohen
Collaborator Author

macohen commented Jul 26, 2023

Notes from our 2023-07-26 Public Meeting

  • Can’t you accomplish the same using the Painless script request processor?
    • some, but rule management can be unwieldy at some point
  • dropping a rules file could work in open source or a rest api or some way to store files.

@macohen macohen removed the untriaged label Jul 26, 2023
@JohannesDaniel

The core benefit that Querqy gives its users is to maintain rules on a query level and to build a query tree that leads to clean scores. Parts of this are described here: https://opensourceconnections.com/blog/2021/10/19/fundamentals-of-query-rewriting-part-1-introduction-to-query-expansion/

For the retail area, it is very important that scores are clean. For instance, it should not make a difference whether a multiword synonym has more or fewer terms (e.g. apple smartphone vs. iphone).
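
One crude way to get that property, shown only to make the idea concrete, is to score every synonym variant with the same constant and take the best variant rather than the sum. This is an assumption-laden sketch, not a description of how Querqy builds its query tree; the `title` field and the helper names are made up. It uses the standard OpenSearch `dis_max`, `constant_score`, `bool` and `term` queries.

```python
def variant_clause(field, variant, score=1.0):
    """All tokens of a variant must match; the clause then scores a fixed value."""
    tokens = variant.split()
    return {
        "constant_score": {
            "filter": {"bool": {"must": [{"term": {field: t}} for t in tokens]}},
            "boost": score,
        }
    }


def synonym_group(field, variants):
    """dis_max takes the best-scoring variant instead of adding the variants up."""
    return {"dis_max": {"queries": [variant_clause(field, v) for v in variants]}}


group = synonym_group("title", ["iphone", "apple smartphone"])
```

With this shape, a document matching "apple smartphone" scores the same as one matching "iphone", regardless of how many terms each variant has. The trade-off is that term statistics no longer contribute to the score of this clause.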

Retailers normally have thousands of business rules for various different reasons, which cannot be implemented in a generic manner, such as

  • boost products that are not that relevant for a certain query (e.g. they might be part of a discount campaign for a certain time)
  • apply a filter on a certain (very frequent / short-head) query to improve its precision
  • ensure that products are included in the response to a query, irrespective of whether they fully match or not (e.g. for ads)

@JohannesDaniel

Retail search is quite specific regarding two aspects:

  1. For many retailers, the short-head (very frequent & repetitive queries) is already about 50% of the total search traffic.
  2. Retailers frequently have very specific business logic to implement, e.g. they might have a larger offering of products, but make their main business only with a small subset of products or only a specific category.

You have a lot of quick wins in this area if you are flexible to specifically deal with short-head queries.

@macohen
Collaborator Author

macohen commented Aug 8, 2023

Thanks @JohannesDaniel. I think I get what you're saying. I was thinking about this in terms of how we implement a search processor and less about the specifics of retail query rewriting. For example, when reranking search results, there are many ways to do this: multiple services, scripting inside OpenSearch, Learning-to-Rank. But the core pattern is the same: take the search results, transform them if the reranker needs the OpenSearch hits in a different shape, rerank, return the results, transform again if needed, and then profit. ;) Of course, something like Learning-to-Rank also requires judgements, feature generation, and feature logging, but those can be handled elsewhere and integrated with clean APIs.

So, when we think about how anything should be integrated into a Search Pipeline, I want to understand if there is a layer of abstraction we can introduce to make it possible to have different types of replace rewriters so OpenSearch is able to accommodate other types of rewriters for other use cases that may not require specific rulesets. This is something we can discover in the design/prototyping of the processors themselves.
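
As a starting point for that discussion, the abstraction could be as small as a single-method interface that both rule-backed and rule-free rewriters satisfy. The sketch below is hypothetical: `QueryRewriter`, `RuleBasedReplace`, `WhitespaceNormalizer` and `apply_rewriters` are names invented here, and treating the query as a plain string is a simplifying assumption.

```python
from typing import Dict, List, Protocol


class QueryRewriter(Protocol):
    """Anything that can turn one query string into another."""

    def rewrite(self, query: str) -> str: ...


class RuleBasedReplace:
    """A rewriter that depends on a managed ruleset."""

    def __init__(self, rules: Dict[str, str]) -> None:
        self.rules = rules

    def rewrite(self, query: str) -> str:
        return " ".join(self.rules.get(t, t) for t in query.split())


class WhitespaceNormalizer:
    """A rewriter that needs no rules at all."""

    def rewrite(self, query: str) -> str:
        return " ".join(query.lower().split())


def apply_rewriters(rewriters: List[QueryRewriter], query: str) -> str:
    """Run the rewriters in order; the caller does not care how each one works."""
    for rewriter in rewriters:
        query = rewriter.rewrite(query)
    return query


result = apply_rewriters(
    [WhitespaceNormalizer(), RuleBasedReplace({"webber": "weber"})],
    "  Gas   Grill Webber ",
)
```

A processor built against an interface like this would accept a Querqy-backed rewriter and a custom one interchangeably, which is the kind of flexibility the prototyping could validate.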

@JohannesDaniel

Notes from our 2023-07-26 Public Meeting

  • Can’t you accomplish the same using the Painless script request processor?

    • some, but rule management can be unwieldy at some point
  • dropping a rules file could work in open source or a rest api or some way to store files.

You cannot maintain tons of business rules with Painless scripts. Furthermore, Querqy checks whether queries meet certain attributes against hundreds or thousands of rules (retail companies usually maintain that many business rules). This requires specific optimizations and proper rewriting.
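
To illustrate the kind of optimization meant here (this is a generic sketch, not Querqy's actual data structure): if rules are indexed by their input terms, matching a query costs a handful of hash lookups no matter how many rules exist, instead of a scan over every rule. The `RuleIndex` class is hypothetical, and the instruction strings are treated as opaque labels that are not parsed.

```python
from collections import defaultdict


class RuleIndex:
    """Index rules by their input token sequence for constant-time lookup."""

    def __init__(self):
        self._rules = defaultdict(list)
        self._max_len = 1  # longest rule input, in tokens

    def add(self, input_terms, instruction):
        key = tuple(input_terms.lower().split())
        self._rules[key].append(instruction)
        self._max_len = max(self._max_len, len(key))

    def match(self, query):
        """Return (matched input, instruction) pairs for every token span that has rules."""
        tokens = query.lower().split()
        hits = []
        for start in range(len(tokens)):
            longest = min(self._max_len, len(tokens) - start)
            for length in range(1, longest + 1):
                key = tuple(tokens[start:start + length])
                if key in self._rules:  # membership test does not create new entries
                    hits.extend((" ".join(key), instr) for instr in self._rules[key])
        return hits


index = RuleIndex()
index.add("apple smartphone", "SYNONYM: iphone")
index.add("grill", "UP(100): brand:Weber")
hits = index.match("cheap apple smartphone")
```

The number of lookups depends on the query length and the longest rule input, not on the number of rules, which is hard to reproduce by hand in a script that lists rules one by one.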

Projects
Status: 👀 In review
2 participants