Adding support for generic re-ranker interface and opensearch ml re-ranker for improving search relavancy. #494

Merged
@heemin32 merged 27 commits into opensearch-project:feature/reranker on Jan 16, 2024

Conversation

HenryL27
Contributor

@HenryL27 HenryL27 commented Nov 18, 2023

Description

Adds a rerank processor interface and cross-encoder rerank processor implementation

PUT /_search/pipeline/rerank-pipeline
{
  "response_processors": [
    {
      "rerank": {
        "text_similarity": {
          "model_id": <model_id>
        },
        "context": {
          "rerank_context_field": [<list_of_fields_to_rerank_based_on>]
        }
      }
    }
  ]
}

Search with

POST index/_search?search_pipeline=rerank-pipeline
{
  "query": {...},
  "ext": {
    "rerank": {
      "query_context": {
        "query_text | query_text_path": <question to rerank off of | path in query to question to rerank off of>
      }
    }
  }
}
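
As a rough illustration of the two options under query_context, the sketch below resolves the text to rerank against from either a literal query_text or a dotted query_text_path into the search request body. The class and method names are hypothetical, and the rule that exactly one of the two keys must be set is an assumption rather than something this description states.

```java
import java.util.Map;

// Hypothetical helper, not part of the PR: picks the query text used for reranking.
final class QueryContextSketch {

    static String resolveQueryText(Map<String, Object> queryContext, Map<String, Object> searchRequest) {
        Object text = queryContext.get("query_text");
        Object path = queryContext.get("query_text_path");

        // Assumption: exactly one of the two keys may be present.
        if ((text == null) == (path == null)) {
            throw new IllegalArgumentException("specify exactly one of query_text or query_text_path");
        }
        if (text != null) {
            return String.valueOf(text);
        }

        // Walk the request body one path segment at a time, e.g. "query.match.body.query".
        Object node = searchRequest;
        for (String key : String.valueOf(path).split("\\.")) {
            if (!(node instanceof Map)) {
                throw new IllegalArgumentException("query_text_path does not resolve: " + path);
            }
            node = ((Map<?, ?>) node).get(key);
        }
        if (node == null) {
            throw new IllegalArgumentException("query_text_path does not resolve: " + path);
        }
        return String.valueOf(node);
    }
}
```

With a request whose query is a match on a field named body, a query_text_path of query.match.body.query would select the match text itself, so the question does not have to be repeated in ext.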

Issues Resolved

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@HenryL27
Contributor Author

@navneet1v @vamshin Reranking

@navneet1v
Collaborator

@HenryL27 before I can review this PR, can we make sure that GH actions are successful

@HenryL27
Contributor Author

> @HenryL27 before I can review this PR, can we make sure that GH actions are successful

Sure. It's blocked behind opensearch-project/ml-commons#1615, but once that gets merged, which should be soon (right @ylwu-amzn?) this should hopefully do better

@navneet1v
Collaborator

> @HenryL27 before I can review this PR, can we make sure that GH actions are successful
>
> Sure. It's blocked behind opensearch-project/ml-commons#1615, but once that gets merged, which should be soon (right @ylwu-amzn?) this should hopefully do better

Please go ahead and resolve the conflict too.

@navneet1v
Collaborator

@HenryL27 this PR is not updated with the recent comments I added on your RFC here: #485 (comment)

I don't see any response from your side on the interface changes that were recommended. Hence pasting the comment here. Please check those comments.

@HenryL27
Contributor Author

HenryL27 commented Dec 1, 2023

> @HenryL27 this PR is not updated with the recent comments I added on your RFC here: #485 (comment)
>
> I don't see any response from your side on the interface changes that were recommended. Hence pasting the comment here. Please check those comments.

So sorry! Thank you for reminding me about this

@HenryL27
Contributor Author

HenryL27 commented Dec 1, 2023

bug that I came across: if the reranking_context_field doesn't exist in one of the search results, this fails (with npe). I'm thinking the correct behavior in this case is to assign the lowest seen score to such docs? @martin-gaievski wdyt?

@HenryL27 HenryL27 marked this pull request as draft December 1, 2023 23:10
@martin-gaievski
Member

  • Cross Encoder PR

> bug that I came across: if the reranking_context_field doesn't exist in one of the search results, this fails (with npe). I'm thinking the correct behavior in this case is to assign the lowest seen score to such docs? @martin-gaievski wdyt?

Do you know why reranking_context doesn't exist? Without more info it's hard to decide what score we should assign. The lowest seen score may not be the best option in some cases: missing context means there is no match, whereas the lowest score means there is a hit, just with the lowest score.

@HenryL27
Contributor Author

HenryL27 commented Dec 2, 2023

The context field doesn't exist because it simply wasn't present in that particular document. I was using a parent-children index, and the parent doesn't have a text_representation. But hmm, in cases where the index or processor was misconfigured and no hit has the field at all... I feel like that should get its own special error message.
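
The thread does not settle on a behavior here. Purely as a sketch of the option floated above (hits that lack the context field get the lowest score seen among the others, and a dedicated error is raised when no hit has the field at all), with hypothetical names and a stand-in scorer in place of a real model call:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch, not the PR's code. A null entry in `contexts`
// stands for a search hit that has no rerank context field.
final class MissingContextSketch {

    static List<Double> rescore(List<String> contexts, ToDoubleFunction<String> scorer) {
        List<Double> scores = new ArrayList<>();
        double lowest = Double.POSITIVE_INFINITY;
        boolean anyScored = false;

        for (String context : contexts) {
            if (context == null) {
                scores.add(null); // placeholder, filled in after the lowest score is known
                continue;
            }
            double score = scorer.applyAsDouble(context);
            lowest = Math.min(lowest, score);
            anyScored = true;
            scores.add(score);
        }

        // Likely misconfiguration: there are hits, but none of them carries the field.
        if (!contexts.isEmpty() && !anyScored) {
            throw new IllegalStateException("no search hit contains the rerank context field");
        }

        // Hits without context fall back to the lowest score seen.
        for (int i = 0; i < scores.size(); i++) {
            if (scores.get(i) == null) {
                scores.set(i, lowest);
            }
        }
        return scores;
    }
}
```

This treats a missing field as "ranked no higher than the worst scored hit", which is exactly the ambiguity raised above: such a document becomes indistinguishable from a real hit with the lowest score.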

@HenryL27 HenryL27 marked this pull request as ready for review December 5, 2023 21:43
@HenryL27
Contributor Author

HenryL27 commented Dec 5, 2023

opensearch-project/ml-commons#1615 is merged, so we can probably run the workflow now?

Collaborator

@navneet1v navneet1v left a comment

Took a first pass on the PR and was only able to get as far as DocumentContextSourceFetcher.java. Will do the next review once the above comments are resolved and the code is updated based on the suggestions provided on the RFC.

* @param label label of a RerankType
* @return RerankType represented by the label
*/
public static RerankType from(String label) {
Collaborator

Do we need to create this function?

can we use RerankType.valueOf() function provided in Enum classes?

Contributor Author

It's a capitalization thing... .valueOf() wants an exact match, which would mean that I either lowercase my RerankTypes or uppercase the API. Would it be easier to digest this if I used a hash instead? I don't think I should have to call .toUpperCase() on all my strings.
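
A sketch of the map-based ("hash") lookup mentioned above, contrasted with Enum.valueOf, which only matches the constant name exactly. The enum name, its single constant, and the lowercase label are illustrative assumptions, not the PR's actual code.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stand-in for RerankType: each constant carries the label used in the API.
enum RerankTypeSketch {
    ML_OPENSEARCH("ml_opensearch");

    // Built once after the constants exist, so from() is a single map lookup.
    private static final Map<String, RerankTypeSketch> BY_LABEL = Arrays.stream(values())
        .collect(Collectors.toMap(type -> type.label, Function.identity()));

    private final String label;

    RerankTypeSketch(String label) {
        this.label = label;
    }

    String getLabel() {
        return label;
    }

    // Looks up by API label, so callers never need to change the case of their strings.
    static RerankTypeSketch from(String label) {
        RerankTypeSketch type = label == null ? null : BY_LABEL.get(label);
        if (type == null) {
            throw new IllegalArgumentException("unknown rerank type: " + label);
        }
        return type;
    }
}
```

Here RerankTypeSketch.from("ml_opensearch") resolves the constant, while RerankTypeSketch.valueOf("ml_opensearch") throws IllegalArgumentException because valueOf expects the exact name ML_OPENSEARCH.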


private String contextFromSearchHit(final SearchHit hit, final String field) {
if (hit.getFields().containsKey(field)) {
return (String) hit.field(field).getValue();
Collaborator

Will this type cast work for integers?

Contributor Author

nope! but String.valueOf(.) does the right thing, right?
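
To make the difference concrete, here is a small sketch with hypothetical helper names, operating on a plain Object instead of the value returned by a SearchHit field. Casting fails at runtime for any non-String value, whereas String.valueOf converts it.

```java
// Hypothetical helpers illustrating the two conversions discussed above.
final class FieldValueSketch {

    // Throws ClassCastException when the field value is, say, an Integer.
    static String viaCast(Object value) {
        return (String) value;
    }

    // Works for any value; note that a null reference becomes the text "null".
    static String viaValueOf(Object value) {
        return String.valueOf(value);
    }
}
```

One caveat of String.valueOf is the null case: it yields the four-character string "null" instead of failing, so a missing value may still need an explicit check.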

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>
Signed-off-by: HenryL27 <hmlindeman@yahoo.com>
Signed-off-by: HenryL27 <hmlindeman@yahoo.com>
@vibrantvarun
Member

@navneet1v do we need to add BWC test here?

@navneet1v
Collaborator

> @navneet1v do we need to add BWC test here?

As this is the first release of the feature, we don't need it. But after the release, BWC tests need to be added.

@HenryL27
Contributor Author

looks like knn things are causing integ tests to fail. What's going on here?

@navneet1v
Collaborator

> looks like knn things are causing integ tests to fail. What's going on here?

A codec upgrade happened in OpenSearch due to the Lucene upgrade, and it impacted k-NN. The PR for k-NN is already raised and will be merged soon.

PR: opensearch-project/k-NN#1383 (review)

@navneet1v navneet1v changed the title from "Rerank" to "Adding generic re-ranker interface and opensearch ml re-ranker for improving search relavancy." Jan 12, 2024
@navneet1v navneet1v changed the title from "Adding generic re-ranker interface and opensearch ml re-ranker for improving search relavancy." to "Adding support for generic re-ranker interface and opensearch ml re-ranker for improving search relavancy." Jan 12, 2024
@navneet1v navneet1v added the Features, v2.12.0, and backport 2.x labels and removed the backport 2.x label Jan 12, 2024
Collaborator

@navneet1v navneet1v left a comment

Overall code looks good to me. Approving the PR.

As a next step, I will add more details to the RFC about what the next steps are after the PR is approved.

@HenryL27
Contributor Author

thanks!

@martin-gaievski martin-gaievski changed the base branch from main to feature/reranker January 16, 2024 18:37
@HenryL27
Contributor Author

"Only those with write access to this repository can merge pull requests."

@heemin32 heemin32 merged commit 3a7903f into opensearch-project:feature/reranker Jan 16, 2024
75 checks passed
@martin-gaievski
Member

> "Only those with write access to this repository can merge pull requests."

Will merge it now; it goes to a feature branch. We'll need to perform certain intake activities, like a review with the security team, which will be based on the feature branch. Only once that is completed can the code be merged to main.

@HenryL27
Contributor Author

ofc, thanks

ylwu-amzn pushed a commit to ylwu-amzn/neural-search that referenced this pull request Jan 26, 2024
…anker for improving search relavancy. (opensearch-project#494)

* Add rerank processor interfaces

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* add cross-encoder specific logic and factory

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* add unittests

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* add integration test

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* use string.format() instead of concatenation

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* rename generateScoringContext to generateRerankingContext

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* add name change in test too. whoops

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* start refactoring with contextSaourceFetchers

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* refactor to use contextSourceFetchers to get context

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* rename CrossEncoder to TextSimilarity

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* add query_context layer to search ext

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* add javadocs

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* update to new asyncProcessResponse api

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* rename reranktype to ML_OPENSEARCH

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* improve error messages for bad rerank type config

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* simplify configuration/factory logic

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* improve handling for non-flat-string context fields

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* rename TextSimilarity files to MLOpenSearch files

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* apply spotless after rebase

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* update changelog

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* after rebase

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* Address pr comments and fix XContent in search ext

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* move contextSourceFetchers to their own subdirectory

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* Apply suggestions from code review

Co-authored-by: Martin Gaievski <gaievski@amazon.com>
Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* CR changes

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* finish CR comments and fix broken unittest

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* fix unittest names

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

---------

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>
Co-authored-by: Martin Gaievski <gaievski@amazon.com>
martin-gaievski added a commit to martin-gaievski/neural-search that referenced this pull request Feb 6, 2024
…anker for improving search relavancy. (opensearch-project#494)

@martin-gaievski martin-gaievski mentioned this pull request Feb 6, 2024
5 tasks
martin-gaievski added a commit that referenced this pull request Feb 6, 2024
* Adding support for generic re-ranker interface and opensearch ml re-ranker for improving search relavancy. (#494)

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>
Co-authored-by: Heemin Kim <heemin@amazon.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>

---------

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
Co-authored-by: HenryL27 <hmlindeman@yahoo.com>
Co-authored-by: Heemin Kim <heemin@amazon.com>
martin-gaievski added a commit to martin-gaievski/neural-search that referenced this pull request Feb 6, 2024
* Adding support for generic re-ranker interface and opensearch ml re-ranker for improving search relavancy. (opensearch-project#494)

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>
Co-authored-by: Heemin Kim <heemin@amazon.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>

---------

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
Co-authored-by: HenryL27 <hmlindeman@yahoo.com>
Co-authored-by: Heemin Kim <heemin@amazon.com>
(cherry picked from commit 1bb48e2)
martin-gaievski added a commit to martin-gaievski/neural-search that referenced this pull request Feb 6, 2024
* Adding support for generic re-ranker interface and opensearch ml re-ranker for improving search relavancy. (opensearch-project#494)

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>
Co-authored-by: Heemin Kim <heemin@amazon.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>

---------

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
Co-authored-by: HenryL27 <hmlindeman@yahoo.com>
Co-authored-by: Heemin Kim <heemin@amazon.com>
(cherry picked from commit 1bb48e2)
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
martin-gaievski added a commit that referenced this pull request Feb 6, 2024
* Adding support for generic re-ranker interface and opensearch ml re-ranker for improving search relavancy. (#494)

(cherry picked from commit 1bb48e2)

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
Co-authored-by: HenryL27 <hmlindeman@yahoo.com>
Co-authored-by: Heemin Kim <heemin@amazon.com>
yuye-aws pushed a commit to yuye-aws/neural-search that referenced this pull request Mar 8, 2024
* Adding support for generic re-ranker interface and opensearch ml re-ranker for improving search relavancy. (opensearch-project#494)

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>
Co-authored-by: Heemin Kim <heemin@amazon.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>

---------

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
Co-authored-by: HenryL27 <hmlindeman@yahoo.com>
Co-authored-by: Heemin Kim <heemin@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Labels
Features Introduces a new unit of functionality that satisfies a requirement v2.12.0 Issues targeting release v2.12.0
5 participants