[FEATURE] Support for setting a default min/max score for the upcoming Normalization and Score Combination feature #150

SeyedAlirezaFatemi · 2023-04-06T13:54:48Z

Is your feature request related to a problem?

Related to RFC. The current problem with the RFC is that when we combine scores from different queries (e.g. BM25 and kNN), we need the min and max score of each query part. However, when using approximate kNN, we cannot accurately calculate the min score unless we run an exact kNN search on the index, which is not feasible. This leads to inconsistent score normalization, particularly when using pagination.

What solution would you like?

As discussed in detail in the RFC, one solution is to rely on the statistics we get from the documents we see during the current query. However, in specific scenarios where the min score can be known, we can do better. For example, when using BM25 or Cosine similarity in kNN, the user can optionally define the min score in the query to be 0 and -1, respectively.

By allowing the user to optionally define a min/max score in the query for normalization, we can ensure consistent score normalization across different queries for specific scenarios, particularly when using pagination. This would improve the accuracy and reliability of the search results for users.

Here is an example of the pagination inconsistency that arises with the general solution.
Assume a query that consists of a text match query and a kNN query, normalized with this formula:
x_normalized = (x - min) / (max - min)
with the page size set to 10. Assume the top 10 kNN scores are between 0.9 and 1, and the scores for the rest of the documents drop to 0. The observed min is then 0.9 on the first page but 0 once the next page is included, so the normalized scores change drastically between pages, and we might get pagination inconsistency with missing or duplicated results.
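To make the example concrete, here is a minimal Python sketch of the effect. The score values, the `min_max` helper, and its `lo`/`hi` parameters are all hypothetical and only illustrate the arithmetic; they are not part of the plugin's API.

```python
def min_max(scores, lo=None, hi=None):
    """Min-max normalize scores; lo/hi fall back to the observed min/max."""
    lo = min(scores) if lo is None else lo
    hi = max(scores) if hi is None else hi
    if hi == lo:
        # Degenerate window: every score is identical.
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Hypothetical kNN scores, sorted descending: ten strong hits, then a drop to 0.
knn_scores = [1.0, 0.99, 0.98, 0.97, 0.96, 0.95, 0.94, 0.93, 0.92, 0.9] + [0.0] * 10

page1_window = knn_scores[:10]  # documents seen when fetching page 1
page2_window = knn_scores[:20]  # documents seen when fetching page 2

# Observed min: the same document gets a different normalized score per window.
observed_p1 = min_max(page1_window)
observed_p2 = min_max(page2_window)

# User-supplied min of 0 (an assumption for this score range): scores are stable.
fixed_p1 = min_max(page1_window, lo=0.0)
fixed_p2 = min_max(page2_window, lo=0.0)

print(observed_p1[4], observed_p2[4])  # same document, different scores
print(fixed_p1[4], fixed_p2[4])        # same document, same score
```

With the observed min, the fifth document is normalized against 0.9 in the first window and against 0 in the second, which is enough to reorder it relative to the text match results after combination. With a fixed min, the only data-dependent statistic left is the max, which is the same in every window that starts from the top hit.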


navneet1v · 2023-04-06T16:52:42Z

@SeyedAlirezaFatemi Thanks for creating the issue.

SeyedAlirezaFatemi · 2023-07-19T14:33:03Z

@navneet1v @martin-gaievski

I noticed that in the "An Analysis of Fusion Functions for Hybrid Retrieval" paper, they also mention a min-max normalization method ($𝜙_{TMM}$, Equation 4) that uses the theoretical minimum of a function.
"As an example, when $𝑓_{LEX}$ is BM25, then its infimum is 0. When $𝑓_{SEM}$ is cosine similarity, then that quantity is −1."

They also mention:
"Interestingly, the behavior of $𝜙_{TMM}$ appears to be more robust to the data distribution—its peak remains within a small neighborhood as we move from one dataset to another. We believe the reason $𝜙_{TMM}$-normalized scores are more stable is because it has one fewer data-dependent statistic in the transformation (i.e., minimum score in the retrieved set is replaced with minimum feasible value regardless of the candidate set)."

So it would be really nice to have this feature of defining a default min value for the normalization and getting the max from the data.
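As a rough illustration of that idea, the sketch below replaces the observed minimum with the theoretical minimum quoted above (0 for BM25, -1 for cosine similarity) and still takes the maximum from the retrieved set. The `THEORETICAL_MIN` table, the `tmm_normalize` name, and the sample scores are hypothetical and not taken from the paper's code or from the plugin.

```python
# Theoretical minimum per scoring function, as quoted from the paper.
THEORETICAL_MIN = {"bm25": 0.0, "cosine": -1.0}


def tmm_normalize(scores, scorer):
    """Normalize with a fixed theoretical min and the max observed in the data."""
    lo = THEORETICAL_MIN[scorer]
    hi = max(scores)
    if hi <= lo:
        # No score exceeds the theoretical minimum; nothing to scale.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Hypothetical scores for three retrieved documents.
bm25_normalized = tmm_normalize([12.0, 6.0, 3.0], "bm25")
cosine_normalized = tmm_normalize([0.8, 0.2, -0.4], "cosine")

print(bm25_normalized)
print(cosine_normalized)
```

Because the lower bound no longer depends on which documents happen to be retrieved, enlarging the candidate set with lower-scoring documents leaves the normalized scores of the existing ones unchanged.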

navneet1v · 2023-07-19T16:14:47Z

@SeyedAlirezaFatemi thanks for providing this info. I will look into this. We are still in the development phase of the original scope.

heemin32 · 2024-11-20T22:32:59Z

@SeyedAlirezaFatemi, is the inconsistent pagination result the main reason for supporting this? Even with the customer-provided min/max score, the inconsistency in pagination will still occur. There's an ongoing project aimed at improving pagination consistency for hybrid search. It would be great if you could take a look at #933 and share your thoughts on whether this feature would still provide value.

martin-gaievski · 2025-01-09T02:03:50Z

The way we implement pagination will not eliminate the problem described in this issue. We allow the user to provide the size of the window for pagination with the new parameter pagination_depth, but that window will often be smaller than the number of actually matching docs. For instance, knn and neural queries will give some positive score to every document in the index. So technically this request makes sense, although I'm not sure how often it is needed in real-life use cases.

@SeyedAlirezaFatemi did you have a chance to review @heemin32's question and the mentioned RFC for pagination in hybrid query, #933?

SeyedAlirezaFatemi added the untriaged label Apr 6, 2023

navneet1v added Enhancements Increases software capabilities beyond original client specifications and removed untriaged labels Apr 6, 2023

navneet1v assigned vamshin Apr 6, 2023

navneet1v added Features Introduces a new unit of functionality that satisfies a requirement backlog All the backlog features should be marked with this label neural-search labels Apr 6, 2023

martin-gaievski mentioned this issue May 17, 2023

Add main classes for Query and basic unit tests #172

Merged

3 tasks

navneet1v mentioned this issue Sep 15, 2023

Hybrid search scoring is dependent on number of results requested #325

Closed

navneet1v added this to Vector Search RoadMap Jul 29, 2024

github-project-automation bot moved this to Backlog in Vector Search RoadMap Jul 29, 2024

minalsha added hybrid search and removed neural-search labels Jan 9, 2025

minalsha assigned owaiskazi19 and unassigned vamshin Jan 9, 2025
