[FEATURE] Support for setting a default min/max score for the upcoming Normalization and Score Combination feature #150
Comments
@SeyedAlirezaFatemi Thanks for creating the issue.
I noticed that the paper "An Analysis of Fusion Functions for Hybrid Retrieval" also mentions a min-max normalization method ([…]). They also mention: […] So it would be really nice to have this feature of defining a default min value for the normalization and getting the max from the data.
@SeyedAlirezaFatemi thanks for providing this info. I will look into this. We are still in the development phase of the original scope.
@SeyedAlirezaFatemi, is the inconsistent pagination result the main reason for supporting this? Even with the customer-provided min/max score, the inconsistency in pagination will still occur. There's an ongoing project aimed at improving pagination consistency for hybrid search. It would be great if you could take a look at #933 and share your thoughts on whether this feature would still provide value.
The way we implement pagination will not eliminate the problem described in this issue. We allow the user to provide the size of the window for pagination with the new parameter […]. @SeyedAlirezaFatemi, did you have a chance to review @heemin32's question and the RFC for pagination in hybrid query mentioned there (#933)?
Is your feature request related to a problem?
Related to RFC. The current problem with the RFC is that when we combine scores from different queries (e.g. BM25 and kNN), we need the min and max score of each query part. However, with approximate kNN we cannot accurately calculate the min score unless we run an exact kNN search on the index, which is not feasible. This leads to inconsistent score normalization, particularly when using pagination.
What solution would you like?
As discussed in detail in the RFC, one solution is to rely on the statistics we get from the documents we see during the current query. However, in specific scenarios where the min score is known in advance, we can do better. For example, when using BM25 or cosine similarity in kNN, the user can optionally define the min score in the query to be 0 and -1, respectively.
By allowing the user to optionally define a min/max score in the query for normalization, we can ensure consistent score normalization across different queries for specific scenarios, particularly when using pagination. This would improve the accuracy and reliability of the search results for users.
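A minimal sketch of what such an optional override could look like, written in Python purely for illustration. The function name `min_max_normalize` and the parameters `min_score` and `max_score` are hypothetical and are not part of the RFC or the plugin API; the handling of a degenerate range and the clamping to [0, 1] are assumed conventions as well.

```python
from typing import List, Optional, Sequence


def min_max_normalize(
    scores: Sequence[float],
    min_score: Optional[float] = None,
    max_score: Optional[float] = None,
) -> List[float]:
    """Min-max normalize scores, optionally with caller-supplied bounds."""
    if not scores:
        return []
    # Fall back to the statistics of the observed scores when no bound is given.
    lo = min(scores) if min_score is None else min_score
    hi = max(scores) if max_score is None else max_score
    if hi <= lo:
        # Degenerate range; returning 1.0 for every score is an assumed convention.
        return [1.0] * len(scores)
    # Clamp to [0, 1] in case a score falls outside the supplied bounds (assumption).
    return [min(1.0, max(0.0, (s - lo) / (hi - lo))) for s in scores]


# BM25 scores are non-negative, so the user fixes the min at 0 and takes the max from the data.
bm25 = min_max_normalize([12.0, 6.0, 3.0], min_score=0.0)

# Cosine similarity lies in [-1, 1], so both bounds can be fixed up front.
cosine = min_max_normalize([0.8, 0.2], min_score=-1.0, max_score=1.0)
```

With the min fixed at 0, the BM25 scores above normalize to 1.0, 0.5, and 0.25 regardless of which other documents happen to be in the result window.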
Here is an example where we have the issue of pagination inconsistency when we use the general solution:
Let's assume we have a query that consists of a text match query and a kNN query and we use this formula for score normalization:
x_normalized = (x - min) / (max - min)
and we set the page size to 10. Assume the top 10 kNN scores lie between 1 and 0.9, and the scores of all remaining documents drop to 0. Normalization computed over the first page alone uses min = 0.9, whereas normalization that also sees later documents uses min = 0, so the normalized scores change drastically when we go to the next page. This can cause pagination inconsistency, with results missing or appearing twice.
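The scenario can be reproduced with a few lines of Python. This is a simplified sketch: the concrete score values are made up to match the description, the helper `normalize` is hypothetical, and "window" here simply means the set of documents the normalizer gets to see, which glosses over how the hybrid query actually collects results per shard.

```python
def normalize(scores, min_score=None):
    """Min-max normalize; min comes from the data unless min_score is given."""
    lo = min(scores) if min_score is None else min_score
    hi = max(scores)
    return [(s - lo) / (hi - lo) for s in scores]


# Top 10 kNN scores between 1 and 0.9, then the rest fall to 0 (illustrative values).
knn_scores = [1.0, 0.99, 0.98, 0.97, 0.96, 0.95, 0.94, 0.93, 0.92, 0.9] + [0.0] * 10

# Data-derived min: the 10th document's normalized score depends on the window.
seen_10 = normalize(knn_scores[:10])   # min = 0.9
seen_20 = normalize(knn_scores[:20])   # min = 0.0
print(seen_10[9], seen_20[9])

# User-supplied min of 0: the same document gets the same score in both windows.
fixed_10 = normalize(knn_scores[:10], min_score=0.0)
fixed_20 = normalize(knn_scores[:20], min_score=0.0)
print(fixed_10[9], fixed_20[9])
```

With the data-derived min, the 10th document scores 0.0 when only the first page is seen and 0.9 once the zero-scored documents enter the window, which is enough to reorder it against the text-match results. With the min fixed at 0, its normalized score is 0.9 in both cases.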