Function score query omits relevant results on large dataset #298
Comments
Hi, thanks for the kind remarks. Permutation LSH is mostly intended for Cosine similarity. It can be computed w/ L1 and L2, but it doesn't really make sense to use them together. So if you need L2 similarity, I would recommend the Cosine LSH mapping and query: https://elastiknn.com/api/#cosine-lsh-mapping, https://elastiknn.com/api/#cosine-lsh-query. Those can of course also be used with the function score queries. Also, if you are bottlenecked by performance even after filtering, read this section of the docs: https://elastiknn.com/api/#using-stored-fields-for-faster-queries
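For example, the cosine LSH mapping and query would look roughly like this (elasticsearch-py style; the index name "events", field name "embedding", dims, and LSH parameters are placeholders, so check the linked docs for the exact schema):

```python
# Sketch of switching to the cosine LSH mapping and query. Index/field names,
# dims, and the LSH parameters are illustrative placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Mapping: hash vectors with cosine LSH instead of permutation LSH.
# Note: changing the model requires reindexing into a field mapped this way.
es.indices.put_mapping(
    index="events",
    body={
        "properties": {
            "embedding": {
                "type": "elastiknn_dense_float_vector",
                "elastiknn": {
                    "dims": 128,            # dimensionality of the stored vectors
                    "model": "lsh",
                    "similarity": "cosine",
                    "L": 99,                # number of hash tables
                    "k": 1,                 # hash functions concatenated per table
                },
            }
        }
    },
)

# Query: approximate cosine nearest neighbors against the re-mapped field.
query_vector = [0.1] * 128
res = es.search(
    index="events",
    body={
        "size": 100,
        "query": {
            "elastiknn_nearest_neighbors": {
                "field": "embedding",
                "vec": {"values": query_vector},
                "model": "lsh",
                "similarity": "cosine",
                "candidates": 1000,         # candidate pool re-scored exactly on each shard
            }
        },
    },
)
```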
Thanks for the quick reply. I just ran the above query with cosine similarity (the data is currently mapped with permutation_lsh) and got bad results.
With exact I am getting very good results, so I suspect the issue is related to the combination of function_score and permutation_lsh.
It's possible your data just isn't suited for permutation LSH. It basically assumes there are meaningful differences in the absolute values in your vectors, which is not always a good assumption. The docs explain in more detail how that algorithm works.
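As a simplified illustration of the idea (not Elastiknn's exact implementation): permutation LSH summarizes a vector by the indices of its largest-magnitude values, so if the magnitudes are all roughly equal, the summary carries very little information:

```python
# Toy sketch of the permutation-LSH intuition: a vector is represented by the
# indices of its k largest-magnitude entries, most important first.
def permutation_sketch(vec, k=3):
    """Return the indices of the k largest-magnitude entries of vec."""
    return [i for i, _ in sorted(enumerate(vec), key=lambda p: -abs(p[1]))[:k]]

a = [0.1, 9.0, 0.2, 7.5, 0.0]   # magnitudes differ a lot -> informative sketch
b = [0.5, 0.5, 0.5, 0.5, 0.5]   # near-uniform magnitudes -> sketch is mostly arbitrary
print(permutation_sketch(a), permutation_sketch(b))
```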
Note this caveat from the docs:
I'd need a sample of data and some way to easily reproduce the problem.
Oh, this caveat does explain the behavior I was witnessing! Just to make sure I understand, the caveat applies to any kind of "lsh" model (e.g. permutation_lsh), correct?
Hi Alex,
I hope this info is sufficient for you to check whether there is maybe a bug there concerning the permutation_lsh + function_score combination. I am planning to run this experiment again with model = lsh, and will update the results here soon.
Hey, thanks for all the detail. I'll try to find some time to review your results this week.
Ok, some follow-up questions:

Are you specifically using the

Just as a sanity check, when you did the function score query, did all of the returned docs have country = Belgium?

How many shards do you have in the index? Does the behavior change if you use a different number of shards?

I agree this is strange behavior. I have a guess for what's happening, but it could be completely wrong depending on the answers to the questions above.
Hi!
Yes.
Yes, when running the second "body" from the original post.
At the time of trying, we had number of shards = 1. I will retry with shards = 5, but only by the end of the week, and will keep you updated on the results. Thanks a lot, Yonatan.
Thanks. I'm wondering if the problem is that the query is matching the first 1k/2k/3k documents on the provided filter (Belgium) and then only scoring and re-ranking those first 1k/2k/3k documents that it matched. I don't see this behavior documented in the ES docs though. Hmm.
I'll try to find some time this week to reproduce this pattern. If you can, it would be interesting to see if you get similar results using the query rescorer.
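For reference, a rough sketch of that rescorer approach (elasticsearch-py; index, field, filter, and vector are placeholders, and the elastiknn_nearest_neighbors body follows my reading of the docs rather than your exact query):

```python
# Pre-filter with a standard query, then re-score only the top window of
# filtered hits with an exact nearest-neighbors query.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
query_vector = [0.1] * 128

res = es.search(
    index="events",
    body={
        # Standard query: match only documents passing the filter.
        "query": {"bool": {"filter": [{"term": {"country": "Belgium"}}]}},
        # Rescore: apply exact similarity to the top window_size filtered hits per shard.
        "rescore": {
            "window_size": 1000,
            "query": {
                "rescore_query": {
                    "elastiknn_nearest_neighbors": {
                        "field": "embedding",
                        "vec": {"values": query_vector},
                        "model": "exact",
                        "similarity": "l2",
                    }
                },
                "query_weight": 0.0,          # ignore the filter's constant score
                "rescore_query_weight": 1.0,  # rank purely by vector similarity
            },
        },
        "size": 100,
    },
)
```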
Follow up: I also did the following experiment, which confused me a bit. First I ran the query with k == candidates == 1000 and no country filter, then filtered the returned results to country == Poland afterwards; this left 72 results, which I will call the post-query filtered list. Then I ran a filtered query on the same vector, this time filtering country == Poland, using the syntax from above and again k == candidates == 1000.
This returned a list of 1000 events from Poland, which I will call the pre-query filtered list. I expected all 72 results from the post-query filtered list to appear in the pre-query filtered list, since k == candidates, but that was not the case. As a matter of fact, none of them were there. Maybe a related question:
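To make the terminology concrete, the two experiments looked roughly like this (index and field names, the vector, and the exact function_score syntax are placeholders rather than our real query bodies; see https://elastiknn.com/api for the authoritative schema):

```python
# "Post-query filtering" runs the approximate query alone and filters the hits
# afterwards; "pre-query filtering" pushes the country filter into a
# function_score query so only filtered documents are scored.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
query_vector = [0.1] * 128
knn = {
    "field": "embedding",
    "vec": {"values": query_vector},
    "model": "permutation_lsh",
    "similarity": "l2",
    "candidates": 1000,
}

# Post-query filtering: k == candidates == 1000, filter applied to the results.
res = es.search(index="events", body={"size": 1000, "query": {"elastiknn_nearest_neighbors": knn}})
post_filtered = [h for h in res["hits"]["hits"] if h["_source"]["country"] == "Poland"]

# Pre-query filtering: same vector, but only documents matching the filter are scored.
res = es.search(
    index="events",
    body={
        "size": 1000,
        "query": {
            "function_score": {
                "query": {"term": {"country": "Poland"}},
                "functions": [{"elastiknn_nearest_neighbors": knn}],
            }
        },
    },
)
pre_filtered = res["hits"]["hits"]

# The surprising observation: no overlap, even though k == candidates.
overlap = {h["_id"] for h in post_filtered} & {h["_id"] for h in pre_filtered}
print(len(post_filtered), len(pre_filtered), len(overlap))
```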
Thanks, this is useful information. It sounds like it could be enough to reproduce it with synthetic data. If I understand your description, the following is happening (quoting from above):
By trying it with more shards, you've distributed the relevant documents, so more of them show up in the first candidates documents matched by the filter on each shard. I'll try to reproduce this one day this week with some synthetic data in an integration test. As a final sanity check, could you see what happens when you set candidates to a value much larger than the number of documents matching the filter?
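Something like this would tell you how candidates compares to the number of filtered documents (elasticsearch-py; index, field, and filter names are placeholders):

```python
# Count how many documents match the pre-filter, then make sure `candidates`
# comfortably exceeds that count. Using the total count is a conservative bound
# even though matching happens per shard.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

n_matching = es.count(
    index="events",
    body={"query": {"term": {"country": "Belgium"}}},
)["count"]

settings = es.indices.get_settings(index="events")
n_shards = int(settings["events"]["settings"]["index"]["number_of_shards"])

candidates = max(1000, n_matching)
print(f"{n_matching} docs match the filter across {n_shards} shard(s); try candidates={candidates}")
```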
I changed the title to something that seems to describe the exact issue. I also added the bug tag because this seems like a genuine bug, or at the very least, some strange Elasticsearch behavior that should be documented.
Hi Alex, sorry for the big delay.
I think the fact that the results get better as the number of shards increases indicates something like what you describe... Please update if there are any new discoveries on the subject.
Hi @alexklibisz,
Not yet. I will hopefully have some time to look into it this week. There's also a development guide in the repo if anyone wants to look into it in the meantime.
I was able to reproduce the issue in #306, but so far no fix. It seems my guess was correct. It looks like Elasticsearch does roughly this:

1. Match documents with the standard query (e.g. the country filter).
2. Apply the score function only to the first candidates documents matched on each shard, in whatever order they were matched.
3. Rank and return the top hits from that subset.

Whereas the behavior we want is this:

1. Match documents with the standard query.
2. Apply the score function to every matching document.
3. Rank and return the top hits.
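To make the difference concrete, here is a toy model of the two behaviors in plain Python (no Elasticsearch involved; matches_filter and similarity stand in for the standard query and the score function):

```python
from typing import Callable, Dict, List

def observed_behavior(docs: List[Dict], matches_filter: Callable, similarity: Callable,
                      candidates: int, k: int) -> List[Dict]:
    """Score only the first `candidates` docs that match the filter, then return the top k."""
    first_matches = [d for d in docs if matches_filter(d)][:candidates]
    return sorted(first_matches, key=similarity, reverse=True)[:k]

def desired_behavior(docs: List[Dict], matches_filter: Callable, similarity: Callable,
                     k: int) -> List[Dict]:
    """Score every doc that matches the filter, then return the top k."""
    all_matches = [d for d in docs if matches_filter(d)]
    return sorted(all_matches, key=similarity, reverse=True)[:k]
```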
I looked through the docs and the FunctionScoreQuery implementation. I still don't see any option to apply the function to all documents that match the standard query. For now my best advice is to either keep candidates larger than the number of documents matching the filter, or pre-filter with a standard query and re-score the filtered hits with the query rescorer.
I posted a question on the Elastic forum. I'm curious what they say.
For some insight, I thought that maybe the
So far no response on the Elastic discussion board. If there's nothing by the end of the week, I'll most likely just update the docs to reflect this quirk and close this issue. I don't have the time right now to prioritize much more than that.
Hi!
I don't think so. All of that is managed by Elasticsearch so the best we can do is run the pre-filtering query as a standard query and get back the standard query response.
Hi! First of all, let me say I admire your work; it is truly amazing.
When running the following query:
I am getting great results. However, I cannot afford exact queries. When trying the following query:
(note that the only difference is "model": permutation_lsh instead of exact), I'm getting bad results.
In the examples above, self._similarity == "L2".
I hope I am not missing anything and wasting your time. Can this be supported?
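For reference, the two nearest-neighbors clauses differ only in the model parameter, roughly like this (field name, vector, and parameter values are placeholders, not the original query bodies):

```python
# Roughly the shape of the two clauses being compared; the only change is "model".
query_vector = [0.1] * 128

exact_clause = {
    "field": "embedding",
    "vec": {"values": query_vector},
    "model": "exact",
    "similarity": "l2",            # self._similarity == "L2"
}

approx_clause = {
    **exact_clause,
    "model": "permutation_lsh",    # swap in the approximate model...
    "candidates": 1000,            # ...which also takes a candidates count
}
```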