Bug: index.search returns invalid keys when k > index size #393
Great catch, @andersonbcdefg! On it ;)
If you have a second to submit a PR with a minimal test case, that would help 🤗
@andersonbcdefg I have tried reproducing your issue and couldn't locate it in either the C++ or the Python layer. Please check out 7db0c39 🤗
## [2.11.4](v2.11.3...v2.11.4) (2024-04-11)

### Fix

* `#[repr(transparent)]` for `f16` and `b1x8` ([e182d77](e182d77))
* Non-native `f16` haversine ([c7922ff](c7922ff))

### Make

* Bump versions in `CMakeLists.txt` ([aadb717](aadb717))
* Revert WASI CI ([92e0b94](92e0b94))

### Test

* Oversubscribed search ([7db0c39](7db0c39)), closes [#393](#393)
* Re-seed NumPy PRNG ([8de87df](8de87df))
Nice! Thanks for the quick fix here :)
Not really a fix, @andersonbcdefg, as I couldn't reproduce it, but please let me know if the issue persists 🤗
I'm actually encountering this same issue.
This repro consistently produces 1367 random, extremely high values on my machine. Strangely, the number goes down to 490 if I compare to 0 later:
Lastly, adding a sleep before I do the read:
results in all 0s / no crazy values.
Should probably reopen @ashvardanian |
@fmmoret please check the other members of the returned structure, not just
Which ones are you interested in? |
Iterating over the search results or calling `.to_list()` mitigates the error, but both are comparatively very slow.
In my application, I'm doing lots of searches against multiple indices (some searches against the same index). I'm using locks around searching the same index.
EDIT: Weird: if I replay the same search on the queries that returned 0 matches, they come back with full matches. Very strange behavior.
I got around this issue with
The Python implementation doesn't have a great way to check that indexing has finished. The progress callback reports progress at intervals, but it does not tell you when indexing actually finished.
If you check the object, it has just 3 array fields -
This is a bad idea: vector-search structures have logarithmic complexity for most operations, so you want to avoid sharding. The bigger the index, the more efficient search becomes compared to brute force. Moreover, if you are dealing with only thousands of vectors, you don't need a vector-search index at all and may consider using SimSIMD.
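To illustrate the brute-force alternative at that scale: exact k-nearest-neighbor search is a one-liner-ish in plain NumPy (SimSIMD would accelerate the distance kernel itself; the `brute_force_search` function below is an illustrative sketch, not part of either library):

```python
import numpy as np

def brute_force_search(vectors: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Exact k-NN by cosine similarity over all rows.

    For a few thousand vectors this is often fast enough that no
    approximate index (and its recall trade-offs) is needed.
    """
    # Normalize rows so the dot product equals cosine similarity.
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    similarities = v @ q
    k = min(k, len(vectors))  # never return more keys than exist
    top = np.argpartition(-similarities, k - 1)[:k]
    return top[np.argsort(-similarities[top])]  # indices, best-first
```

Note the `min(k, len(vectors))` clamp: unlike the bug in this issue, oversubscribing this exact search simply returns everything.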
I understand your concern -- the case I shared is a toy example at local dev scale. My full scale app has multiple indices w/ millions of vectors. There are multiple indices because they are using fully different embedding spaces. |
Describe the bug
When creating a small index (size ~100) and searching it, I noticed it always returns some extremely large keys that are not actually present in the index. Here's an example of my own print debugging output:
Warning: 92 samples not in index_labels: [94264669291728, 119, 9223372036854775808, 94264674115280, 94264674115312, 208, 94278552092195, 139415592574144, 9223372036854775808, 753, 94264669262496, 94264673534624, 94264668185904, 94264676614832, 94264661017696, 94264673538080, 94264676662528, 94264682064784, 94264681824960, 107, 94264655611472, 94264655611504, 94264676636832, 113, 94264676436544, 186, 94264675185232, 94264675185264, 94264662133040, 189, 94264676662400, 94264676662432, 94264676992352, 94264673533232, 94264672356160, 257, 94264674115376, 208, 94278552062388, 139415592574144, 6305, 94264669268912, 139415592574144, 94264676412336, 94264676412304, 208, 94278552063349, 139415592574144, 5345, 94264669268912, 139415592574144, 2256, 94264669235504, 94264669279184, 94264674131696, 94264680759264, 94264669315648, 94264669315680, 94264669259168, 94264669259200, 94264676396400, 94264676396432, 94264676396464, 94264676396496, 94264674051952, 94264674051984, 114, 9223372036854775808, 94264676462544, 208, 94278552066357, 139415592574144, 11457, 94264669235920, 139415592574144, 94261647247312, 94264674130320, 208, 94278552067317, 139415592574144, 10497, 94264669235920, 139415592574144, 94261647247309, 94264669280224, 182, 9223372036854775808, 23013835270, 94278552069446, 94264669295984, 94264669271152, 94278552069510]
On further debugging, I realized this happens when I search for more "neighbors" than there are points in the index.
Steps to reproduce
Here is how I get the error: I create an Index, embed some documents (around 100), and then insert them. When I search for more than 100 neighbors, I get back invalid keys.
Expected behavior
index.search should only return keys that are present in the index. If searching for more items than there are in the index, I would expect either to get back only the items in the index (i.e. fewer than I asked for!) or to get some kind of error. Silently returning invalid keys is not user-friendly (in my opinion)!
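Until the library guarantees that, one defensive workaround is to clamp `k` on the caller's side. A minimal sketch, assuming the usearch Python binding where `len(index)` returns the number of stored vectors (`safe_search` is a hypothetical helper, not library API):

```python
def safe_search(index, query, k):
    # Clamp k so the search never asks for more neighbors than the
    # index holds, sidestepping the oversubscribed-search bug.
    return index.search(query, min(k, len(index)))
```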
USearch version
2.11.3
Operating System
Debian slim-bookworm
Hardware architecture
x86
Which interface are you using?
Python bindings
Contact Details
andersonbcdefg@gmail.com