-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
merge_from corrupts the resulting index by default #3651
Comments
Hi, I see quite a few tests on merge_from in faiss/tests/test_merge_index.py |
Hi @kuarora Sure. See notebook attached. https://gist.github.com/guillaumeguy/1f992f4f64e5180c8318f435f991d5ce |
Hi, can this be related to the issue I posted and was (imho wrongly) flaged as invalid? #3498 |
Hi @kuarora: Does the notebook work for you? |
Maybe we should re-visit this |
Yes, I am able to reproduce only when we don't use offset. |
There are two ways of constructing an IVF index:
In the first case, when merging, the ids of the second index should be shifted to follow the ones from the first index (hence the add_id = index1.ntotal) Maybe we could issue a warning but it's not even clear in which of the two cases we should warn. |
Hi, I had a chance to debug this issue further. merge_from is working as expected and store codes and ids in list with closest centroid but reconstruct fails as direct map (array list invl) is expected to work with sequential ids and no-collision between ids. Once merge, it is very difficult to distinguish which embedding to return upon reconstruction if there is collision as offset for ids will be one last seen. So, we can’t use directmap for reconstruction when there is id collision. Possible warning we could throw when direct map is created. During its population, we can throw warning if id is already populated. Bento changes:
|
Summary
Sharding index hydration is a common pattern for us.
After hydration, the user can then combine the indices with
index.merge_from(index2)
. However, this could corrupt the index silently and it is very hard to debug (each individual index is fine, the merge is wrong). The right syntax isindex.merge_from(index2, index.ntotal)
.Unhelpfully, SO even recommends the bad syntax.
It would be nice to either default to the right pattern or emit a warning.
Platform
Linux 22.04
OS:
Faiss version:
1.8.0
Installed from:
pip
Faiss compilation options:
Running on:
Interface:
Reproduction instructions
The text was updated successfully, but these errors were encountered: