
Merge or Add data into "ondisk" indices corrupts index ids #3498

Open

pablocael opened this issue Jun 7, 2024 · 2 comments
pablocael commented Jun 7, 2024

Summary

When using on-disk inverted list indices (backed by an .ivfdata file), any of the following methods:

  • merge_from
  • merge_into
  • add_with_ids

will corrupt the index ids.

Platform

Linux Ubuntu 22.04

Faiss version: 1.8.0

Installed from: pip

Faiss compilation options: n/a

Running on:

  • CPU
  • GPU

Interface:

  • C++
  • Python

Reproduction instructions

The tar.gz file below contains a self-contained test that reproduces the problem, along with a README explaining the issue and the test.

The sample code attempts to add data to an on-disk index using the three methods above, with an in-memory index as a baseline. The in-memory operations always succeed, while the on-disk operations always fail.

bug-report-faiss180-ondisk-update.tar.gz

Extract the archive and run the test:

python main.py
pablocael changed the title from "Merge or Add data into "ondisk" indices corrupts the indices" to "Merge or Add data into "ondisk" indices corrupts index ids" on Jun 10, 2024
Contributor

mdouze commented Jun 21, 2024

About the merge_into inconsistency: this happens because the baseline_index (which manages index.ivfdata) is opened twice, once as baseline_index and once as empty_trained in add_data. Therefore the ivfdata content and the index are inconsistent in one of the two cases.
Also, adding vectors directly to an on-disk index is not recommended, since it is very slow. Instead, add vectors to one or several in-memory indexes and merge them afterwards, as in:

https://github.com/facebookresearch/faiss/blob/e758973fa08164728eb9e136631fe6c57d7edf6c/demos/demo_ondisk_ivf.py

Author

pablocael commented Jun 22, 2024

> About the merge_into inconsistency: this is because the baseline_index (that manages index.ivfdata) is opened twice, once as baseline_index and once as empty_trained in add_data. Therefore there is an inconsistency between the ivfdata content and the index in one of the two cases. Also, it is not recommended to add vectors to an on-disk index, which is very slow. Instead, add vectors to one or several in-memory indexes and merge them afterwards, as in
>
> https://github.com/facebookresearch/faiss/blob/e758973fa08164728eb9e136631fe6c57d7edf6c/demos/demo_ondisk_ivf.py

Thanks for checking on this @mdouze.

I don't think the baseline index is what is causing the issue.

I have simplified the test a lot: I removed the baseline indices and now use only merge_from to add new data from an in-memory index. I am also fully isolating all indices. The problem remains; this is the new, simpler test:

bug-report-faiss180-ondisk-update-v2.tar.gz

Reproducing the issue is, in fact, very easy: open any on-disk index, add data to it using any preferred strategy, then check the ids inside the index. About 13% of the ids will have negative values after insertion.

There is another report on the same issue.

Also, this issue occurs in my application just by loading a single healthy "ondisk" index and adding data to it.

About adding to the on-disk index being slow: I understand, but we have no low-latency requirement for adding data right now. I have no preference about the method for adding new data, as long as I can add it to the on-disk index.

I want to help investigate / solve this issue if possible. I understand that you are quite swamped with other requests, so please let me know if I can help by creating other tests to reproduce / debug.

Thanks
