Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

merge_from corrupts the resulting index by default #3651

Closed
2 of 4 tasks
guillaumeguy opened this issue Jul 18, 2024 · 8 comments
Closed
2 of 4 tasks

merge_from corrupts the resulting index by default #3651

guillaumeguy opened this issue Jul 18, 2024 · 8 comments

Comments

@guillaumeguy
Copy link

guillaumeguy commented Jul 18, 2024

Summary

Sharding index hydration is a common pattern for us.

After hydration, the user can then combine the indices with index.merge_from(index2). However, this could corrupt the index silently and it is very hard to debug (each individual index is fine, the merge is wrong). The right syntax is index.merge_from(index2, index.ntotal).

Unhelpfully, SO even recommends the bad syntax.

It would be nice to either default to the right pattern or emit a warning.

Platform

Linux 22.04

OS:

Faiss version:
1.8.0
Installed from:
pip
Faiss compilation options:

Running on:

  • CPU
  • GPU

Interface:

  • C++
  • Python

Reproduction instructions

# Pseudo-code

# Normal training
index_str = 'OPQ256_1280,IVF2048_HNSW32,PQ256x8'
index = faiss.index_factory(EMB_DIM, index_str, faiss.METRIC_INNER_PRODUCT)
faiss.normalize_L2(embs)
index.train(embs)

# Shard 1 
faiss.normalize_L2(embs)
index1.add(embs)

## shard 2 
faiss.normalize_L2(embs)
index2.add(embs)

# combine 
index1.merge_from(index2) 

# search
index1.search(q, 10) # This will return bad results
@kuarora
Copy link
Contributor

kuarora commented Jul 18, 2024

Hi,

I see quite a few tests on merge_from in faiss/tests/test_merge_index.py
are you able to reproduce it with a standard dataset, or may be a test?

@guillaumeguy
Copy link
Author

guillaumeguy commented Jul 19, 2024

@pablocael
Copy link

Hi, can this be related to the issue I posted and was (imho wrongly) flaged as invalid? #3498

@guillaumeguy
Copy link
Author

Hi @kuarora: Does the notebook work for you?

@mdouze
Copy link
Contributor

mdouze commented Jul 24, 2024

Maybe we should re-visit this

@kuarora
Copy link
Contributor

kuarora commented Jul 24, 2024

Hi @kuarora: Does the notebook work for you?

Yes, I am able to reproduce only when we don't use offset.
Once we discuss internally, I'll get back to you.

@mdouze
Copy link
Contributor

mdouze commented Aug 2, 2024

There are two ways of constructing an IVF index:

  • with add(): the ids are allocated sequentially by the index
  • with add_with_ids(): the ids are provided by the user.

In the first case, when merging, the ids of the second index should be shifted to follow the ones from the first index (hence the add_id = index1.ntotal)
In the second case, we assume the user manages their own ids (that could be internal ids), so when merging, they should be left unchanged (add_id = 0).
Faiss does not keep track of which add or add_with_ids was used to populate the indexes, so it trusts the user to provide the add_id parameter to determine how the index should be processed.

Maybe we could issue a warning but it's not even clear in which of the two cases we should warn.

@kuarora
Copy link
Contributor

kuarora commented Aug 10, 2024

Hi,

I had a chance to debug this issue further.

merge_from is working as expected and store codes and ids in list with closest centroid but reconstruct fails as direct map (array list invl) is expected to work with sequential ids and no-collision between ids. Once merge, it is very difficult to distinguish which embedding to return upon reconstruction if there is collision as offset for ids will be one last seen.

So, we can’t use directmap for reconstruction when there is id collision.

Possible warning we could throw when direct map is created. During its population, we can throw warning if id is already populated.
I was able to reproduce it with smaller dataset and ivfflat index, where ideally, reconstruction should have zero error.

Bento changes:

def run(q, embs1, embs2, single_index, offset=0):
    index1 = faiss.read_index(p)
    index2 = faiss.read_index(p)

    # index1 = faiss.index_factory(d, index_str, faiss.METRIC_INNER_PRODUCT)
    # index1.train(training_data)

    # index2 = faiss.index_factory(d, index_str, faiss.METRIC_INNER_PRODUCT)
    # index2.train(training_data)

    # Shard 1
    faiss.normalize_L2(embs1)
    index1.add(embs1)
    faiss.normalize_L2(embs2)

    ## shard 2
    if single_index:
        index1.add(embs2)
    else:
        index2.add(embs2)
        print(index1.ntotal, index2.ntotal)
        # combine
        # if offset > 0:
        #     index1.merge_from(index2, offset)
        # else:
        #     index1.merge_from(index2)
        index1.merge_from(index2, offset)
    print(index1.ntotal)

    # search
    D, I = index1.search(q, 5)
    print(D, I)
    faiss.extract_index_ivf(index1).make_direct_map()

    reconstructed = index1.reconstruct_batch([int(id) for id in np.ravel(I)])
    print (reconstructed)

    print(np.diagonal((reconstructed @ np.vstack([embs1, embs2])[np.ravel(I)].T)))


Compare following two 
# single index
run(q, embs1, embs2, True, 0)

# single index
run(q, embs1, embs2, False, 0)

Screenshot 2024-08-09 at 4 46 42 PM

@kuarora kuarora closed this as completed Aug 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants