feat: add search encoder backend #3492
base: main
Conversation
Looks great overall!
I think it doesn't take advantage of faiss's built-in scoring functionality, but by not doing that we keep more control (so this can be ignored). If we wanted to use faiss, we could do something like:
```python
import numpy as np

# Batch-reconstruct candidate embeddings from the existing index
candidate_embs = np.vstack([
    self.index.reconstruct(idx) for idx in candidate_indices
])

# Create a temporary index to let FAISS handle scoring
d = candidate_embs.shape[1]
temp_index = self.index_type(d)
temp_index.add(candidate_embs)

# search() returns scores and indices in one call
scores, local_indices = temp_index.search(
    query_emb.reshape(1, -1).astype(np.float32),
    min(top_k, len(candidate_indices)),
)
```
But I think it just does a dot product. So it looks great as is; just mentioning this in case it's helpful.
Yes, I think that's better. I've added support for cosine and dot-product similarity, and the scores are nearly the same.
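As a sanity check on why the cosine and dot-product scores come out nearly the same: for L2-normalized embeddings the two are mathematically identical. A minimal standalone NumPy sketch (not mteb code):

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=16)  # query embedding
d = rng.normal(size=16)  # document embedding

cosine = (q @ d) / (np.linalg.norm(q) * np.linalg.norm(d))

# After L2-normalizing both vectors, a plain dot product equals cosine similarity
qn = q / np.linalg.norm(q)
dn = d / np.linalg.norm(d)
dot_normalized = qn @ dn

assert np.isclose(dot_normalized, cosine)
```

Models like `potion-base-2M` that emit (near-)normalized embeddings will therefore rank documents almost identically under either metric.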
Looks good. I would probably restructure it a bit: separate the implementations out from the protocol.
We also need to add documentation on these backends, as well as some discussion of the trade-offs between them.
```python
from mteb.types import Array, TopRankedDocumentsType

# ...

class IndexEncoderSearchProtocol(Protocol):
```
Suggested change:

```diff
-class IndexEncoderSearchProtocol(Protocol):
+class EncoderSearchProtocol(Protocol):
```
I think we should specify that this is only for indexes and only for encoders, because otherwise it could be confused with something that `SentenceTransformerEncoderWrapper` would (probably) implement.
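To illustrate the naming argument, here is a rough sketch of the protocol and a trivially conforming backend. The method names come from the diff context in this PR, but the signatures and the in-memory backend are assumptions for illustration, not the actual mteb definitions:

```python
from typing import Protocol, runtime_checkable

import numpy as np


@runtime_checkable
class IndexEncoderSearchProtocol(Protocol):
    """Protocol for search backends used in encoder-based retrieval."""

    def add_document(self, embeddings: np.ndarray, ids: list[str]) -> None:
        """Add a chunk of document embeddings (and their ids) to the index."""
        ...

    def search(
        self, query_embeddings: np.ndarray, top_k: int
    ) -> tuple[np.ndarray, list[list[str]]]:
        """Return top-k scores and document ids for each query."""
        ...


class InMemoryBackend:
    """Hypothetical minimal backend: any class with matching methods
    structurally satisfies the protocol, no inheritance needed."""

    def __init__(self) -> None:
        self._embs: list[np.ndarray] = []
        self._ids: list[str] = []

    def add_document(self, embeddings: np.ndarray, ids: list[str]) -> None:
        self._embs.append(embeddings)
        self._ids.extend(ids)

    def search(
        self, query_embeddings: np.ndarray, top_k: int
    ) -> tuple[np.ndarray, list[list[str]]]:
        corpus = np.vstack(self._embs)
        scores = query_embeddings @ corpus.T  # dot-product similarity
        order = np.argsort(-scores, axis=1)[:, :top_k]
        top_scores = np.take_along_axis(scores, order, axis=1)
        top_ids = [[self._ids[j] for j in row] for row in order]
        return top_scores, top_ids
```

Since `Protocol` matching is structural, a generic encoder wrapper with a `search` method but no `add_document` would not satisfy it, which is exactly the confusion the longer name avoids.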
```python
assert predictions == expected

# ...

@pytest.mark.parametrize(
```
Should this be placed here? I would say it belongs neither under abstasks nor under test_predictions.
Created a new folder, `test_search_index`.
Wouldn't you put it in test_models (to match the structure of the repo)?
Yes, will move it there then.

Yes, I wanted to add that after your check on the PR.
I've run this script and both evaluation methods took the same time, so I'm a bit unsure what to list as advantages of FAISS, except dumping the index, and we're clearing that after evaluation anyway.
```python
import logging

import mteb
from mteb.cache import ResultCache
from mteb.models import SearchEncoderWrapper
from mteb.models.search_encoder_index import FaissSearchIndex, StreamingSearchIndex

logging.basicConfig(level=logging.INFO)

### Streaming (default)
model = SearchEncoderWrapper(mteb.get_model("minishlab/potion-base-2M"))

tasks = mteb.get_tasks(
    tasks=[
        "ClimateFEVERHardNegatives",
        "SWEbenchVerifiedRR",
    ],
)

cache = ResultCache("stream")
mteb.evaluate(
    model,
    tasks,
    cache=cache,
)

### FAISS
index_backend = FaissSearchIndex(model)
model = SearchEncoderWrapper(
    mteb.get_model("minishlab/potion-base-2M"),
    index_backend=index_backend,
)
cache = ResultCache("FAISS")
mteb.evaluate(
    model,
    tasks,
    cache=cache,
)
```
I think faiss is not ideal for smaller reranking cases (~100-1000 docs to search over). We should see dramatic gains for retrieval though, with a large enough corpus. I asked Claude what it thinks we should do for reranking, and it suggested we retrieve the vectors from the index (as in the `reconstruct` snippet above). If large-scale retrieval is much faster, I think that's the main benefit.
I tried running it on …
```python
class IndexEncoderSearchProtocol(Protocol):
    """Protocol for search backends used in encoder-based retrieval."""

    def add_document(
```
Suggested change:

```diff
-    def add_document(
+    def add_documents(
```
```python
    **encode_kwargs,
)

self.index_backend.add_document(sub_corpus_embeddings, sub_corpus_ids)
```
Shouldn't these be added in the index step? (I suspect it's here because of the streaming backend.)
Yes, but I don't know how to make it compatible with streaming without creating new functions. If we want to use the index step, we need duplicate functions with similar logic. Maybe we can create a different wrapper that uses the index directly, without hacks.
Hmm, can you outline the current streaming approach, just to clarify why it is the way it is now? (Just to make sure we agree on the problem.)
The current "streaming" approach doesn't store the full corpus embeddings for retrieval. It processes the corpus in chunks of 50,000 documents and then computes similarities between those documents and the queries. Because of that, we add `sub_corpus_embeddings` (the chunk embeddings) here. If we move document encoding into the index step, we would need to store all document embeddings, which would cause problems with large datasets like MIRACL or MS MARCO.
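For clarity, a hypothetical NumPy-only sketch of the chunked approach described above (the function name, signature, and running top-k merge are mine, not the actual mteb implementation). It holds at most one chunk of embeddings plus a per-query top-k in memory at any time:

```python
import numpy as np


def streaming_search(encode, corpus, query_embs, top_k, chunk_size=50_000):
    """Encode the corpus in chunks and keep only a running top-k per query,
    so the full set of corpus embeddings is never stored at once."""
    n_queries = query_embs.shape[0]
    best_scores = np.empty((n_queries, 0))
    best_ids = np.empty((n_queries, 0), dtype=int)
    for start in range(0, len(corpus), chunk_size):
        chunk_embs = encode(corpus[start:start + chunk_size])  # (c, d)
        scores = query_embs @ chunk_embs.T                     # dot-product similarity
        ids = np.broadcast_to(
            np.arange(start, start + chunk_embs.shape[0]), scores.shape
        )
        # Merge this chunk with the running best, then re-truncate to top_k
        all_scores = np.concatenate([best_scores, scores], axis=1)
        all_ids = np.concatenate([best_ids, ids], axis=1)
        order = np.argsort(-all_scores, axis=1)[:, :top_k]
        best_scores = np.take_along_axis(all_scores, order, axis=1)
        best_ids = np.take_along_axis(all_ids, order, axis=1)
    return best_scores, best_ids
```

Moving encoding into a separate index step would require either keeping every `chunk_embs` around (the memory problem mentioned above) or persisting them to disk, which is essentially what a FAISS index would provide.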
Close #3406

I've implemented the SearchEncoder protocol, and either FAISS or direct search (as before) can now be selected.

mteb/mteb/models/search_encoder_index/search_backend_protocol.py
Lines 7 to 29 in c8b2bd3

I've kept the "batched" approach for retrieval so that less memory is used during evaluation. The backend can be changed by passing `index_backend` to `SearchEncoderWrapper`.

I've tested on SciFact using `potion-2M` and got a 2s evaluation for the default search and 3s for FAISS.

Script to test