Conversation

@Samoed (Member) commented Oct 25, 2025

Closes #3406

I've implemented a SearchEncoder protocol, so either FAISS or direct search (the previous behavior) can now be selected:

from typing import Callable, Protocol

from mteb.types import Array, TopRankedDocumentsType


class IndexEncoderSearchProtocol(Protocol):
    """Protocol for search backends used in encoder-based retrieval."""

    def add_document(
        self,
        embeddings: Array,
        idxs: list[str],
    ) -> None:
        """Add documents to the search backend.

        Args:
            embeddings: Embeddings of the documents to add.
            idxs: IDs of the documents to add.
        """
        ...

    def search(
        self,
        embeddings: Array,
        top_k: int,
        similarity_fn: Callable[[Array, Array], Array],
        top_ranked: TopRankedDocumentsType | None = None,
        query_idx_to_id: dict[int, str] | None = None,
    ) -> tuple[list[list[float]], list[list[int]]]:
        """Search the index for the top-k documents per query."""
        ...
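For illustration, a minimal in-memory backend satisfying this protocol could look like the sketch below. The class name InMemorySearchBackend and its internals are hypothetical (this is not the PR's DefaultEncoderSearchBackend), and the top_ranked / query_idx_to_id reranking arguments are ignored for brevity:

from typing import Callable

import numpy as np

from mteb.types import Array, TopRankedDocumentsType


class InMemorySearchBackend:
    """Hypothetical minimal backend: keeps all embeddings in memory."""

    def __init__(self) -> None:
        self._embeddings: list[Array] = []
        self._ids: list[str] = []

    def add_document(self, embeddings: Array, idxs: list[str]) -> None:
        # Accumulate chunk embeddings and their document IDs.
        self._embeddings.append(embeddings)
        self._ids.extend(idxs)

    def search(
        self,
        embeddings: Array,
        top_k: int,
        similarity_fn: Callable[[Array, Array], Array],
        top_ranked: TopRankedDocumentsType | None = None,
        query_idx_to_id: dict[int, str] | None = None,
    ) -> tuple[list[list[float]], list[list[int]]]:
        corpus = np.vstack(self._embeddings)
        scores = np.asarray(similarity_fn(embeddings, corpus))  # (n_queries, n_docs)
        top_idx = np.argsort(-scores, axis=1)[:, :top_k]
        top_scores = np.take_along_axis(scores, top_idx, axis=1)
        return top_scores.tolist(), top_idx.tolist()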

I've kept the "batched" approach for retrieval so that less memory is used during evaluation. The backend can be changed like this:

import mteb
from mteb.models.search_encoder_index import (
    DefaultEncoderSearchBackend,
    FaissEncoderSearchBackend,
)
from mteb.models import SearchEncoderWrapper


model = mteb.get_model("baseline/random-encoder-baseline")

python_backend = SearchEncoderWrapper(
    model, index_backend=DefaultEncoderSearchBackend()
)
faiss_backend = SearchEncoderWrapper(
    model, index_backend=FaissEncoderSearchBackend(model)
)

I've tested on SciFact using potion-base-2M and got a 2 s evaluation for default search and 3 s for FAISS.

Script to test
import mteb
from mteb.cache import ResultCache
from mteb.models import SearchEncoderWrapper
from mteb.models.search_encoder_index import (
    DefaultEncoderSearchBackend,
    FaissEncoderSearchBackend,
)

model = mteb.get_model("minishlab/potion-base-2M")

python_backend = SearchEncoderWrapper(
    model, index_backend=DefaultEncoderSearchBackend()
)
faiss_backend = SearchEncoderWrapper(
    model, index_backend=FaissEncoderSearchBackend(model)
)

task = mteb.get_task("SciFact")

python_cache = ResultCache("python_backend_cache")
faiss_cache = ResultCache("faiss_backend_cache")

# warmup
mteb.evaluate(
    model,
    task,
    cache=None,
)

mteb.evaluate(
    python_backend,
    task,
    cache=python_cache,
)

mteb.evaluate(
    faiss_backend,
    task,
    cache=faiss_cache,
)

@Samoed changed the title from "add search backend" to "feat: add search encoder backend" on Oct 25, 2025
@orionw (Contributor) left a comment

Looks great overall!

I think it doesn't take advantage of FAISS's built-in scoring functionality, but by not doing that we have more control (so that can be ignored). If we wanted to use FAISS for scoring, we could do something like:

# Batch-reconstruct candidate embeddings from the existing index
candidate_embs = np.vstack([
    self.index.reconstruct(idx) for idx in candidate_indices
])

# Build a temporary index over just the candidates so FAISS handles scoring
d = candidate_embs.shape[1]
temp_index = self.index_type(d)
temp_index.add(candidate_embs)

# A single search call returns scores and (candidate-local) indices
scores, local_indices = temp_index.search(
    query_emb.reshape(1, -1).astype(np.float32),
    min(top_k, len(candidate_indices))
)

But I think it only does dot product. So it looks great as is; just mentioning this in case it's helpful.

@Samoed (Member, Author) commented Oct 27, 2025

Yes, I think that's better. I've added support for cosine and dot-product similarity, and the scores are nearly the same (identical up to 1e-6).
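For reference, the standard way to support both metrics with a FAISS inner-product index is to L2-normalize the embeddings for cosine and skip normalization for dot product. A minimal sketch, assuming a flat index (this is not the PR's exact code):

import faiss
import numpy as np


def build_index(embeddings: np.ndarray, similarity: str) -> faiss.IndexFlatIP:
    """Build a flat inner-product index; normalize first for cosine."""
    embs = np.ascontiguousarray(embeddings, dtype=np.float32)
    if similarity == "cosine":
        faiss.normalize_L2(embs)  # in-place; inner product on unit vectors == cosine
    index = faiss.IndexFlatIP(embs.shape[1])
    index.add(embs)
    return index

Queries would have to be normalized the same way before index.search in the cosine case.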

@KennethEnevoldsen (Contributor) left a comment

Looks good! I would probably restructure it a bit.

I would probably separate out the implementations from the protocol.

We also need to add documentation on these backends as well as some discussion on the trade-offs between them.

from mteb.types import Array, TopRankedDocumentsType


class IndexEncoderSearchProtocol(Protocol):
Contributor:

Suggested change:
-class IndexEncoderSearchProtocol(Protocol):
+class EncoderSearchProtocol(Protocol):

@Samoed (Member, Author):

I think we should specify that this is only for an index and only for encoders, because otherwise it could be confused with something that SentenceTransformerEncoderWrapper would implement (probably).

assert predictions == expected


@pytest.mark.parametrize(
Contributor:

Should this be placed here? I would say it belongs neither under abstasks nor test_predictions.

@Samoed (Member, Author):

Created a new folder, test_search_index.

Contributor:

Wouldn't you put it in test_models (to match the structure of the repo)?

@Samoed (Member, Author):

Yes, I'll move it there then.

@Samoed (Member, Author) commented Oct 27, 2025

We also need to add documentation

Yes, I wanted to add it after your review of the PR.

@Samoed (Member, Author) commented Oct 28, 2025

I've run this script and both evaluation methods took about the same time, so I'm a bit unsure what to list as advantages of FAISS, except for dumping the index, but we clear it after evaluation.

Task                        Stream (s)   FAISS (s)
SWEbenchVerifiedRR          536          541
ClimateFEVERHardNegatives   9            12

Script:
import logging

import mteb
from mteb.cache import ResultCache
from mteb.models import SearchEncoderWrapper
from mteb.models.search_encoder_index import StreamingSearchIndex, FaissSearchIndex

logging.basicConfig(level=logging.INFO)

model = SearchEncoderWrapper(mteb.get_model("minishlab/potion-base-2M"))
tasks = mteb.get_tasks(
    tasks=[
        "ClimateFEVERHardNegatives",
        "SWEbenchVerifiedRR",
    ],
)

cache = ResultCache("stream")

mteb.evaluate(
    model,
    tasks,
    cache=cache,
)

### FAISS
index_backend = FaissSearchIndex(model)
model = SearchEncoderWrapper(
    mteb.get_model("minishlab/potion-base-2M"),
    index_backend=index_backend
)
cache = ResultCache("FAISS")

mteb.evaluate(
    model,
    tasks,
    cache=cache,
)

@orionw (Contributor) commented Oct 28, 2025

I think FAISS is not ideal for smaller reranking cases (~100-1000 docs to search over). We should see dramatic gains for retrieval, though, with a large enough corpus. For ClimateFEVERHardNegatives it could just be initialization differences. Maybe try MS MARCO for retrieval?

I asked Claude what it thinks we should do for reranking, and it suggested retrieving the vectors from FAISS for reranking but using standard numpy afterwards. We could do this, but if using FAISS is roughly the same, we might as well keep what we have for that.

If large-scale retrieval is much faster, I think that's the main benefit.

@Samoed (Member, Author) commented Oct 29, 2025

I tried running it on MS MARCO, and both backends showed similar times on sub-batches. If we removed the search over each sub-corpus batch, FAISS would probably show a speedup, but I'm not sure how to do that while still supporting the "streaming" backend.
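For contrast, the non-streaming pattern where FAISS could actually win would index the whole corpus once and issue a single batched search. A hypothetical sketch (retrieve_all_at_once and the shapes are illustrative, not code from this PR):

import faiss
import numpy as np


def retrieve_all_at_once(
    corpus_embs: np.ndarray,  # (n_docs, d) -- the whole corpus held in memory
    query_embs: np.ndarray,   # (n_queries, d)
    top_k: int,
) -> tuple[np.ndarray, np.ndarray]:
    index = faiss.IndexFlatIP(corpus_embs.shape[1])
    index.add(np.ascontiguousarray(corpus_embs, dtype=np.float32))
    # One batched search over the full corpus; no per-chunk Python loop.
    return index.search(np.ascontiguousarray(query_embs, dtype=np.float32), top_k)

The catch is that this requires holding all corpus embeddings at once, which is exactly what the streaming backend avoids.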

class IndexEncoderSearchProtocol(Protocol):
    """Protocol for search backends used in encoder-based retrieval."""

    def add_document(
Contributor:

Suggested change:
-def add_document(
+def add_documents(

**encode_kwargs,
)

self.index_backend.add_document(sub_corpus_embeddings, sub_corpus_ids)
@KennethEnevoldsen (Contributor) commented Oct 30, 2025:

Shouldn't this be added in the index step? (I suspect it's here because of the streaming backend.)

@Samoed (Member, Author) commented Oct 30, 2025:

Yes, but I don't know how to make that compatible with streaming without creating new functions. If we want to use the index step, we'd need duplicate functions with similar logic. Maybe we could create a different wrapper that uses the index directly, without hacks.

Contributor:

Hmm, can you outline the current approach in the streaming backend, just to clarify why it is the way it is now? (Just to make sure we agree on the problem.)

@Samoed (Member, Author):

The current "streaming" approach doesn't store the full corpus embeddings for retrieval. It processes the corpus in chunks of 50_000 documents and then computes similarities between those documents and the queries. That's why we add sub_corpus_embeddings here (the chunk embeddings). If we moved document encoding into the index step, we would need to store all document embeddings, which would cause problems with large datasets like MIRACL or MS MARCO.
