Conversation

@Samoed (Member) commented Oct 25, 2025

Closes #3406

I've implemented a SearchEncoder protocol, so either FAISS or direct search (the previous behavior) can now be selected:

from typing import Callable, Protocol

from mteb.types import Array, TopRankedDocumentsType


class IndexEncoderSearchProtocol(Protocol):
    """Protocol for search backends used in encoder-based retrieval."""

    def add_document(
        self,
        embeddings: Array,
        idxs: list[str],
    ) -> None:
        """Add documents to the search backend.

        Args:
            embeddings: Embeddings of the documents to add.
            idxs: IDs of the documents to add.
        """
        ...

    def search(
        self,
        embeddings: Array,
        top_k: int,
        similarity_fn: Callable[[Array, Array], Array],
        top_ranked: TopRankedDocumentsType | None = None,
        query_idx_to_id: dict[int, str] | None = None,
    ) -> tuple[list[list[float]], list[list[int]]]:
        """Search the index for the top-k documents per query."""
        ...
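For illustration, a minimal in-memory backend satisfying this protocol could look like the sketch below. The class name InMemorySearchBackend and its internals are hypothetical (this is not the PR's DefaultEncoderSearchBackend), and the top_ranked / query_idx_to_id reranking arguments are ignored for brevity:

from typing import Callable

import numpy as np

from mteb.types import Array, TopRankedDocumentsType


class InMemorySearchBackend:
    """Hypothetical minimal backend: keeps all embeddings in memory."""

    def __init__(self) -> None:
        self._embeddings: list[Array] = []
        self._ids: list[str] = []

    def add_document(self, embeddings: Array, idxs: list[str]) -> None:
        # Accumulate chunk embeddings and their document IDs.
        self._embeddings.append(embeddings)
        self._ids.extend(idxs)

    def search(
        self,
        embeddings: Array,
        top_k: int,
        similarity_fn: Callable[[Array, Array], Array],
        top_ranked: TopRankedDocumentsType | None = None,
        query_idx_to_id: dict[int, str] | None = None,
    ) -> tuple[list[list[float]], list[list[int]]]:
        corpus = np.vstack(self._embeddings)
        scores = np.asarray(similarity_fn(embeddings, corpus))  # (n_queries, n_docs)
        top_idx = np.argsort(-scores, axis=1)[:, :top_k]
        top_scores = np.take_along_axis(scores, top_idx, axis=1)
        return top_scores.tolist(), top_idx.tolist()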

I've kept the "batched" approach for retrieval so that less memory is used during evaluation. The backend can be changed like this:

import mteb
from mteb.models.search_encoder_index import (
    DefaultEncoderSearchBackend,
    FaissEncoderSearchBackend,
)
from mteb.models import SearchEncoderWrapper


model = mteb.get_model("baseline/random-encoder-baseline")

python_backend = SearchEncoderWrapper(
    model, index_backend=DefaultEncoderSearchBackend()
)
faiss_backend = SearchEncoderWrapper(
    model, index_backend=FaissEncoderSearchBackend(model)
)

I've tested on SciFact using potion-base-2M and got a 2 s evaluation for default search and 3 s for FAISS.

Script to test
import mteb
from mteb.cache import ResultCache
from mteb.models import SearchEncoderWrapper
from mteb.models.search_encoder_index import (
    DefaultEncoderSearchBackend,
    FaissEncoderSearchBackend,
)

model = mteb.get_model("minishlab/potion-base-2M")

python_backend = SearchEncoderWrapper(
    model, index_backend=DefaultEncoderSearchBackend()
)
faiss_backend = SearchEncoderWrapper(
    model, index_backend=FaissEncoderSearchBackend(model)
)

task = mteb.get_task("SciFact")

python_cache = ResultCache("python_backend_cache")
faiss_cache = ResultCache("faiss_backend_cache")

# warmup
mteb.evaluate(
    model,
    task,
    cache=None,
)

mteb.evaluate(
    python_backend,
    task,
    cache=python_cache,
)

mteb.evaluate(
    faiss_backend,
    task,
    cache=faiss_cache,
)

@Samoed changed the title from "add search backend" to "feat: add search encoder backend" on Oct 25, 2025
@orionw (Contributor) left a comment

Looks great overall!

I think it doesn't take advantage of FAISS's built-in scoring functionality, but by not doing that we have more control (so that can be ignored). If we wanted to use FAISS for scoring, we could do something like:

# Batch-reconstruct candidate embeddings from the existing index
candidate_embs = np.vstack([
    self.index.reconstruct(idx) for idx in candidate_indices
])

# Build a temporary index over just the candidates so FAISS handles scoring
d = candidate_embs.shape[1]
temp_index = self.index_type(d)
temp_index.add(candidate_embs)

# A single search call returns scores and (candidate-local) indices
scores, local_indices = temp_index.search(
    query_emb.reshape(1, -1).astype(np.float32),
    min(top_k, len(candidate_indices))
)

But I think it only does dot product. So it looks great as is; just mentioning this in case it's helpful.

@Samoed (Member, Author) commented Oct 27, 2025

Yes, I think that's better. I've added support for cosine and dot-product similarity, and the scores are nearly the same (identical up to 1e-6).
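For reference, the standard way to support both metrics with a FAISS inner-product index is to L2-normalize the embeddings for cosine and skip normalization for dot product. A minimal sketch, assuming a flat index (this is not the PR's exact code):

import faiss
import numpy as np


def build_index(embeddings: np.ndarray, similarity: str) -> faiss.IndexFlatIP:
    """Build a flat inner-product index; normalize first for cosine."""
    embs = np.ascontiguousarray(embeddings, dtype=np.float32)
    if similarity == "cosine":
        faiss.normalize_L2(embs)  # in-place; inner product on unit vectors == cosine
    index = faiss.IndexFlatIP(embs.shape[1])
    index.add(embs)
    return index

Queries would have to be normalized the same way before index.search in the cosine case.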

@KennethEnevoldsen (Contributor) left a comment

Looks good! I would probably restructure it a bit.

I would probably separate out the implementations from the protocol.

We also need to add documentation on these backends as well as some discussion on the trade-offs between them.

from mteb.types import Array, TopRankedDocumentsType


class IndexEncoderSearchProtocol(Protocol):
Contributor:

Suggested change:
-class IndexEncoderSearchProtocol(Protocol):
+class EncoderSearchProtocol(Protocol):

@Samoed (Member, Author):

I think we should specify that this is only for an index and only for encoders, because otherwise it could be confused with something that SentenceTransformerEncoderWrapper would implement (probably).

assert predictions == expected


@pytest.mark.parametrize(
Contributor:

Should this be placed here? I would say it belongs neither under abstasks nor test_predictions.

@Samoed (Member, Author):

Created a new folder, test_search_index.

Contributor:

Wouldn't you put it in test_models (to match the structure of the repo)?

@Samoed (Member, Author):

Yes, I'll move it there then.

@Samoed (Member, Author) commented Oct 27, 2025

We also need to add documentation

Yes, I wanted to add it after your review of the PR.

@Samoed (Member, Author) commented Oct 28, 2025

I've run this script and both evaluation methods took about the same time, so I'm a bit unsure what to list as advantages of FAISS, except for dumping the index, but we clear it after evaluation.

Task                        Stream (s)   FAISS (s)
SWEbenchVerifiedRR          536          541
ClimateFEVERHardNegatives   9            12

Script:
import logging

import mteb
from mteb.cache import ResultCache
from mteb.models import SearchEncoderWrapper
from mteb.models.search_encoder_index import StreamingSearchIndex, FaissSearchIndex

logging.basicConfig(level=logging.INFO)

model = SearchEncoderWrapper(mteb.get_model("minishlab/potion-base-2M"))
tasks = mteb.get_tasks(
    tasks=[
        "ClimateFEVERHardNegatives",
        "SWEbenchVerifiedRR",
    ],
)

cache = ResultCache("stream")

mteb.evaluate(
    model,
    tasks,
    cache=cache,
)

### FAISS
index_backend = FaissSearchIndex(model)
model = SearchEncoderWrapper(
    mteb.get_model("minishlab/potion-base-2M"),
    index_backend=index_backend
)
cache = ResultCache("FAISS")

mteb.evaluate(
    model,
    tasks,
    cache=cache,
)

@orionw (Contributor) commented Oct 28, 2025

I think FAISS is not ideal for smaller reranking cases (~100-1000 docs to search over). We should see dramatic gains for retrieval, though, with a large enough corpus. For ClimateFEVERHardNegatives it could just be initialization differences. Maybe try MS MARCO for retrieval?

I asked Claude what it thinks we should do for reranking, and it suggested retrieving the vectors from FAISS for reranking but using standard numpy afterwards. We could do this, but if using FAISS is roughly the same, we might as well keep what we have for that.

If large-scale retrieval is much faster, I think that's the main benefit.

@Samoed (Member, Author) commented Oct 29, 2025

I tried running it on MS MARCO, and both backends showed similar times on sub-batches. If we removed the search over each sub-corpus batch, FAISS would probably show a speedup, but I'm not sure how to do that while still supporting the "streaming" backend.
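For contrast, the non-streaming pattern where FAISS could actually win would index the whole corpus once and issue a single batched search. A hypothetical sketch (retrieve_all_at_once and the shapes are illustrative, not code from this PR):

import faiss
import numpy as np


def retrieve_all_at_once(
    corpus_embs: np.ndarray,  # (n_docs, d) -- the whole corpus held in memory
    query_embs: np.ndarray,   # (n_queries, d)
    top_k: int,
) -> tuple[np.ndarray, np.ndarray]:
    index = faiss.IndexFlatIP(corpus_embs.shape[1])
    index.add(np.ascontiguousarray(corpus_embs, dtype=np.float32))
    # One batched search over the full corpus; no per-chunk Python loop.
    return index.search(np.ascontiguousarray(query_embs, dtype=np.float32), top_k)

The catch is that this requires holding all corpus embeddings at once, which is exactly what the streaming backend avoids.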

class IndexEncoderSearchProtocol(Protocol):
    """Protocol for search backends used in encoder-based retrieval."""

    def add_document(
Contributor:

Suggested change:
-def add_document(
+def add_documents(

**encode_kwargs,
)

self.index_backend.add_document(sub_corpus_embeddings, sub_corpus_ids)
@KennethEnevoldsen (Contributor) commented Oct 30, 2025:

Shouldn't this be added in the index step? (I suspect it's here because of the streaming backend.)

@Samoed (Member, Author) commented Oct 30, 2025:

Yes, but I don't know how to make that compatible with streaming without creating new functions. If we want to use the index step, we'd need duplicate functions with similar logic. Maybe we could create a different wrapper that uses the index directly, without hacks.

Contributor:

Hmm, can you outline the current approach in the streaming backend, just to clarify why it is the way it is now? (Just to make sure we agree on the problem.)

@Samoed (Member, Author):

The current "streaming" approach doesn't store the full corpus embeddings for retrieval. It processes the corpus in chunks of 50_000 documents and then computes similarities between those documents and the queries. That's why we add sub_corpus_embeddings here (the chunk embeddings). If we moved document encoding into the index step, we would need to store all document embeddings, which would cause problems with large datasets like MIRACL or MS MARCO.
