Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WTF? Two versions of QueryEncoder and two versions of AutoQueryEncoder #2009

Open
lintool opened this issue Oct 8, 2024 · 3 comments
Open

Comments

@lintool
Copy link
Member

lintool commented Oct 8, 2024

We have two versions of QueryEncoder and two versions of AutoQueryEncoder. One set of classes is in pyserini.search.faiss, the other set is in pyserini.encode.

(I'm pointing to code at commit just prior to #1997 - because I think that commit breaks a number of things.)

@MXueguang has clearly stated in #1728 that the version in pyserini.encode is actually the one we should use. This is true, because only that version has the option to correctly handle the query prefix, which is needed for BGE to work correctly. However, the QueryEncoder in pyserini.search.faiss is the one that actually works, because only that version downloads query encodings.

So, here's the puzzle: How did we get into a state where we're using AutoQueryEncoder in pyserini.encode but QueryEncoder in pyserini.search.faiss, where the code is so crazily intertwined, and all the regressions pass?

Here's the crazy answer:

In pyserini/search/faiss/__main__.py, this is the import statement:

from pyserini.search import FaissSearcher, BinaryDenseSearcher, TctColBertQueryEncoder, QueryEncoder, \
    DprQueryEncoder, BprQueryEncoder, DkrrDprQueryEncoder, AnceQueryEncoder, AggretrieverQueryEncoder, DenseVectorAveragePrf, \
    DenseVectorRocchioPrf, DenseVectorAncePrf, OpenAIQueryEncoder, ClipQueryEncoder

from pyserini.encode import PcaEncoder, CosDprQueryEncoder, AutoQueryEncoder

So the Faiss searcher is getting most of the models from pyserini.search, and if you trace the imports to pyserini/search/__init__.py, we see the imports "loop back to itself":

from .faiss import DenseSearchResult, PRFDenseSearchResult, FaissSearcher, BinaryDenseSearcher, QueryEncoder, \
    DprQueryEncoder, BprQueryEncoder, DkrrDprQueryEncoder, TctColBertQueryEncoder, AnceQueryEncoder, AggretrieverQueryEncoder, AutoQueryEncoder, ClipQueryEncoder
from .faiss import AnceEncoder
from .faiss import DenseVectorAveragePrf, DenseVectorRocchioPrf, DenseVectorAncePrf
from .faiss import OpenAIQueryEncoder

Which means that for most of the encoder classes, the implementations in pyserini/search/faiss/_searcher.py are used.

Except for AutoQueryEncoder (and CosDprQueryEncoder, but that's an aside).

So now that we understand what's going on, it's probably easier to fix. This also means that #1997 is broken, because it uses the wrong implementation of AutoQueryEncoder.

@MXueguang
Copy link
Member

We actually also need to take care about imports in https://github.com/castorini/pyserini/blob/master/pyserini/encode/query.py
What do you think on this?

@lintool
Copy link
Member Author

lintool commented Oct 8, 2024

Yes, definitely, that will need refactoring. First this though: #2008

@lintool
Copy link
Member Author

lintool commented Oct 15, 2024

Leaving open because we still need to think about what to do with:

  • pyserini/encode/query.py
  • pyserini/encode/__main__.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants