feat: implement Reciprocal Rank Fusion (RRF) for hybrid search ranking #93
zohaib-7035 wants to merge 3 commits into INCF:main
Conversation
QuantumByte-01 left a comment
The RRF algorithm in rrf.py is correctly implemented and the tests are solid. Three issues to fix:
1. Move import to top of file (minor)
`from rrf import reciprocal_rank_fusion` is placed mid-file, after `execute_search`. Move it to the top with the other imports.
2. ID overwrite in global_fuzzy_keyword_search breaks RRF (critical)
```python
for i, item in enumerate(out):
    item["_id"] = f"fuzzy_{i}"
    item["id"] = f"fuzzy_{i}"
```

This overwrites the real dataset IDs with sequential placeholders. RRF boosts documents that appear in multiple lists, but with `fuzzy_0`, `fuzzy_1` IDs, no fuzzy result will ever match a vector or KS result by ID. The cross-list boosting, the entire point of RRF, is disabled. Remove these lines and keep the original IDs from the API response.
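A minimal sketch of the suggested fix, assuming the raw API items carry their real identifiers under an `_id` or `id` field (field names taken from the snippet above; the helper name `normalize_fuzzy_results` is hypothetical):

```python
def normalize_fuzzy_results(out: list[dict]) -> list[dict]:
    """Keep the dataset IDs returned by the API so RRF can match the
    same document across the fuzzy, vector, and KS result lists."""
    normalized = []
    for item in out:
        # Do NOT overwrite with sequential placeholders like f"fuzzy_{i}":
        # RRF boosts documents that appear in multiple lists, and placeholder
        # IDs can never match the IDs coming from the other retrieval paths.
        if "id" not in item and "_id" in item:
            item["id"] = item["_id"]  # fall back to the raw ES-style _id
        normalized.append(item)
    return normalized
```

With real IDs preserved, a dataset surfaced by both the fuzzy and the vector path keeps a single identity and receives the cross-list boost RRF is designed to give.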
3. global_fuzzy_keyword_search rewrite eliminates a distinct data source
The original implementation used datasources_config.json for field-value fuzzy matching — a genuinely different retrieval path from the public API. The new version just calls general_search() with OR-joined keywords, which is nearly identical to what KSSearchAgent already does via general_search_async(). This results in two near-duplicate API calls and loses the local structured search entirely. Please justify this change or restore the original approach alongside the API call.
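One way to keep the local structured path distinct from the API call is a fuzzy field-value scan over the config file. This is a hypothetical sketch only: the helper name, the flat list-of-dicts shape of `datasources_config.json`, and the 0.75 threshold are all assumptions, and `difflib` stands in for whatever fuzzy matcher the original code used.

```python
import json
from difflib import SequenceMatcher

def local_fuzzy_search(keywords: list[str],
                       config_path: str = "datasources_config.json",
                       threshold: float = 0.75) -> list[dict]:
    """Fuzzy field-value matching over the local datasource config,
    kept as a retrieval path distinct from the public API."""
    with open(config_path) as f:
        sources = json.load(f)
    hits = []
    for entry in sources:
        for value in entry.values():
            if not isinstance(value, str):
                continue
            # Best similarity of this field value against any query keyword.
            score = max(SequenceMatcher(None, kw.lower(), value.lower()).ratio()
                        for kw in keywords)
            if score >= threshold:
                hits.append({"id": entry.get("id"), "score": score})
                break  # one hit per entry is enough
    return sorted(hits, key=lambda h: h["score"], reverse=True)
```

Running this alongside (not instead of) `general_search()` would preserve both retrieval paths and give RRF two genuinely different lists to fuse.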
…ove ID overwrite, restore local fuzzy search
Hi @QuantumByte-01,
What does this PR do?
Implements Reciprocal Rank Fusion (RRF) to seamlessly combine keyword search results (Elasticsearch BM25) and semantic vector search results (Embeddings) into a single, high-quality ranked list. This addresses Issue #13 and aligns directly with the GSoC 2026 "Advanced RAG" project goals.
Why is this necessary?
Previously, the fuse_results logic merged the raw scores using a hardcoded weight formula (0.6 × similarity + 0.4 × keyword_score). Because these two scoring methods operate on completely different mathematical scales and distributions, the rankings were heavily skewed.

RRF resolves this by relying on pure positional ranking. Instead of comparing arbitrary scores, it grades datasets based on their rank placement (1 / (60 + rank)). This penalizes datasets that only match loosely in one system, and massively boosts datasets that appear highly ranked in both keyword and semantic contexts.

How was it implemented?
- Implemented `reciprocal_rank_fusion` with k=60. Includes safe ID extraction logic to handle differences between raw ES hits and Vertex AI vectors.
- Added a `pytest` test suite ensuring correct mathematical ranking, cross-list overlap boosting, and edge-case safety (empty lists, unaligned IDs).
- Passed `mypy` and `flake8` to adhere strictly to Python typing and PEP8 conventions.
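The positional scoring described above can be sketched as follows. This is not necessarily the PR's exact code in `rrf.py`; the result-dict shape (`"id"` key) and signature are assumptions, but the 1 / (k + rank) formula with k=60 matches the description:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[dict]], k: int = 60) -> list[dict]:
    """Fuse ranked lists by summing 1 / (k + rank) per document, so items
    ranked highly in both keyword and vector lists float to the top."""
    scores: dict[str, float] = defaultdict(float)
    docs: dict[str, dict] = {}
    for results in result_lists:
        for rank, item in enumerate(results, start=1):
            doc_id = item["id"]
            scores[doc_id] += 1.0 / (k + rank)   # pure positional score
            docs.setdefault(doc_id, item)        # keep first-seen payload
    # Highest fused score first.
    return [docs[i] for i in sorted(scores, key=scores.get, reverse=True)]
```

For example, a document ranked 2nd in the keyword list and 1st in the vector list scores 1/62 + 1/61, beating any document that appears in only one list, which is exactly the cross-list boosting the review above depends on.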