
feat: implement Reciprocal Rank Fusion (RRF) for hybrid search ranking #93

Open
zohaib-7035 wants to merge 3 commits into INCF:main from zohaib-7035:feature/hybrid-search-rrf

@zohaib-7035

What does this PR do?

Implements Reciprocal Rank Fusion (RRF) to combine keyword search results (Elasticsearch BM25) and semantic vector search results (embeddings) into a single ranked list. This addresses Issue #13 and aligns with the GSoC 2026 "Advanced RAG" project goals.

Why is this necessary?

Previously, the fuse_results logic merged the raw scores using a hardcoded weight formula (0.6 × similarity + 0.4 × keyword_score). Because the two scores are on different scales and follow different distributions (BM25 scores are unbounded, while embedding similarity is typically bounded), the weighted sum was skewed toward whichever score had the larger magnitude.
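To make the scale problem concrete, here is a minimal sketch of the old weighted formula applied to made-up scores. The function name `weighted_score` and the numbers are illustrative assumptions, not code or data from the repository; the sketch also assumes similarity is a cosine-style value of at most 1.

```python
# Illustrative only: hypothetical scores, not taken from the real backend.
# Assumption: `similarity` is cosine-style (at most 1), `keyword_score` is a raw BM25 score.
def weighted_score(similarity: float, keyword_score: float) -> float:
    """The legacy fusion formula described above."""
    return 0.6 * similarity + 0.4 * keyword_score


# A dataset that is a very strong semantic match but a weak keyword match.
strong_semantic = weighted_score(similarity=0.95, keyword_score=2.0)   # about 1.37

# A dataset that is a weak semantic match but has a large BM25 score.
strong_keyword = weighted_score(similarity=0.30, keyword_score=12.0)   # about 4.98

# The BM25 term dominates even though it carries the smaller weight (0.4).
print(strong_semantic, strong_keyword)
```

With these numbers the keyword-heavy dataset wins by a wide margin, regardless of how good the semantic match is, because the BM25 term is an order of magnitude larger than the similarity term.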

RRF resolves this by relying only on rank position. Instead of comparing raw scores, it gives each dataset a contribution of 1 / (60 + rank) for every result list it appears in and sums those contributions. A dataset that appears in only one list receives a single contribution, while a dataset ranked highly in both the keyword and semantic lists accumulates the highest fused score.
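The scoring rule above can be sketched in a few lines. This is a simplified illustration, not the code in backend/rrf.py: the name `rrf_sketch`, its signature, and the dataset IDs are assumptions made for the example, and it works on plain lists of IDs rather than on ES hits or vector results.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rrf_sketch(ranked_lists: Sequence[Sequence[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse several ranked lists of IDs; each appearance adds 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        # Ranks are 1-based: the top result of each list has rank 1.
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical dataset IDs, ordered best-first by each retriever.
keyword_hits = ["ds_a", "ds_b", "ds_c"]
vector_hits = ["ds_b", "ds_d", "ds_a"]

fused = rrf_sketch([keyword_hits, vector_hits])
fused_ids = [doc_id for doc_id, _ in fused]
print(fused_ids)
```

Here ds_b (ranks 2 and 1) and ds_a (ranks 1 and 3) appear in both lists and come out ahead of ds_d and ds_c, which each appear in only one.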

How was it implemented?

  1. Core Algorithm (backend/rrf.py): Added a standalone implementation of the RRF algorithm using the standard smoothing constant k=60. It includes ID extraction logic that handles the differences between raw ES hits and Vertex AI vector results.
  2. Integration (backend/agents.py): Replaced the legacy score-merging loop in fuse_results with the new RRF module, leaving the outer API surface unchanged.
  3. Testing (backend/tests/test_rrf.py): Added a pytest suite covering rank ordering, cross-list overlap boosting, and edge cases (empty lists, unaligned IDs).
  4. Code Quality: Verified that mypy and flake8 pass, in line with Python typing and PEP 8 conventions.

Collaborator

@QuantumByte-01 left a comment

The RRF algorithm in rrf.py is correctly implemented and the tests are solid. Three issues to fix:

1. Move import to top of file (minor)
from rrf import reciprocal_rank_fusion is placed mid-file after execute_search. Move it to the top with the other imports.

2. ID overwrite in global_fuzzy_keyword_search breaks RRF (critical)

for i, item in enumerate(out):
    item["_id"] = f"fuzzy_{i}"
    item["id"] = f"fuzzy_{i}"

This overwrites the real dataset IDs with sequential placeholders. RRF boosts documents that appear in multiple lists — but with fuzzy_0, fuzzy_1 IDs, no fuzzy result will ever match a vector or KS result by ID. The cross-list boosting — the entire point of RRF — is disabled. Remove these lines and keep the original IDs from the API response.

3. global_fuzzy_keyword_search rewrite eliminates a distinct data source
The original implementation used datasources_config.json for field-value fuzzy matching — a genuinely different retrieval path from the public API. The new version just calls general_search() with OR-joined keywords, which is nearly identical to what KSSearchAgent already does via general_search_async(). This results in two near-duplicate API calls and loses the local structured search entirely. Please justify this change or restore the original approach alongside the API call.
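To illustrate the second issue, here is a small sketch of how placeholder IDs remove the cross-list boost. The helper `rrf_scores` and the dataset IDs are hypothetical and only stand in for the real module and data; the overwrite mirrors the `fuzzy_{i}` pattern quoted above.

```python
from collections import defaultdict
from typing import Dict, Sequence


def rrf_scores(ranked_lists: Sequence[Sequence[str]], k: int = 60) -> Dict[str, float]:
    """Hypothetical stand-in for the RRF module: sum 1 / (k + rank) per ID."""
    scores: Dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)


# Made-up IDs: "ds_y" is returned by both the vector and the fuzzy search.
vector_ids = ["ds_x", "ds_y"]
fuzzy_real_ids = ["ds_y", "ds_z"]

# The same fuzzy results after the overwrite quoted in the review.
fuzzy_overwritten = [f"fuzzy_{i}" for i, _ in enumerate(fuzzy_real_ids)]

with_real = rrf_scores([vector_ids, fuzzy_real_ids])
with_placeholder = rrf_scores([vector_ids, fuzzy_overwritten])

# With real IDs, ds_y collects a contribution from both lists and ranks first.
print(max(with_real, key=with_real.get))
# With placeholders, ds_y only keeps its vector contribution and the same
# dataset shows up a second time under an unrelated ID.
print(sorted(with_placeholder))
```

With the original IDs there are three fused entries and ds_y is on top; with placeholders there are four entries and no ID receives more than one contribution.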

…ove ID overwrite, restore local fuzzy search
@zohaib-7035
Author

Hi @QuantumByte-01,
I’ve addressed the issues in this PR. Could you please review it again and let me know if anything is still missing?
Thanks!
