feat: implement Reciprocal Rank Fusion (RRF) for hybrid search ranking #93
zohaib-7035 wants to merge 3 commits into INCF:main
Conversation
QuantumByte-01 left a comment
The RRF algorithm in rrf.py is correctly implemented and the tests are solid. Three issues to fix:
1. Move import to top of file (minor)
`from rrf import reciprocal_rank_fusion` is placed mid-file, after `execute_search`. Move it to the top with the other imports.
2. ID overwrite in global_fuzzy_keyword_search breaks RRF (critical)
```python
for i, item in enumerate(out):
    item["_id"] = f"fuzzy_{i}"
    item["id"] = f"fuzzy_{i}"
```

This overwrites the real dataset IDs with sequential placeholders. RRF boosts documents that appear in multiple lists, but with `fuzzy_0`, `fuzzy_1` IDs, no fuzzy result will ever match a vector or KS result by ID. The cross-list boosting, the entire point of RRF, is disabled. Remove these lines and keep the original IDs from the API response.
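A minimal sketch of the suggested fix, assuming the raw API items carry their real identifiers under an `_id` or `id` field (field names taken from the snippet above; the helper name `normalize_fuzzy_results` is hypothetical):

```python
def normalize_fuzzy_results(out: list[dict]) -> list[dict]:
    """Keep the dataset IDs returned by the API so RRF can match the
    same document across the fuzzy, vector, and KS result lists."""
    normalized = []
    for item in out:
        # Do NOT overwrite with sequential placeholders like f"fuzzy_{i}":
        # RRF boosts documents that appear in multiple lists, and placeholder
        # IDs can never match the IDs coming from the other retrieval paths.
        if "id" not in item and "_id" in item:
            item["id"] = item["_id"]  # fall back to the raw ES-style _id
        normalized.append(item)
    return normalized
```

With real IDs preserved, a dataset surfaced by both the fuzzy and the vector path keeps a single identity and receives the cross-list boost RRF is designed to give.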
3. global_fuzzy_keyword_search rewrite eliminates a distinct data source
The original implementation used datasources_config.json for field-value fuzzy matching — a genuinely different retrieval path from the public API. The new version just calls general_search() with OR-joined keywords, which is nearly identical to what KSSearchAgent already does via general_search_async(). This results in two near-duplicate API calls and loses the local structured search entirely. Please justify this change or restore the original approach alongside the API call.
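One way to keep the local structured path distinct from the API call is a fuzzy field-value scan over the config file. This is a hypothetical sketch only: the helper name, the flat list-of-dicts shape of `datasources_config.json`, and the 0.75 threshold are all assumptions, and `difflib` stands in for whatever fuzzy matcher the original code used.

```python
import json
from difflib import SequenceMatcher

def local_fuzzy_search(keywords: list[str],
                       config_path: str = "datasources_config.json",
                       threshold: float = 0.75) -> list[dict]:
    """Fuzzy field-value matching over the local datasource config,
    kept as a retrieval path distinct from the public API."""
    with open(config_path) as f:
        sources = json.load(f)
    hits = []
    for entry in sources:
        for value in entry.values():
            if not isinstance(value, str):
                continue
            # Best similarity of this field value against any query keyword.
            score = max(SequenceMatcher(None, kw.lower(), value.lower()).ratio()
                        for kw in keywords)
            if score >= threshold:
                hits.append({"id": entry.get("id"), "score": score})
                break  # one hit per entry is enough
    return sorted(hits, key=lambda h: h["score"], reverse=True)
```

Running this alongside (not instead of) `general_search()` would preserve both retrieval paths and give RRF two genuinely different lists to fuse.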
…ove ID overwrite, restore local fuzzy search
Hi @QuantumByte-01,
What does this PR do?
Implements Reciprocal Rank Fusion (RRF) to seamlessly combine keyword search results (Elasticsearch BM25) and semantic vector search results (Embeddings) into a single, high-quality ranked list. This addresses Issue #13 and aligns directly with the GSoC 2026 "Advanced RAG" project goals.
Why is this necessary?
Previously, the fuse_results logic merged the raw scores using a hardcoded weight formula (0.6 × similarity + 0.4 × keyword_score). Because these two scoring methods operate on completely different mathematical scales and distributions, the rankings were heavily skewed.

RRF resolves this by relying on pure positional ranking. Instead of comparing arbitrary scores, it grades datasets based on their rank placement (1 / (60 + rank)). This penalizes datasets that only match loosely in one system, and massively boosts datasets that appear highly ranked in both keyword and semantic contexts.

How was it implemented?
- Implemented `reciprocal_rank_fusion` with k=60. Includes safe ID extraction logic to handle differences between raw ES hits and Vertex AI vectors.
- Added a `pytest` test suite ensuring correct mathematical ranking, cross-list overlap boosting, and edge-case safety (empty lists, unaligned IDs).
- Passed `mypy` and `flake8` to adhere strictly to Python typing and PEP8 conventions.
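The positional scoring described above can be sketched as follows. This is not necessarily the PR's exact code in `rrf.py`; the result-dict shape (`"id"` key) and signature are assumptions, but the 1 / (k + rank) formula with k=60 matches the description:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[dict]], k: int = 60) -> list[dict]:
    """Fuse ranked lists by summing 1 / (k + rank) per document, so items
    ranked highly in both keyword and vector lists float to the top."""
    scores: dict[str, float] = defaultdict(float)
    docs: dict[str, dict] = {}
    for results in result_lists:
        for rank, item in enumerate(results, start=1):
            doc_id = item["id"]
            scores[doc_id] += 1.0 / (k + rank)   # pure positional score
            docs.setdefault(doc_id, item)        # keep first-seen payload
    # Highest fused score first.
    return [docs[i] for i in sorted(scores, key=scores.get, reverse=True)]
```

For example, a document ranked 2nd in the keyword list and 1st in the vector list scores 1/62 + 1/61, beating any document that appears in only one list, which is exactly the cross-list boosting the review above depends on.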