
Enhance RRF ranking in HybridContentRetriever with source-aware scoring #345

@haiphucnguyen

Background

The current HybridContentRetriever.reciprocalRankFusion() applies a uniform RRF formula across all retrieved content:

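score(doc) = Σ_i 1 / (k + rank_i(doc))

where rank_i(doc) is the rank of doc in the i-th ranked result list and k is a smoothing constant.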

This treats results from all knowledge sources equally, regardless of their type (local_folders, local_files, urls) or any source-specific signals such as recency, trust level, or proximity to the query context.
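
For reference, the uniform formula above can be sketched as follows. This is a minimal illustration, not the project's actual reciprocalRankFusion(): the function signature, the use of plain strings in place of Content, the 1-based ranks, and the default k = 60 are all assumptions made for the example.

```kotlin
// Hypothetical stand-in for the fusion step; not the project's actual code.
// Each inner list is one retriever's results, best match first.
fun reciprocalRankFusion(
    rankedLists: List<List<String>>,
    k: Int = 60 // assumption: a commonly used smoothing constant
): List<Pair<String, Double>> {
    // Accumulate 1 / (k + rank) per distinct text across all ranked lists.
    val scores = LinkedHashMap<String, Double>()
    for (ranking in rankedLists) {
        ranking.forEachIndexed { index, text ->
            val rank = index + 1 // assumption: ranks are 1-based
            scores[text] = (scores[text] ?: 0.0) + 1.0 / (k + rank)
        }
    }
    // Highest fused score first.
    return scores.entries
        .sortedByDescending { it.value }
        .map { it.key to it.value }
}
```

Nothing in this computation knows where a result came from: a text that ranks second in one list and first in another gets 1/62 + 1/61 regardless of which knowledge source produced it.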

Problem

KnowledgeSourceConfig already models distinct source types via its sealed subclasses:

  • LocalFoldersKnowledgeSourceConfig — watched directories (potentially stale if indexing is delayed)
  • LocalFilesKnowledgeSourceConfig — individual static files
  • UrlKnowledgeSourceConfig — crawled web pages (configurable depth/page count)

However, reciprocalRankFusion() in HybridContentRetriever.kt discards source metadata entirely. All Content objects are keyed only by their text segment, and RRF scores are computed without distinguishing which source type produced them.

Proposed Improvement

Introduce source-aware score boosting into the RRF algorithm. Concretely:

  • Attach the originating KnowledgeSourceConfig type (or a normalized source weight) to each Content result at retrieval time (e.g., via TextSegment metadata).
  • In reciprocalRankFusion(), apply a configurable multiplier per source type:
    adjusted_score = rrf_score * source_weight(sourceType)
  • Allow source weights to be defined in AppConfig.rag (e.g., sourceWeights: { local_files: 1.2, local_folders: 1.0, urls: 0.8 }).
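
A rough sketch of how the weighted fusion could look. All names here (SourceType, RetrievedContent, ScoredContent, sourceAwareRrf, defaultWeight) are hypothetical stand-ins for illustration; the real implementation would presumably read the source type from TextSegment metadata and the weights from AppConfig.rag.

```kotlin
// Hypothetical source types mirroring the keys in the proposed sourceWeights config;
// the real KnowledgeSourceConfig subclasses are not reproduced here.
enum class SourceType { LOCAL_FILES, LOCAL_FOLDERS, URLS }

// Stand-in for a Content result whose metadata carries its originating source type.
data class RetrievedContent(val text: String, val sourceType: SourceType)

data class ScoredContent(val content: RetrievedContent, val score: Double)

fun sourceAwareRrf(
    rankedLists: List<List<RetrievedContent>>,
    sourceWeights: Map<SourceType, Double>,
    k: Int = 60,               // assumption: same smoothing constant as plain RRF
    defaultWeight: Double = 1.0 // assumption: unconfigured sources stay neutral
): List<ScoredContent> {
    val rrfScores = LinkedHashMap<String, Double>()
    val byText = LinkedHashMap<String, RetrievedContent>()
    for (ranking in rankedLists) {
        ranking.forEachIndexed { index, content ->
            val rank = index + 1 // assumption: ranks are 1-based
            rrfScores[content.text] = (rrfScores[content.text] ?: 0.0) + 1.0 / (k + rank)
            // Keep the first occurrence so its source type survives the merge.
            byText.putIfAbsent(content.text, content)
        }
    }
    // adjusted_score = rrf_score * source_weight(sourceType)
    return rrfScores.map { (text, rrf) ->
        val content = byText.getValue(text)
        val weight = sourceWeights[content.sourceType] ?: defaultWeight
        ScoredContent(content, rrf * weight)
    }.sortedByDescending { it.score }
}

// Example weights taken from the sourceWeights suggestion above.
val exampleWeights = mapOf(
    SourceType.LOCAL_FILES to 1.2,
    SourceType.LOCAL_FOLDERS to 1.0,
    SourceType.URLS to 0.8
)
```

With these weights, a local file at rank 2 (1.2 / 62) outranks a watched-folder result at rank 1 (1.0 / 61), which plain RRF would order the other way around. One open question this sketch does not settle: if the same text is returned from two different source types, it simply keeps the type of the first occurrence.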
