[ROADMAP][1.2] Improve retrieval by combining chunk and document search with RRF #550

@MODSetter

Feature Description

Enhance the retrieval system by combining chunk-level hybrid search and document-level hybrid search using Reciprocal Rank Fusion (RRF), then fetching whole documents based on the most relevant chunks. This creates a more comprehensive retrieval pipeline.

Target Deployment

  • SurfSense Cloud (hosted version)
  • Self-hosted version

Problem Statement

Currently, chunk-level and document-level searches operate somewhat independently. This can lead to:

  • Missing context when only chunks are returned
  • Suboptimal ranking when document and chunk relevance aren't combined
  • Incomplete information for the LLM to generate comprehensive answers

Proposed Solution

  1. Perform dual hybrid search: Execute both chunk-level and document-level hybrid searches
  2. Apply RRF fusion: Combine results using Reciprocal Rank Fusion to create a unified ranking
  3. Fetch whole documents: Retrieve complete documents for the top-ranked chunks
  4. Preserve chunk metadata: Maintain chunk boundaries and positions within documents for citation purposes
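Steps 3 and 4 could look roughly like the sketch below. It assumes fusion has already produced a list of (document_id, score) pairs. `ChunkHit`, `EnrichedDocument`, `attach_documents`, and the in-memory `document_store` dict are hypothetical names used for illustration only, not existing SurfSense code, and character offsets are just one possible way to record chunk positions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ChunkHit:
    chunk_id: str
    document_id: str
    start: int  # character offset where the chunk begins in the document
    end: int    # offset one past the chunk's last character


@dataclass
class EnrichedDocument:
    document_id: str
    score: float
    content: str
    matched_chunks: list[ChunkHit] = field(default_factory=list)


def attach_documents(fused, chunk_hits, document_store, top_n=3):
    """Fetch whole documents for the top-N fused results and attach their chunks."""
    # Group chunk hits by parent document so positions survive the fetch.
    chunks_by_doc: dict[str, list[ChunkHit]] = {}
    for hit in chunk_hits:
        chunks_by_doc.setdefault(hit.document_id, []).append(hit)

    results = []
    for document_id, score in fused[:top_n]:
        content = document_store.get(document_id)
        if content is None:
            continue  # document no longer available; skip it
        # Keep matched chunks in document order for citation purposes.
        matched = sorted(chunks_by_doc.get(document_id, []), key=lambda h: h.start)
        results.append(EnrichedDocument(document_id, score, content, matched))
    return results


# Toy data standing in for the database and the search results.
document_store = {"doc_a": "Alpha section. Beta section.", "doc_b": "Gamma."}
chunk_hits = [
    ChunkHit("c2", "doc_a", 15, 28),
    ChunkHit("c1", "doc_a", 0, 14),
    ChunkHit("c3", "doc_b", 0, 6),
]
fused = [("doc_a", 0.03), ("doc_b", 0.02), ("doc_missing", 0.01)]
enriched = attach_documents(fused, chunk_hits, document_store)
```

Each returned object carries the full document text plus the chunks that matched, so a citation can point at `content[hit.start:hit.end]` rather than at the document as a whole.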

RRF Formula

RRF_score(d) = Σ_i 1 / (k + rank_i(d))

Where k is a smoothing constant (typically 60), rank_i(d) is the 1-based rank of document d in the i-th result list, and the sum runs over every result list that contains d.
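As a minimal sketch of the formula, the function below fuses any number of ranked id lists. The name `rrf_fuse` and the sample ids are illustrative assumptions. It also assumes ranks are 1-based, that each input list contains an id at most once, and that an id missing from a list receives no contribution from it.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Sequence


def rrf_fuse(
    rankings: Iterable[Sequence[Hashable]], k: int = 60
) -> list[tuple[Hashable, float]]:
    """Fuse best-first ranked lists of ids with Reciprocal Rank Fusion."""
    scores: dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based, so the top item contributes 1 / (k + 1).
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by id so the order is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], str(kv[0])))


# Two hypothetical result lists, already expressed as document ids.
chunk_based = ["doc_a", "doc_b", "doc_c"]
document_based = ["doc_b", "doc_d", "doc_a"]
fused = rrf_fuse([chunk_based, document_based], k=60)
```

With k = 60, doc_b scores 1/62 + 1/61 and doc_a scores 1/61 + 1/63, so doc_b ranks first even though doc_a tops one of the lists. Documents that appear in both lists outrank those that appear in only one.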

Benefits

  • More comprehensive context for LLM responses
  • Better relevance through multi-signal ranking
  • Improved citation accuracy with full document context
  • Foundation for advanced RAG techniques

Use Case Examples

  1. User asks a complex question requiring information spread across multiple sections of a document
  2. Research query where chunk-level matches indicate document relevance
  3. Citation-heavy responses needing full document context

Implementation Considerations

  • This may require backend changes (retriever pipeline)
  • This may require database changes
  • This may affect existing features (response generation, citations)

Files Likely Affected

  • surfsense_backend/app/retriever/ - Retriever implementations
  • surfsense_backend/app/agents/ - Agent retrieval logic
  • surfsense_backend/app/services/ - Search service layer

Acceptance Criteria

  • Chunk and document hybrid searches execute in parallel
  • RRF fusion produces a unified ranked list
  • Whole documents are fetched for top-N results
  • Chunk positions within documents are preserved
  • Performance is acceptable (consider caching strategies)
  • API returns enriched document objects with chunk metadata

Technical Notes

  • Consider async parallel execution of chunk and document searches
  • Implement configurable RRF k parameter
  • Add metrics/logging for fusion quality analysis
  • Consider document deduplication in final results
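One way the notes above might fit together is sketched here using `asyncio.gather` from the standard library. The two search coroutines are placeholders that return hard-coded parent document ids; the real retrievers, their signatures, and the choice to collapse chunk hits to document ids before fusion are all assumptions.

```python
import asyncio


async def chunk_hybrid_search(query: str) -> list[str]:
    # Placeholder for the real chunk-level hybrid search.
    # Returns the parent document id of each chunk hit, so ids can repeat.
    await asyncio.sleep(0.01)
    return ["doc_a", "doc_b", "doc_a"]


async def document_hybrid_search(query: str) -> list[str]:
    # Placeholder for the real document-level hybrid search.
    await asyncio.sleep(0.01)
    return ["doc_b", "doc_c"]


def dedupe_keep_order(ids: list[str]) -> list[str]:
    # dict preserves insertion order, so the best rank of each id is kept.
    return list(dict.fromkeys(ids))


async def dual_search(query: str, k: int = 60) -> list[tuple[str, float]]:
    # Run both searches concurrently instead of one after the other.
    chunk_docs, docs = await asyncio.gather(
        chunk_hybrid_search(query), document_hybrid_search(query)
    )
    scores: dict[str, float] = {}
    for ranking in (dedupe_keep_order(chunk_docs), dedupe_keep_order(docs)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


results = asyncio.run(dual_search("example query", k=60))
```

Exposing `k` as a parameter keeps it configurable. Deduplicating each list before fusion prevents a document with many matching chunks from being counted several times within a single list.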

Related Issues

  • Depends on: Issue 1.1 (Time-based filtering)
  • Blocks: Issue 1.3 (Citation prompt updates)

Metadata

Projects

Status

Done

Milestone

No milestone
