Skip to content

improvement(knowledge): remove innerJoin and add id identifiers to results, updated docs#1170

Merged
waleedlatif1 merged 3 commits intostagingfrom
improvement/kb-search
Aug 29, 2025
Merged

improvement(knowledge): remove innerJoin and add id identifiers to results, updated docs#1170
waleedlatif1 merged 3 commits intostagingfrom
improvement/kb-search

Conversation

@waleedlatif1
Copy link
Collaborator

Summary

split document name retrieval out of vector search and removed innerJoin, costly DB operation that causes us to transform the document record for every query over the embeddings table

Type of Change

  • Other: Performance

Testing

Tested manually.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel
Copy link

vercel bot commented Aug 28, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
sim Ready Ready Preview Comment Aug 28, 2025 11:45pm
1 Skipped Deployment
Project Deployment Preview Comments Updated (UTC)
docs Skipped Skipped Aug 28, 2025 11:45pm

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Summary

This PR implements a significant performance optimization for the knowledge base search functionality by separating document name retrieval from vector search operations. The key architectural change removes expensive innerJoin operations between the embeddings and documents tables, which was causing the system to transform document records for every query over the embeddings table.

The optimization works by splitting the search process into two phases: first performing the vector/tag search to get relevant chunks, then separately fetching document names for only the returned results using a new getDocumentNamesByIds utility function. This approach trades a small amount of additional complexity for significant performance gains, especially important for large knowledge bases.

Additional improvements include standardizing field naming conventions throughout the knowledge base API (idchunkId, id/namedocumentId/documentName) and updating response structures to be more consistent. The changes span multiple files including the core search utilities, API routes, tool configurations, type definitions, and corresponding tests.

The refactoring maintains the same external API contract while fundamentally improving the underlying query efficiency. Database queries now focus on embeddings data without expensive joins, and document metadata is retrieved through optimized batch lookups with deduplication.

Confidence score: 4/5

  • This PR is safe to merge with good confidence as it maintains API compatibility while improving performance
  • Score reflects well-structured performance optimization with proper test coverage and consistent type updates
  • Pay close attention to apps/sim/app/api/knowledge/search/utils.ts for the core database query changes and ensure the new getDocumentNamesByIds function handles edge cases properly

8 files reviewed, no comments

Edit Code Review Bot Settings | Greptile

@waleedlatif1 waleedlatif1 merged commit fcf128f into staging Aug 29, 2025
5 checks passed
waleedlatif1 added a commit that referenced this pull request Aug 29, 2025
…sults, updated docs (#1170)

* improvement(knowledge): remove innerJoin and add id identifiers to results, updated docs

* cleanup

* add documentName to upload chunk op as well
@waleedlatif1 waleedlatif1 deleted the improvement/kb-search branch August 29, 2025 05:01
arenadeveloper02 pushed a commit to arenadeveloper02/p2-sim that referenced this pull request Sep 19, 2025
…sults, updated docs (simstudioai#1170)

* improvement(knowledge): remove innerJoin and add id identifiers to results, updated docs

* cleanup

* add documentName to upload chunk op as well
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant