improvement(knowledge): remove innerJoin and add id identifiers to results, updated docs #1170
waleedlatif1 merged 3 commits into staging
Greptile Summary
This PR optimizes knowledge base search performance by separating document name retrieval from vector search operations. The key architectural change removes the expensive innerJoin operations between the embeddings and documents tables, which were causing the system to transform document records for every query over the embeddings table.
The optimization splits the search process into two phases: first performing the vector/tag search to get relevant chunks, then separately fetching document names for only the returned results using a new getDocumentNamesByIds utility function. This approach trades a small amount of additional complexity for significant performance gains, which matters most for large knowledge bases.
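The two-phase flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the in-memory `documentsTable`, the `ChunkHit` shape, and the `attachDocumentNames` helper are hypothetical, and the signature shown for `getDocumentNamesByIds` is an assumption (the real function presumably queries the documents table and is likely asynchronous).

```typescript
// Hypothetical in-memory stand-in for the documents table (id -> name).
const documentsTable = new Map<string, string>([
  ['doc-1', 'Onboarding Guide'],
  ['doc-2', 'API Reference'],
])

// Assumed shape of a row returned by phase 1 (vector/tag search over embeddings only).
interface ChunkHit {
  chunkId: string
  documentId: string
  content: string
  distance: number
}

// Phase 2: a single batched lookup over the distinct document ids in the results.
// Assumption: the real utility runs one query instead of reading from a Map.
function getDocumentNamesByIds(documentIds: string[]): Record<string, string> {
  const uniqueIds = Array.from(new Set(documentIds)) // deduplicate before the lookup
  const names: Record<string, string> = {}
  for (const id of uniqueIds) {
    const name = documentsTable.get(id)
    if (name !== undefined) names[id] = name
  }
  return names
}

// Hypothetical helper: merge the looked-up names back onto the chunk hits.
// Hits whose document is missing fall back to an empty name.
function attachDocumentNames(hits: ChunkHit[]) {
  const names = getDocumentNamesByIds(hits.map((hit) => hit.documentId))
  return hits.map((hit) => ({ ...hit, documentName: names[hit.documentId] ?? '' }))
}
```

The point of the design is that the name lookup scales with the number of returned chunks rather than with the size of the embeddings table.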
Additional improvements include standardizing field naming conventions throughout the knowledge base API (id → chunkId, id/name → documentId/documentName) and updating response structures to be more consistent. The changes span multiple files including the core search utilities, API routes, tool configurations, type definitions, and corresponding tests.
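As a rough illustration of the renaming, the sketch below maps an assumed legacy result shape onto the new explicit identifiers. Only the field names chunkId, documentId, and documentName come from the summary above; the `LegacyResult` layout, the `SearchResult` interface, and the `toSearchResult` helper are hypothetical and may not match the actual types in the PR.

```typescript
// Assumed legacy shape: ambiguous `id` and `name` fields (hypothetical layout).
interface LegacyResult {
  id: string // chunk id
  content: string
  document: { id: string; name: string }
}

// New shape with explicit identifiers, using the field names from the summary.
interface SearchResult {
  chunkId: string
  documentId: string
  documentName: string
  content: string
}

// Hypothetical mapper showing how each ambiguous field gets an explicit name.
function toSearchResult(legacy: LegacyResult): SearchResult {
  return {
    chunkId: legacy.id,
    documentId: legacy.document.id,
    documentName: legacy.document.name,
    content: legacy.content,
  }
}
```

With explicit names, a consumer no longer has to know from context whether an `id` refers to a chunk or to its parent document.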
The refactoring maintains the same external API contract while fundamentally improving the underlying query efficiency. Database queries now focus on embeddings data without expensive joins, and document metadata is retrieved through optimized batch lookups with deduplication.
Confidence score: 4/5
- This PR is safe to merge with good confidence as it maintains API compatibility while improving performance
- Score reflects well-structured performance optimization with proper test coverage and consistent type updates
- Pay close attention to apps/sim/app/api/knowledge/search/utils.ts for the core database query changes and ensure the new getDocumentNamesByIds function handles edge cases properly
8 files reviewed, no comments
…sults, updated docs (#1170) * improvement(knowledge): remove innerJoin and add id identifiers to results, updated docs * cleanup * add documentName to upload chunk op as well
Summary
Split document name retrieval out of vector search and removed the innerJoin, a costly DB operation that caused us to transform the document record for every query over the embeddings table.
Testing
Tested manually.