fix(kb-uploads): created knowledge, chunks, tags services and use redis for queueing docs in kb #1143
Merged
waleedlatif1 merged 18 commits into staging on Aug 27, 2025
Greptile Summary
This PR implements a comprehensive refactoring of the knowledge base system, moving from in-memory document processing to a Redis-based queueing architecture for better serverless compatibility. The changes create dedicated service layers for knowledge base operations, chunks, and tags, following proper separation of concerns principles.
Key architectural improvements:
- Service Layer Architecture: Extracted complex database operations from API routes into dedicated service modules (`@/lib/knowledge/service.ts`, `@/lib/knowledge/documents/service.ts`, `@/lib/knowledge/chunks/service.ts`, `@/lib/knowledge/tags/service.ts`)
- Redis Queue Implementation: Added a `DocumentProcessingQueue` class with Redis-backed job queuing and a fallback to in-memory processing for document processing workflows
- File Type Support Expansion: Added parsers for DOC, TXT, and MD files with comprehensive UTF-8 sanitization across all parsers to prevent PostgreSQL encoding issues
- Multipart Upload Support: Implemented batch presigned URL generation and multipart uploads for both S3 and Azure Blob storage to handle large file uploads efficiently
- Enhanced Upload UX: Added real-time progress tracking, file-specific status indicators, and improved error handling in the knowledge base creation modal
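The summary does not show the UTF-8 sanitization itself. As a rough sketch of the idea, and not the PR's actual code, a sanitizer might drop NUL characters (which PostgreSQL text columns reject) and replace unpaired UTF-16 surrogates (which cannot be encoded as valid UTF-8) with U+FFFD. The function name `sanitizeForUtf8` is hypothetical.

```typescript
// Hypothetical helper, not taken from the PR: makes parsed text safe to store
// in a PostgreSQL text column.
function sanitizeForUtf8(text: string): string {
  let out = ''
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i)

    // PostgreSQL rejects NUL (0x00) inside text values, so drop it.
    if (code === 0) continue

    // High surrogate: keep it only when a low surrogate follows.
    if (code >= 0xd800 && code <= 0xdbff) {
      const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0
      if (next >= 0xdc00 && next <= 0xdfff) {
        out += text.charAt(i) + text.charAt(i + 1)
        i++ // skip the low surrogate that was just copied
      } else {
        out += '\ufffd' // unpaired high surrogate
      }
      continue
    }

    // Low surrogate with no preceding high surrogate.
    if (code >= 0xdc00 && code <= 0xdfff) {
      out += '\ufffd'
      continue
    }

    out += text.charAt(i)
  }
  return out
}
```

A parser could run its extracted text through such a function before chunking, so every file type shares one sanitization path.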
Code organization improvements:
- Moved document processing utilities from `@/lib/documents/` to `@/lib/knowledge/documents/` for better domain organization
- Consolidated embedding utilities into `@/lib/embeddings/utils`
- Added comprehensive TypeScript interfaces for all knowledge base operations
- Implemented proper validation for file uploads with centralized file type checking
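Centralized file type checking could look roughly like the sketch below. This is an illustration only: DOC, TXT, and MD come from the summary, while the other extensions, the size limit, and the names `SUPPORTED_EXTENSIONS`, `getExtension`, and `validateUpload` are assumptions.

```typescript
// Assumed list: 'doc', 'txt', and 'md' are named in the summary; 'pdf' and
// 'docx' are hypothetical placeholders for previously supported types.
const SUPPORTED_EXTENSIONS: readonly string[] = ['pdf', 'docx', 'doc', 'txt', 'md']

interface ValidationResult {
  ok: boolean
  extension: string
  reason?: string
}

// Returns the lowercase extension, or '' when the name has none.
function getExtension(fileName: string): string {
  const dot = fileName.lastIndexOf('.')
  if (dot <= 0 || dot === fileName.length - 1) return ''
  return fileName.slice(dot + 1).toLowerCase()
}

// Single entry point that both the upload UI and the API route could call.
// The 100 MB default limit is a made-up value for illustration.
function validateUpload(
  fileName: string,
  sizeBytes: number,
  maxBytes: number = 100 * 1024 * 1024
): ValidationResult {
  const extension = getExtension(fileName)
  if (SUPPORTED_EXTENSIONS.indexOf(extension) === -1) {
    return { ok: false, extension, reason: 'unsupported file type' }
  }
  if (sizeBytes > maxBytes) {
    return { ok: false, extension, reason: 'file too large' }
  }
  return { ok: true, extension }
}
```

Keeping the list and the check in one module means a new parser only needs one entry added to be accepted everywhere.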
Infrastructure changes:
- Added `word-extractor` dependency for DOC file parsing
- Updated upload strategies to use batch processing (reduced from 15 to 8 concurrent files)
- Implemented retry mechanisms with exponential backoff for external API calls
- Added comprehensive UTF-8 text sanitization utilities
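For readers unfamiliar with the retry pattern named above, here is a minimal sketch of retrying with exponential backoff. It is not the PR's implementation: `retryWithBackoff`, `backoffDelay`, the option names, and the default values are all assumptions, and a production version would likely add jitter and only retry errors known to be transient.

```typescript
interface RetryOptions {
  maxRetries: number // retries after the first attempt
  baseDelayMs: number // delay before the first retry
  maxDelayMs: number // upper bound on any single delay
}

// Delay doubles with each attempt and is capped at maxDelayMs.
function backoffDelay(attempt: number, opts: RetryOptions): number {
  return Math.min(opts.maxDelayMs, opts.baseDelayMs * Math.pow(2, attempt))
}

function defaultSleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms))
}

// Runs `operation`, retrying on any thrown error until retries are exhausted.
// `sleep` is injectable so tests do not have to wait in real time.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  opts: RetryOptions = { maxRetries: 3, baseDelayMs: 1000, maxDelayMs: 30000 },
  sleep: (ms: number) => Promise<void> = defaultSleep
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt <= opts.maxRetries; attempt++) {
    try {
      return await operation()
    } catch (error) {
      lastError = error
      if (attempt === opts.maxRetries) break
      await sleep(backoffDelay(attempt, opts))
    }
  }
  throw lastError
}
```

An embedding or storage call would be wrapped as `retryWithBackoff(() => callExternalApi())`, where `callExternalApi` stands in for whatever request needs protecting.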
The refactoring maintains API compatibility while dramatically improving scalability, maintainability, and reliability for knowledge base operations in serverless environments.
Confidence score: 3/5
- This PR requires careful review due to significant architectural changes and complex Redis queue implementation
- Score reflects concerns about error handling in async streaming logic, potential memory leaks in infinite loops, and some missing test coverage for new service layers
- Pay close attention to the Redis queue implementation, service layer error handling, and the removal of some existing test files without replacement
66 files reviewed, 25 comments
arenadeveloper02 pushed a commit to arenadeveloper02/p2-sim that referenced this pull request on Sep 19, 2025
…is for queueing docs in kb (simstudioai#1143)
* improvement(kb): created knowledge, chunks, tags services and use redis for queueing docs in kb
* moved directories around
* cleanup
* bulk create document records after upload is completed
* fix(copilot): send api key to sim agent (simstudioai#1142)
* Fix api key auth
* Lint
* ack PR comments
* added sort by functionality for headers in kb table
* updated
* test fallback from redis, fix styling
* cleanup copilot, fixed tooltips
* feat: local auto layout (simstudioai#1144)
* feat: added llms.txt and robots.txt (simstudioai#1145)
* fix(condition-block): edges not following blocks, duplicate issues (simstudioai#1146)
* fix(condition-block): edges not following blocks, duplicate issues
* add subblock update to setActiveWorkflow
* Update apps/sim/app/workspace/[workspaceId]/w/[workflowId]/components/workflow-block/components/sub-block/components/condition-input.tsx
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
---------
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* fix dependency array
* fix(copilot-cleanup): support azure blob upload in copilot, remove dead code & consolidate other copilot files (simstudioai#1147)
* cleanup
* support azure blob image upload
* imports cleanup
* PR comments
* ack PR comments
* fix key validation
* improvement(forwarding+excel): added forwarding and improve excel read (simstudioai#1136)
* added forwarding for outlook
* lint
* improved excel sheet read
* addressed greptile
* fixed bodytext getting truncated
* fixed any type
* added html func
---------
Co-authored-by: Adam Gough <adamgough@Mac.attlocal.net>
* revert agent const
* update docs
---------
Co-authored-by: Siddharth Ganesan <33737564+Sg312@users.noreply.github.com>
Co-authored-by: Emir Karabeg <78010029+emir-karabeg@users.noreply.github.com>
Co-authored-by: Vikhyath Mondreti <vikhyathvikku@gmail.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Vikhyath Mondreti <vikhyath@simstudio.ai>
Co-authored-by: Adam Gough <77861281+aadamgough@users.noreply.github.com>
Co-authored-by: Adam Gough <adamgough@Mac.attlocal.net>
Summary
Created knowledge, chunks, and tags services, and switched to Redis for queueing documents in the knowledge base, since processing them in memory in a serverless environment was not the right approach.
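The queueing change described above can be pictured with the following sketch of a queue that prefers a Redis-like store and falls back to an in-memory list when that store is missing or failing. It is illustrative only and does not reproduce the PR's `DocumentProcessingQueue`: the `JobStore` interface, `SketchDocumentQueue`, the job fields, and the key name are all hypothetical, and a real Redis client would need a small adapter to match `JobStore`.

```typescript
// Hypothetical minimal store contract; a Redis list (push/pop) could back it.
interface JobStore {
  push(key: string, value: string): Promise<void>
  pop(key: string): Promise<string | null>
}

// Hypothetical job shape for illustration.
interface DocumentJob {
  documentId: string
  knowledgeBaseId: string
}

// Process-local FIFO store used as the fallback.
class InMemoryStore implements JobStore {
  private lists = new Map<string, string[]>()

  async push(key: string, value: string): Promise<void> {
    const list = this.lists.get(key) || []
    list.push(value)
    this.lists.set(key, list)
  }

  async pop(key: string): Promise<string | null> {
    const list = this.lists.get(key)
    const value = list ? list.shift() : undefined
    return value === undefined ? null : value
  }
}

class SketchDocumentQueue {
  private readonly primary: JobStore | null
  private readonly fallback: JobStore = new InMemoryStore()
  private readonly key: string

  constructor(primary: JobStore | null, key: string = 'kb:document-jobs') {
    this.primary = primary
    this.key = key
  }

  // Reports which store accepted the job so callers can log fallbacks.
  async enqueue(job: DocumentJob): Promise<'primary' | 'fallback'> {
    const payload = JSON.stringify(job)
    if (this.primary) {
      try {
        await this.primary.push(this.key, payload)
        return 'primary'
      } catch {
        // Primary store unavailable: fall through to the in-memory list.
      }
    }
    await this.fallback.push(this.key, payload)
    return 'fallback'
  }

  // Drains the primary store first, then any jobs held in memory.
  async dequeue(): Promise<DocumentJob | null> {
    let raw: string | null = null
    if (this.primary) {
      try {
        raw = await this.primary.pop(this.key)
      } catch {
        raw = null
      }
    }
    if (raw === null) raw = await this.fallback.pop(this.key)
    return raw === null ? null : (JSON.parse(raw) as DocumentJob)
  }
}
```

In this sketch the in-memory path only survives as long as the process does, which is presumably why a shared store is preferred in serverless deployments and the fallback is best suited to local development.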
Type of Change
Testing
Tested manually, see below. Added/updated unit tests
Checklist
Screenshots/Videos
Screen.Recording.2025-08-26.at.6.00.25.PM.mov
Screen.Recording.2025-08-26.at.10.11.54.PM.mov