improvement(kb): improve chunkers, respect user-specified chunk configurations, added tests #2539
Merged
waleedlatif1 merged 4 commits into staging on Dec 23, 2025
Contributor
Greptile Summary
This PR refactors the chunking system to use consistent units (tokens vs characters) and respect user-specified chunk configurations across all chunker types. The changes improve clarity by renaming parameters (
Key improvements:
Issues found:
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant User
participant UI as CreateBaseModal
participant API as /api/knowledge
participant Service as DocumentService
participant Processor as DocumentProcessor
participant Chunker as TextChunker/JsonYamlChunker/StructuredDataChunker
User->>UI: Configure chunking (maxSize, minSize, overlap)
Note over UI: Units: maxSize=tokens, minSize=chars, overlap=tokens
UI->>UI: Validate: minSize < (maxSize × 4)
UI->>API: POST with chunkingConfig
API->>API: Validate with Zod schema
Note over API: maxSize: 100-4000 tokens<br/>minSize: 1-2000 chars<br/>overlap: 0-500 tokens
API->>Service: Create KB with config
User->>Service: Upload document
Service->>Processor: processDocument(chunkSize, chunkOverlap, minCharactersPerChunk)
Note over Processor: Maps config:<br/>maxSize→chunkSize<br/>overlap→chunkOverlap<br/>minSize→minCharactersPerChunk
Processor->>Processor: Detect file type
alt JSON/YAML
Processor->>Chunker: JsonYamlChunker(chunkSize, minCharactersPerChunk)
Chunker->>Chunker: Split by structure, filter by minCharactersPerChunk
else CSV/XLSX
Processor->>Chunker: StructuredDataChunker(chunkSize)
Chunker->>Chunker: Calculate rows/chunk based on chunkSize
else Text/Markdown
Processor->>Chunker: TextChunker(chunkSize, chunkOverlap, minCharactersPerChunk)
Chunker->>Chunker: Clamp overlap to 50% of chunkSize
Chunker->>Chunker: Split hierarchically by separators
Chunker->>Chunker: Add overlap (tokens→chars conversion)
Chunker->>Chunker: Calculate metadata (startIndex, endIndex)
end
Chunker-->>Processor: Return chunks with token counts
Processor-->>Service: Return processed chunks
Service->>Service: Generate embeddings
Service->>Service: Store in vector DB
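The validation, mapping, and clamping steps in the diagram can be sketched in plain TypeScript. This is a minimal illustration, not the PR's actual code: the interface names, function names, error messages, and the constant of 4 characters per token are assumptions inferred from the diagram notes (the PR itself validates with a Zod schema, which is not reproduced here).

```typescript
// Hypothetical shapes: field names follow the diagram, everything else is assumed.
interface ChunkingConfig {
  maxSize: number // tokens
  minSize: number // characters
  overlap: number // tokens
}

interface ChunkerOptions {
  chunkSize: number // tokens
  chunkOverlap: number // tokens
  minCharactersPerChunk: number // characters
}

// Assumed heuristic, implied by the UI check "minSize < (maxSize × 4)".
const CHARS_PER_TOKEN = 4

// Mirrors the ranges in the diagram's API note; returns a list of problems.
function validateChunkingConfig(config: ChunkingConfig): string[] {
  const errors: string[] = []
  if (config.maxSize < 100 || config.maxSize > 4000) {
    errors.push('maxSize must be between 100 and 4000 tokens')
  }
  if (config.minSize < 1 || config.minSize > 2000) {
    errors.push('minSize must be between 1 and 2000 characters')
  }
  if (config.overlap < 0 || config.overlap > 500) {
    errors.push('overlap must be between 0 and 500 tokens')
  }
  // minSize is in characters and maxSize in tokens, so convert before comparing.
  if (config.minSize >= config.maxSize * CHARS_PER_TOKEN) {
    errors.push('minSize must be smaller than maxSize × 4 characters')
  }
  return errors
}

// Processor step: maxSize→chunkSize, overlap→chunkOverlap, minSize→minCharactersPerChunk.
function toChunkerOptions(config: ChunkingConfig): ChunkerOptions {
  return {
    chunkSize: config.maxSize,
    chunkOverlap: config.overlap,
    minCharactersPerChunk: config.minSize,
  }
}

// Text chunker step: clamp overlap to 50% of chunkSize.
function clampOverlap(options: ChunkerOptions): ChunkerOptions {
  const maxOverlap = Math.floor(options.chunkSize * 0.5)
  return { ...options, chunkOverlap: Math.min(options.chunkOverlap, maxOverlap) }
}

// Text chunker step: convert the token overlap to characters before slicing text.
function overlapInChars(options: ChunkerOptions): number {
  return options.chunkOverlap * CHARS_PER_TOKEN
}
```

For example, a config of `{ maxSize: 100, minSize: 50, overlap: 80 }` passes validation under these assumptions, but the text chunker would clamp the overlap to 50 tokens, which is 200 characters.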
Collaborator
Author
@greptile
This was referenced Dec 23, 2025
waleedlatif1 added a commit that referenced this pull request Dec 23, 2025
…gurations, added tests (#2539)
* improvement(kb): improve chunkers, respect user-specified chunk configurations, added tests
* ack PR comments
* updated docs
* cleanup
Summary
minCharactersPerChunk, maxChunkSize, chunkOverlap
fixes #2510
Type of Change
Testing
Tested manually
Checklist