Feat/document utils #893

UmeshpJadhav · 2025-12-26T12:16:51Z

feat: implement document chunking and embedding utilities and integrate into core DocumentRetriever

PR Checklist

Please check if your PR fulfills the following requirements:

The commit message follows our guidelines: https://voltagent.dev/docs/community/contributing/#commit-convention

Bugs / Features

Related issue(s) linked
Tests for the changes have been added
Docs have been added / updated
Changesets have been added https://voltagent.dev/docs/community/contributing/#creating-a-changeset

What is the current behavior?

There are no built-in utilities for document chunking, embedding, or ingestion in the core framework, making it difficult to implement RAG workflows.

What is the new behavior?

This PR introduces a new @voltagent/documents package and integrates it with @voltagent/core.

New Package: packages/documents
- RecursiveCharacterTextSplitter: Smart text chunking with overlap.
- OpenAIEmbeddingModel: Wrapper for OpenAI embeddings.
- DocumentProcessor: Utility to split and embed text.
Core Integration:
- DocumentRetriever: New abstract class in core that adds ingest() capabilities.
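
To make "chunking with overlap" concrete, here is a minimal sketch of a fixed-size sliding window over characters. It is not the RecursiveCharacterTextSplitter from this PR (which, going by its name, is separator-aware); the function name `splitWithOverlap` and its option shape are assumptions made up for illustration.

```typescript
// Hypothetical helper for illustration only; not the @voltagent/documents API.
interface SplitOptions {
  chunkSize: number; // maximum characters per chunk
  chunkOverlap: number; // characters shared by neighbouring chunks
}

function splitWithOverlap(text: string, options: SplitOptions): string[] {
  const { chunkSize, chunkOverlap } = options;
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  if (!Number.isInteger(chunkOverlap) || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError("chunkOverlap must satisfy 0 <= chunkOverlap < chunkSize");
  }

  // Each window starts `step` characters after the previous one, so
  // consecutive chunks share exactly `chunkOverlap` characters.
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window reached the end
  }
  return chunks;
}

const exampleChunks = splitWithOverlap("abcdefghij", { chunkSize: 4, chunkOverlap: 2 });
console.log(exampleChunks); // ["abcd", "cdef", "efgh", "ghij"]
```

The `chunkOverlap < chunkSize` check matters because the step between windows is `chunkSize - chunkOverlap`; a step of zero or less would never advance through the text.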

fixes #6

Notes for reviewers

The core build may have pre-existing environment issues, but the new test-integration script verifies that the types and exports work correctly.
A new dependency on openai was added to packages/documents.

Summary by cubic

Adds document chunking and embeddings via a new @voltagent/documents package, and integrates ingestion into core DocumentRetriever to enable RAG workflows. Fixes #6.

New Features
- New @voltagent/documents package with RecursiveCharacterTextSplitter, OpenAIEmbeddingModel, and DocumentProcessor.
- Core retriever adds ingest(), default retrieve that embeds queries, empty-array safety, and abstract hooks: upsertDocuments() and queryVectors(). Also exports ProcessedDocument type.
- README and tests for documents utilities.
Migration
- Implement upsertDocuments() and queryVectors() in your retriever to store and search vectors.
- Set OPENAI_API_KEY for embeddings; optionally configure model and chunk sizes (chunkOverlap must be less than chunkSize).
- Ingest with processor.process(text, metadata) or retriever.ingest(text, metadata).
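
As a rough illustration of the two hooks named in the migration notes, the sketch below stores vectors in memory and ranks them by cosine similarity. The method names `upsertDocuments` and `queryVectors` come from the summary above; their signatures, the `StoredDocument` shape, and the `InMemoryVectorStore` class are assumptions, and the real abstract methods in @voltagent/core may look different.

```typescript
// Assumed shapes for illustration; check the actual DocumentRetriever
// definition before copying any of this.
interface StoredDocument {
  id: string;
  text: string;
  embedding: number[];
  metadata?: Record<string, unknown>;
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // zero vectors match nothing
}

class InMemoryVectorStore {
  private docs = new Map<string, StoredDocument>();

  // Insert new documents, or overwrite existing ones that share an id.
  async upsertDocuments(documents: StoredDocument[]): Promise<void> {
    for (const doc of documents) {
      this.docs.set(doc.id, doc);
    }
  }

  // Return the k stored documents most similar to the query vector.
  async queryVectors(vector: number[], k = 3): Promise<StoredDocument[]> {
    return [...this.docs.values()]
      .map((doc) => ({ doc, score: cosineSimilarity(vector, doc.embedding) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, k)
      .map(({ doc }) => doc);
  }
}
```

A production retriever would delegate both methods to a real vector database; the linear scan here only shows which responsibilities the hooks are expected to cover.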

Written for commit 112194c. Summary will update on new commits.


changeset-bot · 2025-12-26T12:16:54Z

🦋 Changeset detected

Latest commit: 112194c

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 2 packages

Name	Type
@voltagent/documents	Minor
@voltagent/core	Minor


cubic-dev-ai

5 issues found across 17 files

Prompt for AI agents (all issues)


Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="packages/documents/package.json">

<violation number="1" location="packages/documents/package.json:35">
P1: Version mismatch: vitest `^1.0.0` conflicts with project-level `^3.2.4`. In a monorepo with syncpack, dependency versions should be consistent across packages to avoid compatibility issues.</violation>

<violation number="2" location="packages/documents/package.json:36">
P2: Version mismatch: @types/node `^20.0.0` conflicts with project-level `^24.2.1`. Consider aligning with the monorepo's standard version for consistent type definitions.</violation>
</file>

<file name="packages/documents/src/DocumentProcessor.ts">

<violation number="1" location="packages/documents/src/DocumentProcessor.ts:23">
P2: Missing validation that `embeddings.length` matches `chunks.length`. If the embedding model returns fewer embeddings than expected, `embeddings[index]` could be `undefined`, causing silent data corruption in the returned `ProcessedDocument[]`. Consider adding a length validation check after fetching embeddings.</violation>
</file>

<file name="packages/documents/src/text-splitters/TextSplitter.ts">

<violation number="1" location="packages/documents/src/text-splitters/TextSplitter.ts:14">
P2: Missing validation for positive values. The validation checks that `chunkOverlap &lt; chunkSize`, but doesn't ensure `chunkSize &gt; 0` and `chunkOverlap &gt;= 0`. This allows invalid configurations like zero or negative values to pass.</violation>
</file>

<file name="packages/core/src/retriever/document-retriever.ts">

<violation number="1" location="packages/core/src/retriever/document-retriever.ts:56">
P1: Accessing `input[input.length - 1]` will throw a TypeError if `input` is an empty array. Add a guard to handle this edge case.</violation>
</file>
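
For readers following along, two of the guards requested above (the embedding-count check and the empty-input check) might look roughly like the sketch below. The helper names `zipChunksWithEmbeddings` and `extractQuery`, and the `ChatMessage` and `EmbeddedChunk` types, are hypothetical and not taken from the PR's code.

```typescript
// Hypothetical helpers illustrating the review feedback; not the PR's code.
interface EmbeddedChunk {
  text: string;
  embedding: number[];
}

// Fail loudly if the embedding model returned a different number of vectors
// than the number of chunks sent, instead of pairing chunks with `undefined`.
function zipChunksWithEmbeddings(chunks: string[], embeddings: number[][]): EmbeddedChunk[] {
  if (embeddings.length !== chunks.length) {
    throw new Error(
      `Expected ${chunks.length} embeddings but received ${embeddings.length}`,
    );
  }
  return chunks.map((text, index) => ({ text, embedding: embeddings[index] }));
}

interface ChatMessage {
  role: string;
  content: string;
}

// Derive the query text from either a plain string or a message list.
// An empty list yields an empty query rather than reading a property
// of `undefined`.
function extractQuery(input: string | ChatMessage[]): string {
  if (typeof input === "string") return input;
  if (input.length === 0) return "";
  return input[input.length - 1].content;
}
```

Whether an empty message list should produce an empty query, an empty result, or an error is a design decision for the retriever; the sketch picks the first only to keep the example short.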

Since this is your first cubic review, here's how it works:

cubic automatically reviews your code and comments on bugs and improvements
Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
Ask questions if you need clarification on any suggestion

Reply to cubic to teach it or ask questions. Tag @cubic-dev-ai to re-run a review.


omeraplak

Hey, thanks, the PR looks great!
I actually started working on the packages/rag package a few weeks ago. Do you think we should include this package as well?

Also, would you like to add it to the docs here?
https://voltagent.dev/docs/rag/overview/

cubic-dev-ai

2 issues found across 1 file (changes from recent commits).

Prompt for AI agents (all issues)


Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="website/docs/rag/overview.md">

<violation number="1" location="website/docs/rag/overview.md:257">
P1: Incorrect API usage in documentation example. `RecursiveChunker` constructor accepts an optional `Tokenizer`, not options. Options like `maxTokens` and `overlapTokens` should be passed to the `chunk()` method. Also, `chunk()` is synchronous, not async.</violation>

<violation number="2" location="website/docs/rag/overview.md:264">
P1: Incorrect API usage in documentation example. `MarkdownChunker` constructor accepts an optional `Tokenizer`, not options. Options like `maxTokens` should be passed to the `chunk()` method. Also, `chunk()` is synchronous, not async.</violation>
</file>
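
The call shape the review describes (tokenizer in the constructor, options on `chunk()`, synchronous return) can be sketched with a stand-in class. `SketchChunker` and the function-style `Tokenizer` type below are assumptions written only to show that shape; they are not the @voltagent/rag implementation, and the real `Tokenizer` type and option names should be checked against that package.

```typescript
// Stand-in types mirroring the shape described in the review above.
type Tokenizer = (text: string) => string[];

interface ChunkCallOptions {
  maxTokens: number;
  overlapTokens?: number;
}

class SketchChunker {
  private readonly tokenizer: Tokenizer;

  // The constructor takes an optional tokenizer, not chunking options.
  constructor(tokenizer?: Tokenizer) {
    this.tokenizer = tokenizer ?? ((text) => text.split(/\s+/).filter(Boolean));
  }

  // Synchronous: returns the chunks directly rather than a Promise.
  chunk(text: string, options: ChunkCallOptions): string[] {
    const { maxTokens, overlapTokens = 0 } = options;
    if (maxTokens <= 0 || overlapTokens < 0 || overlapTokens >= maxTokens) {
      throw new RangeError("require maxTokens > 0 and 0 <= overlapTokens < maxTokens");
    }
    const tokens = this.tokenizer(text);
    const step = maxTokens - overlapTokens;
    const chunks: string[] = [];
    for (let i = 0; i < tokens.length; i += step) {
      chunks.push(tokens.slice(i, i + maxTokens).join(" "));
      if (i + maxTokens >= tokens.length) break;
    }
    return chunks;
  }
}

const sketchChunker = new SketchChunker(); // no options here
const parts = sketchChunker.chunk("one two three four five", {
  maxTokens: 3,
  overlapTokens: 1,
}); // no await needed
console.log(parts); // ["one two three", "three four five"]
```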

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.


UmeshpJadhav · 2025-12-30T12:40:33Z

@omeraplak
Thanks! I've updated the documentation in docs/rag/overview to include a new section on "Advanced Chunking" that highlights the @voltagent/rag package and its capabilities.

I've also resolved the recent merge conflicts and fixed the linting issues. The PR should be good to go now!

UmeshpJadhav added 2 commits December 26, 2025 17:29

feat: add packages/documents with splitter/embedder and integrate int…

f5a111d

…o core DocumentRetriever

docs: add changeset for document utilities feature

dcb68f4

cubic-dev-ai bot reviewed Dec 26, 2025


UmeshpJadhav added 3 commits December 26, 2025 17:53

fix: address PR feedback (validation and edge cases)

9fc5c34

- Handle empty array in DocumentRetriever.retrieve - Validate chunkSize and chunkOverlap in TextSplitter

Merge branch 'main' of github.com:VoltAgent/voltagent into feat/docum…

7eb6e66

…ent-utils

Merge branch 'main' of github.com:VoltAgent/voltagent into feat/docum…

af1b163

…ent-utils

UmeshpJadhav mentioned this pull request Dec 27, 2025

Document Chunking & Embedding Utilities #6

Closed

fix(core): export ProcessedDocument type from retriever module

52142d4

omeraplak approved these changes Dec 30, 2025


UmeshpJadhav requested a review from omeraplak December 30, 2025 03:04

UmeshpJadhav added 2 commits December 30, 2025 09:16

fix: resolve merge conflicts and suppress lint warnings

b0ed29e

docs(website): add advanced chunking section to rag overview

5fa09b9

cubic-dev-ai bot reviewed Dec 30, 2025


UmeshpJadhav and others added 2 commits December 30, 2025 18:09

Update website/docs/rag/overview.md

8fd206e

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

Update website/docs/rag/overview.md

1b58ba3

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

fix: resolve lockfile conflict and format upstream changes

112194c
