Conversation

@UmeshpJadhav UmeshpJadhav commented Dec 26, 2025

feat: implement document chunking and embedding utilities and integrate into core DocumentRetriever

PR Checklist

Please check if your PR fulfills the following requirements:

Bugs / Features

What is the current behavior?

There are no built-in utilities for document chunking, embedding, or ingestion in the core framework, making it difficult to implement RAG workflows.

What is the new behavior?

This PR introduces a new @voltagent/documents package and integrates it with @voltagent/core.

  • New Package: packages/documents
  • RecursiveCharacterTextSplitter: Smart text chunking with overlap.
  • OpenAIEmbeddingModel: Wrapper for OpenAI embeddings.
  • DocumentProcessor: Utility to split and embed text.
  • Core Integration:
    - DocumentRetriever: New abstract class in core that adds ingest() capabilities.
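
For readers unfamiliar with "chunking with overlap", here is a minimal sketch of the idea. It is a plain sliding-window splitter, not the PR's RecursiveCharacterTextSplitter (which presumably also tries separator boundaries first); the function name `splitWithOverlap` and its signature are assumptions made for illustration.

```typescript
// Hypothetical helper: fixed-size character windows that share `chunkOverlap`
// characters with the previous window. Not the PR's actual implementation.
function splitWithOverlap(text: string, chunkSize: number, chunkOverlap: number): string[] {
  if (chunkSize <= 0) {
    throw new RangeError("chunkSize must be greater than 0");
  }
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError("chunkOverlap must be >= 0 and less than chunkSize");
  }

  const step = chunkSize - chunkOverlap; // how far each window advances
  const chunks: string[] = [];

  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current window already reaches the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

With `chunkSize = 4` and `chunkOverlap = 2`, each chunk repeats the last two characters of the one before it, so context that straddles a boundary still appears whole in at least one chunk.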

fixes #6

Notes for reviewers

  • The core build may have pre-existing environment issues, but the new test-integration script verifies that the types and exports work correctly.
  • A new dependency on openai was added to packages/documents.

Summary by cubic

Adds document chunking and embeddings via a new @voltagent/documents package, and integrates ingestion into core DocumentRetriever to enable RAG workflows. Fixes #6.

  • New Features

    • New @voltagent/documents package with RecursiveCharacterTextSplitter, OpenAIEmbeddingModel, and DocumentProcessor.
    • Core retriever adds ingest(), default retrieve that embeds queries, empty-array safety, and abstract hooks: upsertDocuments() and queryVectors(). Also exports ProcessedDocument type.
    • README and tests for documents utilities.
  • Migration

    • Implement upsertDocuments() and queryVectors() in your retriever to store and search vectors.
    • Set OPENAI_API_KEY for embeddings; optionally configure model and chunk sizes (chunkOverlap must be less than chunkSize).
    • Ingest with processor.process(text, metadata) or retriever.ingest(text, metadata).
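
As a rough illustration of the migration step, the two hooks could be backed by an in-memory store like the one below. This is a sketch under assumptions: the `StoredDocument` shape, the `topK` parameter, and the method signatures are invented for the example and may not match the abstract methods in @voltagent/core.

```typescript
// Illustrative shape only; the PR's ProcessedDocument type may differ.
interface StoredDocument {
  text: string;
  embedding: number[];
  metadata?: Record<string, unknown>;
}

// Cosine similarity over the shared prefix of two vectors; 0 if either has zero norm.
function cosineSimilarity(a: number[], b: number[]): number {
  const n = Math.min(a.length, b.length);
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < n; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Return the topK documents most similar to the query vector, best first.
function rankBySimilarity(docs: StoredDocument[], query: number[], topK: number): StoredDocument[] {
  return docs
    .map((doc) => ({ doc, score: cosineSimilarity(query, doc.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((entry) => entry.doc);
}

// Hypothetical stand-in for a retriever's storage hooks.
class InMemoryVectorStore {
  private readonly docs: StoredDocument[] = [];

  // Append-only for brevity; a real upsert would replace entries by a stable id.
  async upsertDocuments(documents: StoredDocument[]): Promise<void> {
    this.docs.push(...documents);
  }

  async queryVectors(vector: number[], topK = 4): Promise<StoredDocument[]> {
    return rankBySimilarity(this.docs, vector, topK);
  }
}
```

A production retriever would delegate both methods to a vector database; the linear scan here only keeps the example self-contained.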

Written for commit 112194c. Summary will update on new commits.

changeset-bot bot commented Dec 26, 2025

🦋 Changeset detected

Latest commit: 112194c

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 2 packages
Name Type
@voltagent/documents Minor
@voltagent/core Minor

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

5 issues found across 17 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="packages/documents/package.json">

<violation number="1" location="packages/documents/package.json:35">
P1: Version mismatch: vitest `^1.0.0` conflicts with project-level `^3.2.4`. In a monorepo with syncpack, dependency versions should be consistent across packages to avoid compatibility issues.</violation>

<violation number="2" location="packages/documents/package.json:36">
P2: Version mismatch: @types/node `^20.0.0` conflicts with project-level `^24.2.1`. Consider aligning with the monorepo's standard version for consistent type definitions.</violation>
</file>

<file name="packages/documents/src/DocumentProcessor.ts">

<violation number="1" location="packages/documents/src/DocumentProcessor.ts:23">
P2: Missing validation that `embeddings.length` matches `chunks.length`. If the embedding model returns fewer embeddings than expected, `embeddings[index]` could be `undefined`, causing silent data corruption in the returned `ProcessedDocument[]`. Consider adding a length validation check after fetching embeddings.</violation>
</file>
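
One possible shape for the length check suggested above, as a sketch only. The `EmbeddedChunk` type and the `pairChunksWithEmbeddings` helper are hypothetical names; the actual DocumentProcessor code is not shown in this thread.

```typescript
// Hypothetical result type; stands in for the PR's ProcessedDocument.
interface EmbeddedChunk {
  text: string;
  embedding: number[];
}

// Pair each chunk with its embedding, failing loudly on a count mismatch
// instead of silently emitting documents with an undefined embedding.
function pairChunksWithEmbeddings(chunks: string[], embeddings: number[][]): EmbeddedChunk[] {
  if (embeddings.length !== chunks.length) {
    throw new Error(
      `Embedding count mismatch: expected ${chunks.length}, got ${embeddings.length}`,
    );
  }
  return chunks.map((text, index) => {
    const embedding = embeddings[index];
    if (embedding === undefined) {
      // Unreachable after the length check unless the array is sparse.
      throw new Error(`Missing embedding for chunk ${index}`);
    }
    return { text, embedding };
  });
}
```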

<file name="packages/documents/src/text-splitters/TextSplitter.ts">

<violation number="1" location="packages/documents/src/text-splitters/TextSplitter.ts:14">
P2: Missing validation for positive values. The validation checks that `chunkOverlap < chunkSize`, but doesn't ensure `chunkSize > 0` and `chunkOverlap >= 0`. This allows invalid configurations like zero or negative values to pass.</violation>
</file>
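
The three conditions named above could be checked together along these lines. `validateChunkOptions` is a hypothetical standalone helper, assumed for illustration; in the PR the checks would presumably live in the TextSplitter constructor.

```typescript
// Hypothetical validator covering all three constraints from the review comment.
function validateChunkOptions(chunkSize: number, chunkOverlap: number): void {
  if (!Number.isFinite(chunkSize) || chunkSize <= 0) {
    throw new RangeError(`chunkSize must be greater than 0, got ${chunkSize}`);
  }
  if (!Number.isFinite(chunkOverlap) || chunkOverlap < 0) {
    throw new RangeError(`chunkOverlap must be >= 0, got ${chunkOverlap}`);
  }
  if (chunkOverlap >= chunkSize) {
    throw new RangeError(
      `chunkOverlap (${chunkOverlap}) must be less than chunkSize (${chunkSize})`,
    );
  }
}
```

The `Number.isFinite` checks also reject `NaN` and `Infinity`, which a bare comparison would let through in some cases.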

<file name="packages/core/src/retriever/document-retriever.ts">

<violation number="1" location="packages/core/src/retriever/document-retriever.ts:56">
P1: Accessing `input[input.length - 1]` will throw a TypeError if `input` is an empty array. Add a guard to handle this edge case.</violation>
</file>
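
A note on the failure mode: in JavaScript, indexing an empty array returns `undefined` rather than throwing, so the TypeError would come from the next property access on that result. A guard might look like the sketch below; the `ChatMessage` shape, the `extractQuery` name, and the choice to return an empty string are all assumptions, since the retriever source is not shown here.

```typescript
// Hypothetical message shape for illustration.
interface ChatMessage {
  role: string;
  content: string;
}

// Derive the query text from either a raw string or a message list,
// returning "" when there is no message to read from.
function extractQuery(input: string | ChatMessage[]): string {
  if (typeof input === "string") {
    return input;
  }
  const last = input[input.length - 1]; // undefined when input is empty
  if (last === undefined) {
    return "";
  }
  return last.content;
}
```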

Member

@omeraplak omeraplak left a comment

Hey, thanks, the PR looks great!
I actually started working on the packages/rag package a few weeks ago. Do you think we should include this package as well?

Also, would you like to add it to the docs here?
https://voltagent.dev/docs/rag/overview/

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

2 issues found across 1 file (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="website/docs/rag/overview.md">

<violation number="1" location="website/docs/rag/overview.md:257">
P1: Incorrect API usage in documentation example. `RecursiveChunker` constructor accepts an optional `Tokenizer`, not options. Options like `maxTokens` and `overlapTokens` should be passed to the `chunk()` method. Also, `chunk()` is synchronous, not async.</violation>

<violation number="2" location="website/docs/rag/overview.md:264">
P1: Incorrect API usage in documentation example. `MarkdownChunker` constructor accepts an optional `Tokenizer`, not options. Options like `maxTokens` should be passed to the `chunk()` method. Also, `chunk()` is synchronous, not async.</violation>
</file>

UmeshpJadhav and others added 2 commits December 30, 2025 18:09
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
@UmeshpJadhav
Author

@omeraplak
Thanks! I've updated the documentation in docs/rag/overview to include a new "Advanced Chunking" section that highlights the @voltagent/rag package and its capabilities.

I've also resolved the recent merge conflicts and fixed the linting issues. The PR should be good to go now!

Development

Successfully merging this pull request may close these issues.

Document Chunking & Embedding Utilities

2 participants