-
Notifications
You must be signed in to change notification settings - Fork 841
Introduce HeaderChunker #6979
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce HeaderChunker #6979
Conversation
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/ElementsChunker.cs
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements the HeaderChunker class for splitting documents into chunks based on headers while preserving context, along with supporting infrastructure for token-based chunking. The chunker respects token limits and handles various document elements including paragraphs, tables, and images.
- Adds
HeaderChunkerandElementsChunkerfor document chunking with tokenization support - Introduces
IngestionChunkerOptionsfor configurable chunk sizes and overlap - Adds validation for
IngestionDocumentHeader.Levelproperty (0-10 range)
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/HeaderChunker.cs |
New chunker implementation that splits documents by headers and delegates element chunking |
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/ElementsChunker.cs |
Internal chunker that handles token-based splitting of document elements including tables and paragraphs |
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/IngestionChunkerOptions.cs |
Configuration class for chunker behavior with validation logic |
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/ValueStringBuilder.cs |
Copied internal type from dotnet/runtime for efficient string building with array pooling |
src/Libraries/Microsoft.Extensions.DataIngestion.Abstractions/IngestionDocumentElement.cs |
Adds validation to IngestionDocumentHeader.Level property |
src/Libraries/Microsoft.Extensions.DataIngestion/Microsoft.Extensions.DataIngestion.csproj |
Adds dependency on Microsoft.ML.Tokenizers package |
test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Utils/IAsyncEnumerableExtensions.cs |
Adds helper method ToListAsync for test assertions |
test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/HeaderChunkerTests.cs |
Comprehensive tests for HeaderChunker including edge cases for tables and token limits |
test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/ChunkerOptionsTests.cs |
Tests for IngestionChunkerOptions validation logic |
test/Libraries/Microsoft.Extensions.DataIngestion.Tests/IngestionDocumentTests.cs |
Adds test for header level validation |
test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Microsoft.Extensions.DataIngestion.Tests.csproj |
Adds test dependency on Microsoft.ML.Tokenizers.Data.Cl100kBase |
eng/packages/TestOnly.props |
Registers package version for tokenizer test data |
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/IngestionChunkerOptions.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion.Abstractions/IngestionDocumentElement.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/ElementsChunker.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/ElementsChunker.cs
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/HeaderChunker.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/HeaderChunker.cs
Outdated
Show resolved
Hide resolved
test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Utils/IAsyncEnumerableExtensions.cs
Show resolved
Hide resolved
test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/HeaderChunkerTests.cs
Outdated
Show resolved
Hide resolved
test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/ChunkerOptionsTests.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/IngestionChunkerOptions.cs
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/ElementsChunker.cs
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.
src/Libraries/Microsoft.Extensions.DataIngestion.Abstractions/IngestionDocumentElement.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/IngestionChunkerOptions.cs
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/ElementsChunker.cs
Show resolved
Hide resolved
- use raw string literals as indentation is controlled by positioning the closing quotes - don't allow for headers with level == 0
|
@stephentoub I believe I have addressed all your feedback. Could you PTAL? |
Microsoft Reviewers: Open in CodeFlow