-
Notifications
You must be signed in to change notification settings - Fork 841
[MEDI] Pipeline #6993
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MEDI] Pipeline #6993
Conversation
src/Libraries/Microsoft.Extensions.DataIngestion/IngestionPipelineOptions.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds telemetry, logging, and error handling capabilities to the data ingestion pipeline. The key enhancement is a new IngestionPipeline class that orchestrates the document reading, chunking, and writing processes with comprehensive observability support.
- Introduces
IngestionPipeline<T>with Activity tracing and logging support - Adds configurable error handling through
IngestionPipelineOptions - Reorganizes test utilities and namespaces for better organization
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| IngestionPipeline.cs | New pipeline orchestrator with telemetry and error handling |
| IngestionPipelineOptions.cs | Configuration options for pipeline error handling and activity source naming |
| Log.cs | Logging message definitions using source generator pattern |
| DiagnosticsConstants.cs | Activity names and tag constants for telemetry |
| Microsoft.Extensions.DataIngestion.csproj | Adds logging and diagnostics dependencies |
| IngestionPipelineTests.cs | Comprehensive tests for pipeline processing and error scenarios |
| TestReader.cs | New test helper for simulating custom reader behavior |
| TestEmbeddingGenerator.cs | Moved to base namespace for reuse across test namespaces |
| VectorStoreWriterTests.cs | Namespace changed to align with folder structure |
| SqliteVectorStoreWriterTests.cs | Namespace changed to align with folder structure |
| InMemoryVectorStoreWriterTests.cs | Namespace changed to align with folder structure |
| Microsoft.Extensions.DataIngestion.Tests.csproj | Adds OpenTelemetry dependency for telemetry testing |
Comments suppressed due to low confidence (1)
test/Libraries/Microsoft.Extensions.DataIngestion.Tests/IngestionPipelineTests.cs:1
- The condition uses
>=but should use>when comparing withMaximumErrorsPerProcessing. With the current logic, ifMaximumErrorsPerProcessingis set to 3, processing will stop after 3 errors (indices 0,1,2), but the documentation states 'maximum number of ingestions that are allowed to fail', which should allow 3 failures before throwing on the 4th.
// Licensed to the .NET Foundation under one or more agreements.
test/Libraries/Microsoft.Extensions.DataIngestion.Tests/IngestionPipelineTests.cs
Outdated
Show resolved
Hide resolved
test/Libraries/Microsoft.Extensions.DataIngestion.Tests/IngestionPipelineTests.cs
Outdated
Show resolved
Hide resolved
test/Libraries/Microsoft.Extensions.DataIngestion.Tests/IngestionPipelineTests.cs
Outdated
Show resolved
Hide resolved
test/Libraries/Microsoft.Extensions.DataIngestion.Tests/IngestionPipelineTests.cs
Outdated
Show resolved
Hide resolved
test/Libraries/Microsoft.Extensions.DataIngestion.Tests/IngestionPipelineTests.cs
Outdated
Show resolved
Hide resolved
test/Libraries/Microsoft.Extensions.DataIngestion.Tests/IngestionPipelineTests.cs
Outdated
Show resolved
Hide resolved
test/Libraries/Microsoft.Extensions.DataIngestion.Tests/IngestionPipelineTests.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.
src/Libraries/Microsoft.Extensions.DataIngestion/IngestionPipeline.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/IngestionPipeline.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/IngestionPipeline.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/IngestionPipeline.cs
Outdated
Show resolved
Hide resolved
# Conflicts: # src/Libraries/Microsoft.Extensions.DataIngestion/Microsoft.Extensions.DataIngestion.csproj
src/Libraries/Microsoft.Extensions.DataIngestion/IngestionPipeline.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/DiagnosticsConstants.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/IngestionPipeline.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/IngestionPipeline.cs
Outdated
Show resolved
Hide resolved
…g a single DocumentProcessors make Pipeline return a collection of the results, so the users can handle everything as they want to
The Pipeline itself is rather simple. The tricky part is exception handling: should the pipeline fail when a processing of a single file has failed? Initially I wanted to introduce something like
ContinueOnError, but when I search the repo and foundMaximumConsecutiveErrorsPerRequestand got inspired with that idea (for better or worse).Microsoft Reviewers: Open in CodeFlow