[MEDI] Don't stop document processing on enricher error #7005
Pull Request Overview
This PR refactors AI-powered data ingestion enrichers to support batch processing and use structured outputs. The changes improve performance by processing multiple chunks in a single API call and enhance type safety by using the GetResponseAsync<T> generic method for structured JSON responses.
Key changes:
- Introduces `EnricherOptions` class to encapsulate chat client configuration and batch size settings
- Refactors all enrichers (Summary, Sentiment, Keyword, Classification, ImageAlternativeText) to process items in batches instead of one-by-one
- Adopts structured output pattern using `Envelope<T>` wrapper for JSON deserialization
- Removes response validation logic that checked for specific formats (e.g., semicolon-separated keywords, predefined sentiment values)
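To illustrate the structured output pattern mentioned above, here is a minimal sketch of deserializing a batched response through a wrapper type. The shape of `Envelope<T>` (a single `data` property), the `EnvelopeDemo.ParseBatch` helper, and the count check are assumptions made for illustration; the PR text does not show the actual type or how the enrichers consume it.

```csharp
#nullable enable
using System;
using System.Text.Json;

// Hypothetical wrapper type: the single "data" property is an assumption,
// not something the PR description confirms.
public sealed class Envelope<T>
{
    public T? data { get; set; }
}

public static class EnvelopeDemo
{
    // Parses a batched response such as {"data":["summary 1","summary 2"]}
    // and checks that it holds one result per input chunk (illustrative check).
    public static string[] ParseBatch(string json, int expectedCount)
    {
        Envelope<string[]>? envelope = JsonSerializer.Deserialize<Envelope<string[]>>(json);
        string[] results = envelope?.data ?? Array.Empty<string>();

        if (results.Length != expectedCount)
        {
            throw new InvalidOperationException(
                $"Expected {expectedCount} results but received {results.Length}.");
        }

        return results;
    }
}
```

With a wrapper like this, one response can carry results for a whole batch, and each result can be matched back to its chunk by position.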
Reviewed Changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| EnricherOptions.cs | New options class consolidating chat client, chat options, and batch size configuration |
| BufferOperator.cs | Utility for buffering async enumerables into batches |
| Envelope{T}.cs | Test helper for JSON deserialization of structured AI responses |
| SummaryEnricher.cs | Refactored to use EnricherOptions and batch processing |
| SentimentEnricher.cs | Refactored to use EnricherOptions and batch processing |
| KeywordEnricher.cs | Refactored to use EnricherOptions and batch processing; removed delimiter validation |
| ClassificationEnricher.cs | Refactored to use EnricherOptions and batch processing; removed comma validation |
| ImageAlternativeTextEnricher.cs | Refactored to use EnricherOptions and batch processing for images |
| SummaryEnricherTests.cs | Updated tests to use new API and structured output format |
| SentimentEnricherTests.cs | Updated tests to use new API; removed invalid response test |
| KeywordEnricherTests.cs | Updated tests to use new API; removed invalid response and illegal character tests |
| ClassificationEnricherTests.cs | Updated tests to use new API; removed invalid response and comma validation tests |
| AlternativeTextEnricherTests.cs | Updated tests to use new API; added batch size test coverage |
| Microsoft.Extensions.DataIngestion.csproj | Added dependency on Microsoft.Extensions.AI project |
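The table describes `BufferOperator.cs` as a utility for buffering async enumerables into batches. A rough sketch of that technique follows; `BufferSketch.BufferAsync` is a hypothetical name and signature, not the code from the PR.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BufferSketch
{
    // Groups items from an async sequence into lists of at most batchSize items.
    // Hypothetical helper; because this is an iterator, the argument check
    // runs when enumeration starts, not when the method is called.
    public static async IAsyncEnumerable<IReadOnlyList<T>> BufferAsync<T>(
        IAsyncEnumerable<T> source, int batchSize)
    {
        if (batchSize <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(batchSize));
        }

        var batch = new List<T>(batchSize);
        await foreach (T item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize); // start a fresh list for the next batch
            }
        }

        if (batch.Count > 0)
        {
            yield return batch; // final, possibly smaller, batch
        }
    }
}
```

Each yielded batch could then be sent to the chat client in a single call, which is where the performance gain described in the overview would come from.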
# Conflicts: # src/Libraries/Microsoft.Extensions.DataIngestion/Microsoft.Extensions.DataIngestion.csproj
Pull Request Overview
Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.
* introduce EnricherOptions option bag
* implement batching
* don't validate results returned by IChatClient
* don't expose FileInfo as source via IngestionResult, as it could be Stream in the future. Just expose the document id
* Enricher failures should not fail the whole ingestion pipeline, as they are best-effort enhancements
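The last point, treating enrichment as best-effort, can be sketched as follows. This is only an illustration under assumptions: `BestEffortSketch.EnrichBestEffortAsync`, its delegate parameters, and the choice to pass the batch through unenriched on failure are hypothetical, and the PR text does not say how the actual pipeline reports or recovers from enricher errors.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BestEffortSketch
{
    // Applies an enrichment step to each batch. If the step throws, the error
    // is reported through onError and the batch is passed through unenriched,
    // so later batches are still processed. Cancellation is not swallowed.
    public static async IAsyncEnumerable<IReadOnlyList<string>> EnrichBestEffortAsync(
        IEnumerable<IReadOnlyList<string>> batches,
        Func<IReadOnlyList<string>, Task<IReadOnlyList<string>>> enrich,
        Action<Exception> onError)
    {
        foreach (IReadOnlyList<string> batch in batches)
        {
            IReadOnlyList<string> result;
            try
            {
                result = await enrich(batch);
            }
            catch (Exception ex) when (ex is not OperationCanceledException)
            {
                onError(ex);
                result = batch; // fall back to the original, unenriched items
            }

            // The yield sits outside the try block on purpose.
            yield return result;
        }
    }
}
```

The exception filter lets `OperationCanceledException` propagate, on the assumption that a cancelled ingestion run should stop rather than continue without enrichment.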
The following PRs are included in this backport:
- [MEDI] start producing NuGet packages (dotnet/extensions/dotnet#7016)
- Update version numbers in AI changelogs (dotnet/extensions/dotnet#7008)
- [MEDI] Don't stop document processing on enricher error (dotnet/extensions/dotnet#7005)
- [MEDI] add PackageTags (dotnet/extensions/dotnet#7022)
- Add MarkItDownMcpReader for MCP server support (dotnet/extensions/dotnet#7025)
- Image generation tool (dotnet/extensions/dotnet#6749)
- Make MEAI packages use 10.0 runtime packages (dotnet/extensions/dotnet#7028)

----

#### AI description (iteration 1)

#### PR Classification

This pull request backports multiple MEAI library updates, including new image generation features, refactoring of data ingestion enrichers, removal of legacy exporter code, and updated OpenTelemetry instrumentation.

#### PR Summary

The changes integrate new image generation tool support into chat clients with corresponding types and integration tests, refactor data ingestion enrichers to use a unified `EnricherOptions` abstraction with batching, and remove outdated JSON schema exporter and nullability helper files while updating OpenTelemetry metrics and project metadata.

- `src/Libraries/Microsoft.Extensions.AI`: Added new types (`HostedImageGenerationTool.cs`, `ImageGenerationToolCallContent.cs`, `ImageGenerationToolResultContent.cs`) and integration tests to enable hosted image generation across AI providers.
- `src/Libraries/Microsoft.Extensions.DataIngestion`: Refactored enrichers (Sentiment, Keyword, Classification, Summary) to use the new `EnricherOptions` and batching via the `Batching.cs` utility, with updated tests.
- Removed legacy schema exporter files (e.g. files under `src/Shared/JsonSchemaExporter/` and `NullabilityInfoContext/`) to clean up unused functionality.
- Updated OpenTelemetry instrumentation in OpenAI, Azure AI, Embedding, and SpeechToText clients to align with the latest semantic conventions.
- Revised project and package configuration files with updated metadata, preview stage tags, and code quality settings.
This PR implements two changes:
fixes #6983
fixes #6984