
Conversation

@adamsitnik (Member) commented Nov 3, 2025

This PR implements two changes:

  • batching of LLM requests
  • logging and ignoring enricher failures, so a single failure does not stop document ingestion (see the sketch below)

fixes #6983
fixes #6984
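
The error-handling change is easiest to see as a sketch. The helper below is illustrative only — the method name, the delegate, and the logger usage are assumptions, not the library's actual API — but it shows the intended "log and keep going" behavior:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;

public static class BestEffortEnrichment
{
    // Applies an enricher callback to each document; a failure is logged and skipped
    // so the remaining documents are still ingested.
    public static async Task EnrichAsync<TDocument>(
        IEnumerable<TDocument> documents,
        Func<TDocument, CancellationToken, Task> applyEnricherAsync,
        ILogger logger,
        CancellationToken cancellationToken = default)
    {
        foreach (TDocument document in documents)
        {
            try
            {
                await applyEnricherAsync(document, cancellationToken);
            }
            catch (Exception ex)
            {
                // Best-effort enhancement: record the failure and keep processing.
                logger.LogWarning(ex, "Enricher failed for a document; continuing with the rest.");
            }
        }
    }
}
```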

Copilot AI review requested due to automatic review settings November 3, 2025 16:40
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR refactors AI-powered data ingestion enrichers to support batch processing and use structured outputs. The changes improve performance by processing multiple chunks in a single API call and enhance type safety by using the GetResponseAsync<T> generic method for structured JSON responses.

Key changes:

  • Introduces EnricherOptions class to encapsulate chat client configuration and batch size settings
  • Refactors all enrichers (Summary, Sentiment, Keyword, Classification, ImageAlternativeText) to process items in batches instead of one-by-one
  • Adopts a structured output pattern using an Envelope<T> wrapper for JSON deserialization (see the sketch after this list)
  • Removes response validation logic that checked for specific formats (e.g., semicolon-separated keywords, predefined sentiment values)
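
The structured-output change can be illustrated with a short, hedged sketch. The Envelope<T> shape, its Items property, and the prompt below are assumptions based on this summary rather than the repository's actual code; GetResponseAsync<T> is the Microsoft.Extensions.AI extension method the PR adopts:

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;

// Assumed wrapper so the model returns an object ({"items": [...]}) instead of a bare JSON array.
public sealed class Envelope<T>
{
    public T[] Items { get; set; } = [];
}

public static class StructuredOutputExample
{
    public static async Task<string[]> SummarizeBatchAsync(
        IChatClient chatClient,
        IReadOnlyList<string> chunks,
        CancellationToken cancellationToken = default)
    {
        string prompt =
            "Summarize each of the following chunks, returning one summary per chunk in order:\n---\n" +
            string.Join("\n---\n", chunks);

        // GetResponseAsync<T> requests JSON matching the schema of Envelope<string> and
        // deserializes it, removing the need for hand-rolled parsing or format validation.
        ChatResponse<Envelope<string>> response =
            await chatClient.GetResponseAsync<Envelope<string>>(prompt, cancellationToken: cancellationToken);

        return response.Result.Items;
    }
}
```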

Reviewed Changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| EnricherOptions.cs | New options class consolidating chat client, chat options, and batch size configuration |
| BufferOperator.cs | Utility for buffering async enumerables into batches (see the sketch after this table) |
| Envelope{T}.cs | Test helper for JSON deserialization of structured AI responses |
| SummaryEnricher.cs | Refactored to use EnricherOptions and batch processing |
| SentimentEnricher.cs | Refactored to use EnricherOptions and batch processing |
| KeywordEnricher.cs | Refactored to use EnricherOptions and batch processing; removed delimiter validation |
| ClassificationEnricher.cs | Refactored to use EnricherOptions and batch processing; removed comma validation |
| ImageAlternativeTextEnricher.cs | Refactored to use EnricherOptions and batch processing for images |
| SummaryEnricherTests.cs | Updated tests to use the new API and structured output format |
| SentimentEnricherTests.cs | Updated tests to use the new API; removed invalid response test |
| KeywordEnricherTests.cs | Updated tests to use the new API; removed invalid response and illegal character tests |
| ClassificationEnricherTests.cs | Updated tests to use the new API; removed invalid response and comma validation tests |
| AlternativeTextEnricherTests.cs | Updated tests to use the new API; added batch size test coverage |
| Microsoft.Extensions.DataIngestion.csproj | Added dependency on the Microsoft.Extensions.AI project |
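
For context, a buffering utility in the spirit of BufferOperator.cs could look like the following sketch; the BufferAsync name and exact behavior are assumptions, not the file's actual contents:

```csharp
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

public static class BufferingExample
{
    // Groups an async sequence into batches of at most 'batchSize' items so that each
    // batch can be sent to the chat client as a single request.
    public static async IAsyncEnumerable<IReadOnlyList<T>> BufferAsync<T>(
        this IAsyncEnumerable<T> source,
        int batchSize,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        var batch = new List<T>(batchSize);

        await foreach (T item in source.WithCancellation(cancellationToken))
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize);
            }
        }

        if (batch.Count > 0)
        {
            yield return batch; // flush the final, possibly smaller, batch
        }
    }
}
```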

# Conflicts:
#	src/Libraries/Microsoft.Extensions.DataIngestion/Microsoft.Extensions.DataIngestion.csproj
don't expose FileInfo as source via IngestionResult, as it could be Stream in the future. Just expose the document id
@adamsitnik adamsitnik changed the title from "[MEDI] Don't validate results returned by IChatClient" to "[MEDI] Don't stop document processing on enricher error" Nov 5, 2025
@adamsitnik adamsitnik requested a review from Copilot November 5, 2025 18:31
Copilot AI (Contributor) left a comment

Pull Request Overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.

@adamsitnik adamsitnik enabled auto-merge (squash) November 5, 2025 20:19
@adamsitnik adamsitnik merged commit ca4fc52 into dotnet:main Nov 6, 2025
11 of 12 checks passed
@adamsitnik adamsitnik deleted the dontValidate branch November 6, 2025 17:36
joperezr pushed a commit to joperezr/extensions that referenced this pull request Nov 11, 2025
* introduce EnricherOptions option bag

* implement batching

* don't validate results returned by IChatClient

* don't expose FileInfo as source via IngestionResult, as it could be Stream in the future. Just expose the document id

* Enricher failures should not fail the whole ingestion pipeline, as they are best-effort enhancements
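
As a rough illustration of the "EnricherOptions option bag" commit above, such an options class might look like this; the members and the default batch size are assumptions based on the PR summary rather than the real type:

```csharp
using System;
using Microsoft.Extensions.AI;

// Assumed shape of the options bag consolidating chat client, chat options, and batch size.
public sealed class EnricherOptions
{
    public EnricherOptions(IChatClient chatClient)
        => ChatClient = chatClient ?? throw new ArgumentNullException(nameof(chatClient));

    // Client used to issue the batched LLM requests.
    public IChatClient ChatClient { get; }

    // Optional per-request settings (model, temperature, ...) forwarded to the client.
    public ChatOptions? ChatOptions { get; set; }

    // Number of chunks sent to the model in a single request (the default here is illustrative).
    public int BatchSize { get; set; } = 8;
}
```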
joperezr pushed a commit to joperezr/extensions that referenced this pull request Nov 11, 2025
The following PRs are included in this backport:

- [MEDI] start producing NuGet packages (dotnet/extensions/dotnet#7016)
- Update version numbers in AI changelogs (dotnet/extensions/dotnet#7008)
- [MEDI] Don't stop document processing on enricher error (dotnet/extensions/dotnet#7005)
- [MEDI] add PackageTags (dotnet/extensions/dotnet#7022)
- Add MarkItDownMcpReader for MCP server support (dotnet/extensions/dotnet#7025)
- Image generation tool (dotnet/extensions/dotnet#6749)
- Make MEAI packages use 10.0 runtime packages (dotnet/extensions/dotnet#7028)

----
#### AI description  (iteration 1)
#### PR Classification
This pull request backports multiple MEAI library updates, including new image generation features, refactoring of data ingestion enrichers, removal of legacy exporter code, and updated OpenTelemetry instrumentation.

#### PR Summary
The changes integrate new image generation tool support into chat clients with corresponding types and integration tests, refactor data ingestion enrichers to use a unified `EnricherOptions` abstraction with batching, and remove outdated JSON schema exporter and nullability helper files while updating OpenTelemetry metrics and project metadata.
- `src/Libraries/Microsoft.Extensions.AI`: Added new types (`HostedImageGenerationTool.cs`, `ImageGenerationToolCallContent.cs`, `ImageGenerationToolResultContent.cs`) and integration tests to enable hosted image generation across AI providers.
- `src/Libraries/Microsoft.Extensions.DataIngestion`: Refactored enrichers (Sentiment, Keyword, Classification, Summary) to use the new `EnricherOptions` and batching via the `Batching.cs` utility, with updated tests.
- Removed legacy schema exporter files (e.g. files under `src/Shared/JsonSchemaExporter/` and `NullabilityInfoContext/`) to clean up unused functionality.
- Updated OpenTelemetry instrumentation in OpenAI, Azure AI, Embedding, and SpeechToText clients to align with the latest semantic conventions.
- Revised project and package configuration files with updated metadata, preview stage tags, and code quality settings.


Development

Successfully merging this pull request may close these issues.

[MEDI] Implement batching for enrichers
[MEDI] Establish pattern for handling LLM failures
