@MackinnonBuck
Member

@MackinnonBuck MackinnonBuck commented Nov 6, 2025

This PR makes the following changes to the chat template:

  1. Adds support for Markdown documents
    • Replaces Example_GPS_Watch.pdf with its markdown equivalent
    • Adds a Markdown viewer to view citations
  2. Removes the dependency on PdfPig
    • ...but now has an implicit dependency on markitdown

@github-actions github-actions bot added the area-ai-templates Microsoft.Extensions.AI.Templates label Nov 6, 2025
@MackinnonBuck
Member Author

Marking as ready for review to get some eyes on this. Note that there are still pending improvements:

  1. Better UX for document ingestion
  2. Possibly re-introducing PdfPig dependency to replace the markitdown dependency (see Use Microsoft.Extensions.DataIngestion in AI Chat Web template #7023 (comment))

@MackinnonBuck MackinnonBuck marked this pull request as ready for review November 7, 2025 00:40
@MackinnonBuck MackinnonBuck requested a review from a team as a code owner November 7, 2025 00:40
Copilot AI review requested due to automatic review settings November 7, 2025 00:40
Contributor

Copilot AI left a comment

Pull Request Overview

This PR modernizes the AI chat web templates by replacing the custom PDF ingestion pipeline (using PdfPig) with the new Microsoft.Extensions.DataIngestion library suite. The changes enable support for both Markdown and PDF document formats while simplifying the ingestion architecture.

Key Changes

  • Replaced custom PDFDirectorySource and IIngestionSource with the standardized Microsoft.Extensions.DataIngestion APIs
  • Removed IngestedDocument tracking class as document versioning is now handled by the ingestion pipeline
  • Added Markdown viewer support (viewer.html and viewer.mjs) for rendering .md files
  • Updated citation format to remove page numbers, now supporting document-level citations
  • Changed ingestion trigger from startup to lazy initialization on first search request
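The lazy-initialization change in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the template's actual `SemanticSearch.cs`: the `LazySearch` class, the `ingest` delegate, and the placeholder search result are all hypothetical names introduced here. It assumes an `_initialized` flag guarded by a `SemaphoreSlim`, so that concurrent first requests trigger ingestion only once.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a search service that ingests documents lazily.
public sealed class LazySearch
{
    private readonly Func<CancellationToken, Task> _ingest; // assumed ingestion callback
    private readonly SemaphoreSlim _gate = new(1, 1);
    private volatile bool _initialized;

    public LazySearch(Func<CancellationToken, Task> ingest) => _ingest = ingest;

    public async Task<IReadOnlyList<string>> SearchAsync(string query, CancellationToken ct = default)
    {
        await EnsureInitializedAsync(ct);

        // A real implementation would query the vector store here.
        return new[] { $"results for '{query}'" };
    }

    private async Task EnsureInitializedAsync(CancellationToken ct)
    {
        if (_initialized)
        {
            return; // fast path once ingestion has completed
        }

        await _gate.WaitAsync(ct);
        try
        {
            if (!_initialized) // re-check while holding the gate
            {
                await _ingest(ct);
                _initialized = true; // set only after ingestion succeeds
            }
        }
        finally
        {
            _gate.Release();
        }
    }
}
```

Setting the flag only after the ingestion call returns means a failed first attempt is retried on the next search rather than leaving the service permanently uninitialized.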

Reviewed Changes

Copilot reviewed 94 out of 100 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| VectorStoreWriter.cs | Added workaround for QdrantVectorStore key type incompatibility using string name check |
| SemanticSearch.cs (all variants) | Added lazy ingestion on first search with _initialized flag |
| DocumentReader.cs | New custom reader supporting both Markdown and PDF via MarkdownReader and MarkItDownReader |
| DataIngestor.cs (all variants) | Simplified to use IngestionPipeline with SemanticSimilarityChunker |
| IngestedChunk.cs (all variants) | Changed Key type to Guid, made constants public, added JSON serialization attributes |
| ChatCitation.razor | Added Markdown viewer support alongside existing PDF viewer |
| ChatMessageItem.razor | Removed page number from citation regex and data structure |
| Program.cs variants | Removed startup ingestion, added vector store registrations, changed DataIngestor to singleton |
| *.csproj.in | Replaced PdfPig with DataIngestion packages and ML.Tokenizers |
| THIRD-PARTY-NOTICES.TXT | Removed PdfPig license notice |
| GeneratedContent.targets | Updated package version variables |
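
To illustrate the document-level citation format noted for `ChatMessageItem.razor`, here is a hedged sketch of a citation parser with no page number. The tag shape `<citation filename='...'>quote</citation>` is an assumption made for this example (the actual regex is not shown in this conversation), and `CitationParser` is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CitationParser
{
    // Assumed tag shape: <citation filename='file'>quote</citation>.
    // There is no page-number attribute, so each citation refers to the whole document.
    private static readonly Regex CitationRegex = new(
        @"<citation filename='(?<file>[^']*)'>(?<quote>.*?)</citation>",
        RegexOptions.Singleline | RegexOptions.Compiled);

    public static List<(string File, string Quote)> Parse(string text)
    {
        var citations = new List<(string File, string Quote)>();
        foreach (Match match in CitationRegex.Matches(text))
        {
            citations.Add((match.Groups["file"].Value, match.Groups["quote"].Value));
        }

        return citations;
    }
}
```

With this shape, the citation data structure carries only a file name and a quote, which matches the "document-level citations" described in the overview.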

Member

@jeffhandley jeffhandley left a comment

Looks great, @MackinnonBuck!

Member

@adamsitnik adamsitnik left a comment

Big thanks for a great contribution and detailed testing @MackinnonBuck !

Please enable tracing. This can be done by modifying:

with:

.AddSource("Experimental.Microsoft.Extensions.DataIngestion");
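
For context on what subscribing to a source by name does, the sketch below uses only `System.Diagnostics`. It is not the template's OpenTelemetry configuration: `TracingDemo`, the `ProcessDocument` operation name, and `Some.Other.Source` are hypothetical, and the `ActivitySource` created here merely stands in for the one the library would own. An `ActivityListener` filtered on the source name plays the role that `AddSource` plays for a tracer provider.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class TracingDemo
{
    // Source name taken from the review comment above.
    public const string SourceName = "Experimental.Microsoft.Extensions.DataIngestion";

    public static List<string> Capture()
    {
        var seen = new List<string>();

        // Listen only to the named source, similar in spirit to AddSource(SourceName).
        using var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == SourceName,
            Sample = (ref ActivityCreationOptions<ActivityContext> _) => ActivitySamplingResult.AllData,
            ActivityStopped = activity => seen.Add(activity.OperationName),
        };
        ActivitySource.AddActivityListener(listener);

        // Stand-in for the library's source; the operation name is hypothetical.
        using var ingestion = new ActivitySource(SourceName);
        using (ingestion.StartActivity("ProcessDocument")) { }

        // No listener matches this source, so StartActivity returns null and nothing is recorded.
        using var other = new ActivitySource("Some.Other.Source");
        using (other.StartActivity("Ignored")) { }

        return seen;
    }
}
```

Without a matching subscription, activities from the ingestion source are never created, which is why the source name has to be registered explicitly.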

@MackinnonBuck MackinnonBuck requested a review from a team as a code owner November 7, 2025 19:45
Member

@jeffhandley jeffhandley left a comment

Just one comment that I think needs to be resolved.

@jeffhandley jeffhandley enabled auto-merge (squash) November 10, 2025 08:16
@jeffhandley
Member

@MackinnonBuck FYI - I added some commits, including one that shows a message about documents being loaded. All tests are passing and I did a lot of end-to-end functional validation too. I've marked it to auto-merge when CI is green after my latest push.

Member

@adamsitnik adamsitnik left a comment

We are almost there, we just need to disable the incremental ingestion and remove the SK dependency from the PDF reader.

Member

@adamsitnik adamsitnik left a comment

@MackinnonBuck To save you time, I've addressed my feedback by pushing to your branch. Please perform the manual verification before merging (I don't know how to do it).

@jeffhandley jeffhandley disabled auto-merge November 10, 2025 16:58
@MackinnonBuck MackinnonBuck merged commit f38b9c0 into main Nov 10, 2025
6 checks passed
@MackinnonBuck MackinnonBuck deleted the mbuck/chat-template-data-ingestion branch November 10, 2025 23:54
joperezr pushed a commit to joperezr/extensions that referenced this pull request Nov 11, 2025
…net#7023)

* Add Markdown support

* Remove PDF support

* Revert "Remove PDF support"

This reverts commit e1d066034962c9686bf8150984b6adf0e25846c8.

* Add 'Example_GPS_Watch.md'

* Add MEDI dependencies

* Revert "[MEDI] Remove collection key type workaround (dotnet#7010)"

This reverts commit a369be9.

* MEDI integration into chat template

* Remove PdfPig dependency

* Fix citation + normalize identifier path

* Undo changes to `M.E.DI.csproj`

* Update snapshots

* Update DataIngestion unit tests to handle keys as either strings or guids

* Update SK and fix MEDI version

* Remove SK workaround

* Fix sandbox paths to allow running tests multiple times

* Reliable data ingestion

* Enable MEDI tracing

* Simplify log message

* Add `PdfPigReader` for non-Aspire template

* Invert PdfPigReader exclusion condition

* Use Markitdown MCP

* Update snapshots

* Undo changes to `IngestionPipelineTests.cs`

* Update src/ProjectTemplates/Microsoft.Extensions.AI.Templates/src/ChatWithCustomData/ChatWithCustomData-CSharp.Web/Services/Ingestion/DocumentReader.cs

Co-authored-by: Jeff Handley <jeffhandley@users.noreply.github.com>

* Update snapshots

* Improve template execution test failure output

* Support .NET 10 in aichatweb, using it by default

* Show a message when loading documents by loading docs as a separate tool

* disable the incremental ingestion

* map every PDF page to a single section

* drop SK dependency

* Add system prompt instructions for calling the LoadDocuments tool. Fix code formatting.

---------

Co-authored-by: Jeff Handley <jeffhandley@users.noreply.github.com>
Co-authored-by: Adam Sitnik <adam.sitnik@gmail.com>
joperezr pushed a commit to joperezr/extensions that referenced this pull request Nov 11, 2025
- Use `Microsoft.Extensions.DataIngestion` in AI Chat Web template (dotnet#7023)
- Add a new Microsoft.Agents.AI.Templates package with an aiagents-webapi project template (dotnet#7014)
- Add Agent Framework DevUI into the aiagent-webapi template (dotnet#7026)