Update prompts for quality evaluators #6328
Merged
Conversation
peterwald reviewed on Apr 24, 2025:
src/Libraries/Microsoft.Extensions.AI.Evaluation.Quality/Utilities/ParsingUtilities.cs
peterwald reviewed on Apr 24, 2025:
src/Libraries/Microsoft.Extensions.AI.Evaluation/ChatMessageExtensions.cs
peterwald reviewed on Apr 24, 2025:
src/Libraries/Microsoft.Extensions.AI.Evaluation/ChatMessageExtensions.cs
Force-pushed from 7872210 to e940059
Force-pushed from e940059 to 4197c87
peterwald approved these changes on Apr 25, 2025
Force-pushed from 7df91da to ea138d2
Introduces a Name property on EvaluationContext and changes the GetContents() method to a Contents property.
Force-pushed from ea138d2 to 23a592d
This was referenced May 5, 2025
The prompts for these evaluators now match those that are used within the corresponding Python evaluators in the Azure AI Python SDK.
This PR also:

- Removes `ChatConversationEvaluator` and `SingleNumericScoreEvaluator` from the Quality package. These types were becoming unwieldy, and we had to tweak their public API every time we introduced a new derived evaluator (each with slightly different requirements). One reason this proved brittle is that each derived type supplied its own prompt, while the base types tried to generalize shared aspects of the implementation such as input validation and LLM response parsing. These aspects started to diverge as the prompts used in the derived evaluators diverged (both when prompts for some of the existing evaluators were updated, and when new derived evaluators that used different styles of prompts were added to the mix). Preserving these base types as part of the public API would have added risk to our GA plans for the evaluation libraries (since we would no longer have the luxury of making similar tweaks to the public base type contract post GA), so we decided it would be best to remove them.

- Introduces a set of more granular helper extension methods that can be used when authoring evaluators (and updates all the evaluators in the Quality package to use these helpers instead of the base types above). These helper extension methods (added as part of the core `Microsoft.Extensions.AI.Evaluation` package) make it easy to accomplish common tasks such as rendering conversation history as part of an evaluation prompt, extracting the final user request from the conversation history, and recording metadata present in the evaluation `ChatResponse` as part of the `EvaluationMetric` produced by an evaluator. Evaluator-specific logic such as input validation and evaluation prompt construction is now handled individually within each Quality evaluator. Parsing of the evaluation responses is also handled via a couple of new extension method helpers; however, because the response parsing logic is specific to the prompts used within the Quality evaluators, these helpers have been added to the `Microsoft.Extensions.AI.Evaluation.Quality` package and are internal to that package.

- Introduces 3 new evaluators as part of the `Microsoft.Extensions.AI.Evaluation.Quality` package: `RetrievalEvaluator`, `RelevanceEvaluator` and `CompletenessEvaluator` (see the usage sketch after this list).

- Marks the `RelevanceTruthAndCompletenessEvaluator` as `Experimental` since it needs some updates (such as [AI Evaluation] Add grounding EvaluationContext for RelevanceTruthAndCompletenessEvaluator #6294), and since we now also have new dedicated evaluators for Relevance and Completeness.

- Removes `abstract` from `ContentHarmEvaluator`. It is now possible to instantiate `ContentHarmEvaluator` and use this instance to evaluate all the content harm metrics via a single request to the Azure AI Content Safety service.

- Introduces a `Name` property on `EvaluationContext` and replaces the `GetContents()` method with a `Contents` property (see the second sketch below).

- Simplifies how context is included in metrics. `EvaluationContext` objects can now be added directly to `EvaluationMetric`s. However, as before, only the properties on the base `EvaluationContext` type (i.e. `Name` and `Contents`) are included when serializing and deserializing metrics as part of the result storage / reporting functionality.

TODO:
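To make the shape of the new evaluators concrete, here is a minimal usage sketch (not taken from this PR). It assumes an `IChatClient` is available to run the evaluation prompts, and that `RelevanceEvaluator` follows the same conventions as the existing Quality evaluators (a parameterless constructor, a `RelevanceMetricName` constant, and a `NumericMetric` retrievable via `EvaluationResult.Get<T>`); treat the exact names and signatures as illustrative rather than definitive.

```csharp
using System;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;

// Placeholder: supply the IChatClient for the LLM that should run the evaluation prompts.
IChatClient chatClient = /* e.g. an Azure OpenAI chat client */ null!;
var chatConfiguration = new ChatConfiguration(chatClient);

// The conversation being evaluated: the user's request plus the model's response.
var messages = new[] { new ChatMessage(ChatRole.User, "How far is the Moon from the Earth?") };
var modelResponse = new ChatResponse(
    new ChatMessage(ChatRole.Assistant, "The Moon is about 384,400 km from the Earth on average."));

// One of the new evaluators introduced in this PR.
IEvaluator relevanceEvaluator = new RelevanceEvaluator();

EvaluationResult result =
    await relevanceEvaluator.EvaluateAsync(messages, modelResponse, chatConfiguration);

// Each evaluator reports one or more named metrics on the result.
NumericMetric relevance = result.Get<NumericMetric>(RelevanceEvaluator.RelevanceMetricName);
Console.WriteLine($"Relevance: {relevance.Value} ({relevance.Interpretation?.Rating})");
```

`CompletenessEvaluator` and `RetrievalEvaluator` follow the same pattern; `RetrievalEvaluator` would additionally need the retrieved context chunks supplied via the `additionalContext` parameter.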
Fixes #5952 and #6028
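Relatedly, a rough sketch of the reworked `EvaluationContext` shape described above. `GroundingContext` is a made-up example type, and the base-class constructor used here is an assumption; the point the PR makes is simply that a context now exposes a `Name` property and a `Contents` property instead of a `GetContents()` method.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;

// Contexts like this can be passed to an evaluator via the additionalContext parameter
// of EvaluateAsync, and can now also be attached directly to the EvaluationMetrics that
// an evaluator produces (only Name and Contents round-trip through result storage / reporting).
var grounding = new GroundingContext(new[] { "The Moon orbits the Earth at roughly 384,400 km." });
Console.WriteLine($"{grounding.Name}: {grounding.Contents.Count} content item(s)");

// Hypothetical example type; the base constructor shape shown here is assumed, not confirmed by the PR.
public sealed class GroundingContext : EvaluationContext
{
    public GroundingContext(IEnumerable<string> passages)
        : base("Grounding Context", passages.Select(p => (AIContent)new TextContent(p)).ToArray())
    {
    }
}
```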
Microsoft Reviewers: Open in CodeFlow