Update prompts for quality evaluators #6328
Merged
Conversation
peterwald reviewed on Apr 24, 2025:
src/Libraries/Microsoft.Extensions.AI.Evaluation.Quality/Utilities/ParsingUtilities.cs
peterwald reviewed on Apr 24, 2025:
src/Libraries/Microsoft.Extensions.AI.Evaluation/ChatMessageExtensions.cs
peterwald reviewed on Apr 24, 2025:
src/Libraries/Microsoft.Extensions.AI.Evaluation/ChatMessageExtensions.cs
Force-pushed from 7872210 to e940059
Force-pushed from e940059 to 4197c87
peterwald approved these changes on Apr 25, 2025
Force-pushed from 7df91da to ea138d2
Introduces a Name property on EvaluationContext and changes the GetContents() method to a Contents property.
Force-pushed from ea138d2 to 23a592d
This was referenced May 5, 2025
The prompts for these evaluators now match those that are used within the corresponding Python evaluators in the Azure AI Python SDK.
This PR also:

- Removes `ChatConversationEvaluator` and `SingleNumericScoreEvaluator` from the Quality package. These types were becoming unwieldy, and we had to tweak their public API every time we introduced a new derived evaluator (each with slightly different requirements). One reason this proved brittle is that each derived type supplied its own prompt, while the base types tried to generalize shared aspects of the implementation such as input validation and LLM response parsing. These aspects started to diverge as the prompts used in the derived evaluators diverged (both when prompts for some of the existing evaluators were updated, and when new derived evaluators that used different styles of prompts were added to the mix). Preserving these base types as part of the public API would have added risk to our GA plans for the evaluation libraries (since we would no longer have the luxury of making similar tweaks to the public base type contract post GA), so we decided it would be best to remove them.

- Introduces a set of more granular helper extension methods that can be used when authoring evaluators (and updates all the evaluators in the Quality package to use these helpers instead of the base types above). These helper extension methods (added as part of the core `Microsoft.Extensions.AI.Evaluation` package) make it easy to accomplish common tasks such as rendering conversation history as part of an evaluation prompt, extracting the final user request from the conversation history, and recording metadata present in the evaluation `ChatResponse` as part of the `EvaluationMetric` produced by an evaluator. Evaluator-specific logic such as input validation and evaluation prompt construction is now handled individually within each Quality evaluator. Parsing of the evaluation responses is also handled via a couple of new extension method helpers; however, because the response parsing logic is specific to the prompts used within the Quality evaluators, these helpers have been added to the `Microsoft.Extensions.AI.Evaluation.Quality` package and are internal to that package.

- Introduces 3 new evaluators as part of the `Microsoft.Extensions.AI.Evaluation.Quality` package: `RetrievalEvaluator`, `RelevanceEvaluator` and `CompletenessEvaluator` (see the usage sketch after this list).

- Marks the `RelevanceTruthAndCompletenessEvaluator` as `Experimental` since it needs some updates (such as [AI Evaluation] Add grounding EvaluationContext for RelevanceTruthAndCompletenessEvaluator #6294), and since we now also have new dedicated evaluators for Relevance and Completeness.

- Removes `abstract` from `ContentHarmEvaluator`. It is now possible to instantiate `ContentHarmEvaluator` and use this instance to evaluate all the content harm metrics via a single request to the Azure AI Content Safety service.

- Introduces a `Name` property on `EvaluationContext` and replaces the `GetContents()` method with a `Contents` property (see the second sketch below).

- Simplifies how context is included in metrics. `EvaluationContext` objects can now be added directly to `EvaluationMetric`s. However, as before, only the properties on the base `EvaluationContext` type (i.e. `Name` and `Contents`) are included when serializing and deserializing metrics as part of the result storage / reporting functionality.

TODO:
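To make the shape of the new evaluators concrete, here is a minimal usage sketch (not taken from this PR). It assumes an `IChatClient` is available to run the evaluation prompts, and that `RelevanceEvaluator` follows the same conventions as the existing Quality evaluators (a parameterless constructor, a `RelevanceMetricName` constant, and a `NumericMetric` retrievable via `EvaluationResult.Get<T>`); treat the exact names and signatures as illustrative rather than definitive.

```csharp
using System;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;

// Placeholder: supply the IChatClient for the LLM that should run the evaluation prompts.
IChatClient chatClient = /* e.g. an Azure OpenAI chat client */ null!;
var chatConfiguration = new ChatConfiguration(chatClient);

// The conversation being evaluated: the user's request plus the model's response.
var messages = new[] { new ChatMessage(ChatRole.User, "How far is the Moon from the Earth?") };
var modelResponse = new ChatResponse(
    new ChatMessage(ChatRole.Assistant, "The Moon is about 384,400 km from the Earth on average."));

// One of the new evaluators introduced in this PR.
IEvaluator relevanceEvaluator = new RelevanceEvaluator();

EvaluationResult result =
    await relevanceEvaluator.EvaluateAsync(messages, modelResponse, chatConfiguration);

// Each evaluator reports one or more named metrics on the result.
NumericMetric relevance = result.Get<NumericMetric>(RelevanceEvaluator.RelevanceMetricName);
Console.WriteLine($"Relevance: {relevance.Value} ({relevance.Interpretation?.Rating})");
```

`CompletenessEvaluator` and `RetrievalEvaluator` follow the same pattern; `RetrievalEvaluator` would additionally need the retrieved context chunks supplied via the `additionalContext` parameter.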
Fixes #5952 and #6028
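Relatedly, a rough sketch of the reworked `EvaluationContext` shape described above. `GroundingContext` is a made-up example type, and the base-class constructor used here is an assumption; the point the PR makes is simply that a context now exposes a `Name` property and a `Contents` property instead of a `GetContents()` method.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;

// Contexts like this can be passed to an evaluator via the additionalContext parameter
// of EvaluateAsync, and can now also be attached directly to the EvaluationMetrics that
// an evaluator produces (only Name and Contents round-trip through result storage / reporting).
var grounding = new GroundingContext(new[] { "The Moon orbits the Earth at roughly 384,400 km." });
Console.WriteLine($"{grounding.Name}: {grounding.Contents.Count} content item(s)");

// Hypothetical example type; the base constructor shape shown here is assumed, not confirmed by the PR.
public sealed class GroundingContext : EvaluationContext
{
    public GroundingContext(IEnumerable<string> passages)
        : base("Grounding Context", passages.Select(p => (AIContent)new TextContent(p)).ToArray())
    {
    }
}
```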
Microsoft Reviewers: Open in CodeFlow