
Conversation

shyamnamboodiripad (Contributor) commented on Apr 24, 2025

The prompts for these evaluators now match those that are used within the corresponding Python evaluators in the Azure AI Python SDK.

This PR also:

  • Removes ChatConversationEvaluator and SingleNumericScoreEvaluator from the Quality package.

    These types were becoming unwieldy, and we had to tweak their public API every time we introduced a new derived evaluator (each with slightly different requirements). One reason this proved brittle is that each derived type supplied its own prompt, while the base types tried to generalize shared aspects of the implementation such as input validation and LLM response parsing. These aspects started to diverge as the prompts used in the derived evaluators diverged, both when prompts for existing evaluators were updated and when new derived evaluators using different styles of prompts were added to the mix.

    Preserving these base types as part of the public API would have added risk to our GA plans for the evaluation libraries, since we would no longer have the luxury of making similar tweaks to the public base type contract post GA. So we decided it was best to remove them.

  • Introduces a set of more granular helper extension methods that can be used when authoring evaluators (and updates all the evaluators in the Quality package to use these helpers instead of the base types above).

    These helper extension methods (added as part of the core Microsoft.Extensions.AI.Evaluation package) make it easy to accomplish common tasks such as rendering conversation history as part of an evaluation prompt, extracting the final user request from the conversation history, and recording metadata present in the evaluation ChatResponse as part of the EvaluationMetric produced by an evaluator.

    Evaluator-specific logic such as input validation and evaluation prompt construction is now handled individually within each Quality evaluator. Parsing of the evaluation responses is also handled via a couple of new extension method helpers; however, because the response parsing logic is specific to the prompts used within the Quality evaluators, these helpers have been added to the Microsoft.Extensions.AI.Evaluation.Quality package and are internal to that package. (A hedged sketch of what authoring an evaluator against the revised shape might look like appears after this list.)

  • Introduces 3 new evaluators as part of the Microsoft.Extensions.AI.Evaluation.Quality package: RetrievalEvaluator, RelevanceEvaluator and CompletenessEvaluator. (A usage sketch appears after this list.)

  • Marks RelevanceTruthAndCompletenessEvaluator as Experimental, since it needs some updates (such as #6294: [AI Evaluation] Add grounding EvaluationContext for RelevanceTruthAndCompletenessEvaluator) and since we now have new dedicated evaluators for Relevance and Completeness.

  • Removes the abstract modifier from ContentHarmEvaluator. It is now possible to instantiate ContentHarmEvaluator directly and use a single instance to evaluate all content harm metrics via a single request to the Azure AI Content Safety service.

  • Introduces a Name property for EvaluationContext and replaces the GetContents method with a Contents property.

  • Simplifies how context is included in metrics. EvaluationContext objects can now be added directly to EvaluationMetrics. However, as before, only the properties on the base EvaluationContext type (i.e. Name and Contents) are included when metrics are serialized and deserialized as part of the result storage / reporting functionality.
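
To make the above more concrete, here is a minimal, hypothetical sketch of what authoring a small evaluator and context against the revised shape might look like. The `GroundingContext` and `MyCustomEvaluator` types, the `EvaluationContext` base-constructor arguments, and the `AddOrUpdateContext` helper name are illustrative placeholders only, not the actual helper APIs added by this PR:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;

// Hypothetical sketch: a custom EvaluationContext exposing the new Name / Contents shape.
// The base-constructor arguments below are an assumption based on the description above.
public sealed class GroundingContext : EvaluationContext
{
    public GroundingContext(string groundingText)
        : base(name: "Grounding Text", content: new TextContent(groundingText))
    {
    }
}

public sealed class MyCustomEvaluator : IEvaluator
{
    public const string MetricName = "My Custom Metric";

    public IReadOnlyCollection<string> EvaluationMetricNames => new[] { MetricName };

    public ValueTask<EvaluationResult> EvaluateAsync(
        IEnumerable<ChatMessage> messages,
        ChatResponse modelResponse,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<EvaluationContext>? additionalContext = null,
        CancellationToken cancellationToken = default)
    {
        var metric = new NumericMetric(MetricName);

        // Per the new model, an EvaluationContext supplied by the caller can be attached
        // directly to the metric. AddOrUpdateContext is a placeholder name; the real helper
        // introduced by this PR may be named differently.
        if (additionalContext?.OfType<GroundingContext>().FirstOrDefault() is { } context)
        {
            metric.AddOrUpdateContext(context);
        }

        // A real evaluator would construct an evaluation prompt here (e.g. by rendering the
        // conversation history and final user request via the new helper extension methods),
        // invoke chatConfiguration!.ChatClient, parse the response, and set metric.Value.

        return new ValueTask<EvaluationResult>(new EvaluationResult(metric));
    }
}
```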

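Similarly, here is a rough, hedged sketch of invoking one of the new evaluators. The `ScoreRelevanceAsync` wrapper is invented for illustration, and the metric-name constant and `Get<NumericMetric>` accessor shown are assumptions to verify against the shipped package:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;

public static class RelevanceExample
{
    // Scores how relevant `response` is to the conversation in `messages`, using `chatClient`
    // as the LLM that performs the evaluation.
    public static async Task<double?> ScoreRelevanceAsync(
        IChatClient chatClient,
        IEnumerable<ChatMessage> messages,
        ChatResponse response)
    {
        IEvaluator relevance = new RelevanceEvaluator();
        var chatConfiguration = new ChatConfiguration(chatClient);

        EvaluationResult result = await relevance.EvaluateAsync(
            messages, response, chatConfiguration);

        // Assumed metric-name constant and accessor; check the package for the exact API.
        NumericMetric metric = result.Get<NumericMetric>(RelevanceEvaluator.RelevanceMetricName);
        return metric.Value;
    }
}
```

RetrievalEvaluator and CompletenessEvaluator would presumably be invoked the same way, supplying any additional EvaluationContext they require (e.g. retrieved content or ground truth) via the additionalContext parameter.
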
TODO:

  • Add / update tests for some of the new evaluators / functionality above
  • Some (mostly minor) cleanup and refactoring

Fixes #5952 and #6028


github-actions bot added the area-ai-eval (Microsoft.Extensions.AI.Evaluation and related) label on Apr 24, 2025
shyamnamboodiripad force-pushed the quality branch 4 times, most recently from 7872210 to e940059 on April 25, 2025 at 10:53
shyamnamboodiripad changed the title from "[WIP] Update prompts for quality evaluators" to "Update prompts for quality evaluators" on April 25, 2025
shyamnamboodiripad marked this pull request as ready for review on April 25, 2025 at 11:27
shyamnamboodiripad requested a review from a team as a code owner on April 25, 2025 at 11:27
shyamnamboodiripad force-pushed the quality branch 4 times, most recently from 7df91da to ea138d2 on April 28, 2025 at 10:51