Skip to content

Commit 02b50d9

Browse files
singankitthecswpdhotemsJoseCSantosghyadav
authored
Agent Evaluators (#40202)
* Tool Call Accuracy Evaluator (#40068) * Tool Call Accuracy Evaluator * Review comments * Updating score key and output structure * Tool Call Accuracy Evaluator * Review comments * Updating score key and output structure * Updating prompt * Renaming parameter * Converter from AI Service threads/runs to evaluator-compatible schema (#40047) * WIP AIAgentConverter * Added the v1 of the converter * Updated the AIAgentConverter with different output schemas. * ruff format * Update the top schema to have: query, response, tool_definitions * "agentic" is not a recognized word, change the wording. * System message always comes first in query with multiple runs. * Add support for getting inputs from local files with run_ids. * Export AIAgentConverter through azure.ai.evaluation, local read updates * Use from ._models import * Ruff format again. * For ComputeInstance and AmlCompute update disableLocalAuth property based on ssh_public_access (#39934) * add disableLocalAuth for computeInstance * fix disableLocalAuthAuth issue for amlCompute * update compute instance * update recordings * temp changes * Revert "temp changes" This reverts commit 64e3c38. * update recordings * fix tests * Simplify the API by rolling up the static methods and hiding internals. * Lock the ._converters._ai_services behind an import error. * Print to install azure-ai-projects if we can't import AIAgentConverter * By default, include all previous runs' tool calls and results. * Don't crash if there is no content in historical thread messages. * Parallelize the calls to get step_details for each run_id. * Addressing PR comments. * Use a single underscore to hide internal static members. --------- Co-authored-by: Prashant Dhote <168401122+pdhotems@users.noreply.github.com> * Adding intent_resolution_evaluator to prp/agent_evaluators branch (#40065) * Add intent resolution evaluator * updated intent_resolution evaluator logic * Remove spurious print statements * Address reviewers feedback * add threshold key, update result to pass/fail rather than True/False * Add example + remove repeated fields * Harden check_score_is_valid function * Add Task Adherence and Completeness (#40098) * Agentic Evaluator - Response Completeness * Added Change Log for Response Completeness Agentic Evaluator * Task Adherence Agentic Evaluator * Add Task Adherence Evaluator to changelog * fixing contracts for Completeness and Task Adherence Evaluators * Enhancing Contract for Task Adherence and Response Completeness Agentic Evaluator * update completeness implementation. * update the completeness evaluator response to include threshold comparison. * updating the implementation for completeness. * updating the type for completeness score. * updating the parsing logic for llm output of completeness. * updating the response dict for completeness. * Adding Task adherence * Adding Task Adherence evaluator with samples * Delete old files * updating the exception for completeness evaluator. * Changing docstring * Adding changelog * Use _result_key * Add admonition --------- Co-authored-by: Shiprajain01 <shiprajain01@microsoft.com> Co-authored-by: ShipraJain01 <103409614+ShipraJain01@users.noreply.github.com> Co-authored-by: Chandra Sekhar Gupta Aravapalli <caravapalli@microsoft.com> * Adding bug bash sample and instructions (#40125) * Adding bug bash sample and instructions * Updating instructions * Update instructions.md * Adding instructions and evaluator to agent evaluation sample * Raising error when tool call not found * add bug bash sample notebook for response completeness evaluator. (#40139) * add bug bash sample notebook for response completeness evaluator. * update the notebook for completeness. --------- Co-authored-by: Chandra Sekhar Gupta Aravapalli <caravapalli@microsoft.com> * Sample specific for tool call accuracy evaluator (#40135) * Update instructions.md * Add IntentResolution evaluator bug bash notebook (#40144) * Add intent resolution evaluator * updated intent_resolution evaluator logic * Remove spurious print statements * Address reviewers feedback * add threshold key, update result to pass/fail rather than True/False * Add example + remove repeated fields * Harden check_score_is_valid function * Sample notebook to demo intent_resolution evaluator * Add synthetic data and section on how to test data from disk * Update instructions.md * Improve task adherence prompt and add sample notebook for bugbash (#40146) * For ComputeInstance and AmlCompute update disableLocalAuth property based on ssh_public_access (#39934) * add disableLocalAuth for computeInstance * fix disableLocalAuthAuth issue for amlCompute * update compute instance * update recordings * temp changes * Revert "temp changes" This reverts commit 64e3c38. * update recordings * fix tests * Add resource prefix for safe secret standard alerts (#40028) Add the prefix to identify RGs that we are creating in our TME tenant to identify them as potentially using local auth and violating our safe secret standards. Co-authored-by: Wes Haggard <Wes.Haggard@microsoft.com> * Add examples to task_adherence prompt. Add Task Adherence sample notebook * Undo changes to New-TestResources.ps1 * Add sample .env file --------- Co-authored-by: Prashant Dhote <168401122+pdhotems@users.noreply.github.com> Co-authored-by: Azure SDK Bot <53356347+azure-sdk@users.noreply.github.com> Co-authored-by: Wes Haggard <Wes.Haggard@microsoft.com> * [AIAgentConverter] Added support for converting entire threads. (#40178) * Implemented prepare_evaluation_data * Add support for retrieving multiple threads into the same file. * Parallelize thread preparing across threads. * Set the maximum number of workers in thread pools to 10. * Users/singankit/tool call accuracy evaluator tests (#40190) * Raising error when tool call not found * Adding unit tests for tool call accuracy evaluator * Updating sample * Fixng doc strings and moving sample to a different agent sample folder * Fixing bug with raising exception * Fixing rebase issue * Removing cell outputs to fix spell check erros * Fixing failing tests * Fixing failing test * Spon/update evals converter (#40204) * Tool Call Accuracy Evaluator (#40068) * Tool Call Accuracy Evaluator * Review comments * Updating score key and output structure * Tool Call Accuracy Evaluator * Review comments * Updating score key and output structure * Updating prompt * Renaming parameter * Converter from AI Service threads/runs to evaluator-compatible schema (#40047) * WIP AIAgentConverter * Added the v1 of the converter * Updated the AIAgentConverter with different output schemas. * ruff format * Update the top schema to have: query, response, tool_definitions * "agentic" is not a recognized word, change the wording. * System message always comes first in query with multiple runs. * Add support for getting inputs from local files with run_ids. * Export AIAgentConverter through azure.ai.evaluation, local read updates * Use from ._models import * Ruff format again. * For ComputeInstance and AmlCompute update disableLocalAuth property based on ssh_public_access (#39934) * add disableLocalAuth for computeInstance * fix disableLocalAuthAuth issue for amlCompute * update compute instance * update recordings * temp changes * Revert "temp changes" This reverts commit 64e3c38. * update recordings * fix tests * Simplify the API by rolling up the static methods and hiding internals. * Lock the ._converters._ai_services behind an import error. * Print to install azure-ai-projects if we can't import AIAgentConverter * By default, include all previous runs' tool calls and results. * Don't crash if there is no content in historical thread messages. * Parallelize the calls to get step_details for each run_id. * Addressing PR comments. * Use a single underscore to hide internal static members. --------- Co-authored-by: Prashant Dhote <168401122+pdhotems@users.noreply.github.com> * Adding intent_resolution_evaluator to prp/agent_evaluators branch (#40065) * Add intent resolution evaluator * updated intent_resolution evaluator logic * Remove spurious print statements * Address reviewers feedback * add threshold key, update result to pass/fail rather than True/False * Add example + remove repeated fields * Harden check_score_is_valid function * Add Task Adherence and Completeness (#40098) * Agentic Evaluator - Response Completeness * Added Change Log for Response Completeness Agentic Evaluator * Task Adherence Agentic Evaluator * Add Task Adherence Evaluator to changelog * fixing contracts for Completeness and Task Adherence Evaluators * Enhancing Contract for Task Adherence and Response Completeness Agentic Evaluator * update completeness implementation. * update the completeness evaluator response to include threshold comparison. * updating the implementation for completeness. * updating the type for completeness score. * updating the parsing logic for llm output of completeness. * updating the response dict for completeness. * Adding Task adherence * Adding Task Adherence evaluator with samples * Delete old files * updating the exception for completeness evaluator. * Changing docstring * Adding changelog * Use _result_key * Add admonition --------- Co-authored-by: Shiprajain01 <shiprajain01@microsoft.com> Co-authored-by: ShipraJain01 <103409614+ShipraJain01@users.noreply.github.com> Co-authored-by: Chandra Sekhar Gupta Aravapalli <caravapalli@microsoft.com> * Adding bug bash sample and instructions (#40125) * Adding bug bash sample and instructions * Updating instructions * Update instructions.md * Adding instructions and evaluator to agent evaluation sample * add bug bash sample notebook for response completeness evaluator. (#40139) * add bug bash sample notebook for response completeness evaluator. * update the notebook for completeness. --------- Co-authored-by: Chandra Sekhar Gupta Aravapalli <caravapalli@microsoft.com> * Sample specific for tool call accuracy evaluator (#40135) * Update instructions.md * Add IntentResolution evaluator bug bash notebook (#40144) * Add intent resolution evaluator * updated intent_resolution evaluator logic * Remove spurious print statements * Address reviewers feedback * add threshold key, update result to pass/fail rather than True/False * Add example + remove repeated fields * Harden check_score_is_valid function * Sample notebook to demo intent_resolution evaluator * Add synthetic data and section on how to test data from disk * Update instructions.md * Update _tool_call_accuracy.py * Improve task adherence prompt and add sample notebook for bugbash (#40146) * For ComputeInstance and AmlCompute update disableLocalAuth property based on ssh_public_access (#39934) * add disableLocalAuth for computeInstance * fix disableLocalAuthAuth issue for amlCompute * update compute instance * update recordings * temp changes * Revert "temp changes" This reverts commit 64e3c38. * update recordings * fix tests * Add resource prefix for safe secret standard alerts (#40028) Add the prefix to identify RGs that we are creating in our TME tenant to identify them as potentially using local auth and violating our safe secret standards. Co-authored-by: Wes Haggard <Wes.Haggard@microsoft.com> * Add examples to task_adherence prompt. Add Task Adherence sample notebook * Undo changes to New-TestResources.ps1 * Add sample .env file --------- Co-authored-by: Prashant Dhote <168401122+pdhotems@users.noreply.github.com> Co-authored-by: Azure SDK Bot <53356347+azure-sdk@users.noreply.github.com> Co-authored-by: Wes Haggard <Wes.Haggard@microsoft.com> * [AIAgentConverter] Added support for converting entire threads. (#40178) * Implemented prepare_evaluation_data * Add support for retrieving multiple threads into the same file. * Parallelize thread preparing across threads. * Set the maximum number of workers in thread pools to 10. * Users/singankit/tool call accuracy evaluator tests (#40190) * Raising error when tool call not found * Adding unit tests for tool call accuracy evaluator * Updating sample * update output of converter for tool calls * add built-ins * handle file search * remove extra files * revert * revert --------- Co-authored-by: Ankit Singhal <30610298+singankit@users.noreply.github.com> Co-authored-by: Sandy <16922860+thecsw@users.noreply.github.com> Co-authored-by: Prashant Dhote <168401122+pdhotems@users.noreply.github.com> Co-authored-by: Jose Santos <jcsantos@microsoft.com> Co-authored-by: ghyadav <103428325+ghyadav@users.noreply.github.com> Co-authored-by: Shiprajain01 <shiprajain01@microsoft.com> Co-authored-by: ShipraJain01 <103409614+ShipraJain01@users.noreply.github.com> Co-authored-by: Chandra Sekhar Gupta Aravapalli <caravapalli@microsoft.com> Co-authored-by: Ankit Singhal <anksing@microsoft.com> Co-authored-by: Chandra Sekhar Gupta <38103118+guptha23@users.noreply.github.com> Co-authored-by: Azure SDK Bot <53356347+azure-sdk@users.noreply.github.com> Co-authored-by: Wes Haggard <Wes.Haggard@microsoft.com> Co-authored-by: spon <stevenpon@microsoft.com> * Updating tool call accuracy to use update tool call schema * Update response completeness evaluator based on schema (#40214) * add seed parameter for deterministic results and update completeness return type based on schema. * add unit tests for response completeness. * update completeness key to response completeness. * fixing the unit test for updated completeness key. * updating completeness to responsecompleteness evaluator. * clearing output in response completeness sample notebook. * clearing output in response completeness sample notebook. --------- Co-authored-by: Chandra Sekhar Gupta Aravapalli <caravapalli@microsoft.com> * Spon/update evals converter (#40215) * Tool Call Accuracy Evaluator (#40068) * Tool Call Accuracy Evaluator * Review comments * Updating score key and output structure * Tool Call Accuracy Evaluator * Review comments * Updating score key and output structure * Updating prompt * Renaming parameter * Converter from AI Service threads/runs to evaluator-compatible schema (#40047) * WIP AIAgentConverter * Added the v1 of the converter * Updated the AIAgentConverter with different output schemas. * ruff format * Update the top schema to have: query, response, tool_definitions * "agentic" is not a recognized word, change the wording. * System message always comes first in query with multiple runs. * Add support for getting inputs from local files with run_ids. * Export AIAgentConverter through azure.ai.evaluation, local read updates * Use from ._models import * Ruff format again. * For ComputeInstance and AmlCompute update disableLocalAuth property based on ssh_public_access (#39934) * add disableLocalAuth for computeInstance * fix disableLocalAuthAuth issue for amlCompute * update compute instance * update recordings * temp changes * Revert "temp changes" This reverts commit 64e3c38. * update recordings * fix tests * Simplify the API by rolling up the static methods and hiding internals. * Lock the ._converters._ai_services behind an import error. * Print to install azure-ai-projects if we can't import AIAgentConverter * By default, include all previous runs' tool calls and results. * Don't crash if there is no content in historical thread messages. * Parallelize the calls to get step_details for each run_id. * Addressing PR comments. * Use a single underscore to hide internal static members. --------- Co-authored-by: Prashant Dhote <168401122+pdhotems@users.noreply.github.com> * Adding intent_resolution_evaluator to prp/agent_evaluators branch (#40065) * Add intent resolution evaluator * updated intent_resolution evaluator logic * Remove spurious print statements * Address reviewers feedback * add threshold key, update result to pass/fail rather than True/False * Add example + remove repeated fields * Harden check_score_is_valid function * Add Task Adherence and Completeness (#40098) * Agentic Evaluator - Response Completeness * Added Change Log for Response Completeness Agentic Evaluator * Task Adherence Agentic Evaluator * Add Task Adherence Evaluator to changelog * fixing contracts for Completeness and Task Adherence Evaluators * Enhancing Contract for Task Adherence and Response Completeness Agentic Evaluator * update completeness implementation. * update the completeness evaluator response to include threshold comparison. * updating the implementation for completeness. * updating the type for completeness score. * updating the parsing logic for llm output of completeness. * updating the response dict for completeness. * Adding Task adherence * Adding Task Adherence evaluator with samples * Delete old files * updating the exception for completeness evaluator. * Changing docstring * Adding changelog * Use _result_key * Add admonition --------- Co-authored-by: Shiprajain01 <shiprajain01@microsoft.com> Co-authored-by: ShipraJain01 <103409614+ShipraJain01@users.noreply.github.com> Co-authored-by: Chandra Sekhar Gupta Aravapalli <caravapalli@microsoft.com> * Adding bug bash sample and instructions (#40125) * Adding bug bash sample and instructions * Updating instructions * Update instructions.md * Adding instructions and evaluator to agent evaluation sample * add bug bash sample notebook for response completeness evaluator. (#40139) * add bug bash sample notebook for response completeness evaluator. * update the notebook for completeness. --------- Co-authored-by: Chandra Sekhar Gupta Aravapalli <caravapalli@microsoft.com> * Sample specific for tool call accuracy evaluator (#40135) * Update instructions.md * Add IntentResolution evaluator bug bash notebook (#40144) * Add intent resolution evaluator * updated intent_resolution evaluator logic * Remove spurious print statements * Address reviewers feedback * add threshold key, update result to pass/fail rather than True/False * Add example + remove repeated fields * Harden check_score_is_valid function * Sample notebook to demo intent_resolution evaluator * Add synthetic data and section on how to test data from disk * Update instructions.md * Update _tool_call_accuracy.py * Improve task adherence prompt and add sample notebook for bugbash (#40146) * For ComputeInstance and AmlCompute update disableLocalAuth property based on ssh_public_access (#39934) * add disableLocalAuth for computeInstance * fix disableLocalAuthAuth issue for amlCompute * update compute instance * update recordings * temp changes * Revert "temp changes" This reverts commit 64e3c38. * update recordings * fix tests * Add resource prefix for safe secret standard alerts (#40028) Add the prefix to identify RGs that we are creating in our TME tenant to identify them as potentially using local auth and violating our safe secret standards. Co-authored-by: Wes Haggard <Wes.Haggard@microsoft.com> * Add examples to task_adherence prompt. Add Task Adherence sample notebook * Undo changes to New-TestResources.ps1 * Add sample .env file --------- Co-authored-by: Prashant Dhote <168401122+pdhotems@users.noreply.github.com> Co-authored-by: Azure SDK Bot <53356347+azure-sdk@users.noreply.github.com> Co-authored-by: Wes Haggard <Wes.Haggard@microsoft.com> * [AIAgentConverter] Added support for converting entire threads. (#40178) * Implemented prepare_evaluation_data * Add support for retrieving multiple threads into the same file. * Parallelize thread preparing across threads. * Set the maximum number of workers in thread pools to 10. * Users/singankit/tool call accuracy evaluator tests (#40190) * Raising error when tool call not found * Adding unit tests for tool call accuracy evaluator * Updating sample * update output of converter for tool calls * add built-ins * handle file search * remove extra files * revert * revert * fix built-in tool parsing bug * remove local debug * Formatted and updated the converter to avoid built-in tool crashes. * Added an experimental decorator to AIAgentConverter * Update import path for experimental decorator --------- Co-authored-by: Ankit Singhal <30610298+singankit@users.noreply.github.com> Co-authored-by: Sandy <16922860+thecsw@users.noreply.github.com> Co-authored-by: Prashant Dhote <168401122+pdhotems@users.noreply.github.com> Co-authored-by: Jose Santos <jcsantos@microsoft.com> Co-authored-by: ghyadav <103428325+ghyadav@users.noreply.github.com> Co-authored-by: Shiprajain01 <shiprajain01@microsoft.com> Co-authored-by: ShipraJain01 <103409614+ShipraJain01@users.noreply.github.com> Co-authored-by: Chandra Sekhar Gupta Aravapalli <caravapalli@microsoft.com> Co-authored-by: Ankit Singhal <anksing@microsoft.com> Co-authored-by: Chandra Sekhar Gupta <38103118+guptha23@users.noreply.github.com> Co-authored-by: Azure SDK Bot <53356347+azure-sdk@users.noreply.github.com> Co-authored-by: Wes Haggard <Wes.Haggard@microsoft.com> Co-authored-by: spon <stevenpon@microsoft.com> Co-authored-by: Sandy Urazayev <surazayev@microsoft.com> * Adding instructions file * Updating default scores * Fix test case for non-agent tool call --------- Co-authored-by: Sandy <16922860+thecsw@users.noreply.github.com> Co-authored-by: Prashant Dhote <168401122+pdhotems@users.noreply.github.com> Co-authored-by: Jose Santos <jcsantos@microsoft.com> Co-authored-by: ghyadav <103428325+ghyadav@users.noreply.github.com> Co-authored-by: Shiprajain01 <shiprajain01@microsoft.com> Co-authored-by: ShipraJain01 <103409614+ShipraJain01@users.noreply.github.com> Co-authored-by: Chandra Sekhar Gupta Aravapalli <caravapalli@microsoft.com> Co-authored-by: Chandra Sekhar Gupta <38103118+guptha23@users.noreply.github.com> Co-authored-by: Azure SDK Bot <53356347+azure-sdk@users.noreply.github.com> Co-authored-by: Wes Haggard <Wes.Haggard@microsoft.com> Co-authored-by: stevepon <steven@ponshop.net> Co-authored-by: spon <stevenpon@microsoft.com> Co-authored-by: Sandy Urazayev <surazayev@microsoft.com>
1 parent deb1554 commit 02b50d9

39 files changed

+5162
-13
lines changed

.vscode/cspell.json

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -68,6 +68,7 @@
6868
"sdk/digitaltwins/azure-digitaltwins-core/**",
6969
"sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_vendor/**",
7070
"sdk/evaluation/azure-ai-evaluation/tests/**",
71+
"sdk/evaluation/azure-ai-evaluation/samples/agent_evaluators/**",
7172
"sdk/eventhub/azure-eventhub-checkpointstoretable/**",
7273
"sdk/eventhub/azure-eventhub-checkpointstoreblob-aio/**",
7374
"sdk/eventhub/azure-eventhub/**",
@@ -1401,7 +1402,10 @@
14011402
"upia",
14021403
"xpia",
14031404
"expirable",
1404-
"ralphe"
1405+
"ralphe",
1406+
"Inadherent",
1407+
"nbformat",
1408+
"nbconvert",
14051409
]
14061410
},
14071411
{

sdk/evaluation/azure-ai-evaluation/CHANGELOG.md

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -45,6 +45,12 @@
4545
- emotional_state
4646
- protected_class
4747
- groundedness
48+
- New Built-in evaluators for Agent Evaluation (Preview)
49+
- IntentResolutionEvaluator - Evaluates the intent resolution of an agent's response to a user query.
50+
- ResponseCompletenessEvaluator - Evaluates the response completeness of an agent's response to a user query.
51+
- TaskAdherenceEvaluator - Evaluates the task adherence of an agent's response to a user query.
52+
- ToolCallAccuracyEvaluator - Evaluates the accuracy of tool calls made by an agent in response to a user query.
53+
4854

4955
### Breaking Changes
5056

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/__init__.py

Lines changed: 20 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -17,16 +17,20 @@
1717
from ._evaluators._gleu import GleuScoreEvaluator
1818
from ._evaluators._groundedness import GroundednessEvaluator
1919
from ._evaluators._service_groundedness import GroundednessProEvaluator
20+
from ._evaluators._intent_resolution import IntentResolutionEvaluator
2021
from ._evaluators._meteor import MeteorScoreEvaluator
2122
from ._evaluators._protected_material import ProtectedMaterialEvaluator
2223
from ._evaluators._qa import QAEvaluator
24+
from ._evaluators._response_completeness import ResponseCompletenessEvaluator
25+
from ._evaluators._task_adherence import TaskAdherenceEvaluator
2326
from ._evaluators._relevance import RelevanceEvaluator
2427
from ._evaluators._retrieval import RetrievalEvaluator
2528
from ._evaluators._rouge import RougeScoreEvaluator, RougeType
2629
from ._evaluators._similarity import SimilarityEvaluator
2730
from ._evaluators._xpia import IndirectAttackEvaluator
2831
from ._evaluators._code_vulnerability import CodeVulnerabilityEvaluator
2932
from ._evaluators._ungrounded_attributes import UngroundedAttributesEvaluator
33+
from ._evaluators._tool_call_accuracy import ToolCallAccuracyEvaluator
3034
from ._model_configurations import (
3135
AzureAIProject,
3236
AzureOpenAIModelConfiguration,
@@ -37,13 +41,26 @@
3741
OpenAIModelConfiguration,
3842
)
3943

44+
# The converter from the AI service to the evaluator schema requires a dependency on
45+
# ai.projects, but we also don't want to force users installing ai.evaluations to pull
46+
# in ai.projects. So we only import it if it's available and the user has ai.projects.
47+
try:
48+
from ._converters._ai_services import AIAgentConverter
49+
_patch_all = ["AIAgentConverter"]
50+
except ImportError:
51+
print("Could not import AIAgentConverter. Please install the dependency with `pip install azure-ai-projects`.")
52+
_patch_all = []
53+
4054
__all__ = [
4155
"evaluate",
4256
"CoherenceEvaluator",
4357
"F1ScoreEvaluator",
4458
"FluencyEvaluator",
4559
"GroundednessEvaluator",
4660
"GroundednessProEvaluator",
61+
"ResponseCompletenessEvaluator",
62+
"TaskAdherenceEvaluator",
63+
"IntentResolutionEvaluator",
4764
"RelevanceEvaluator",
4865
"SimilarityEvaluator",
4966
"QAEvaluator",
@@ -69,4 +86,7 @@
6986
"EvaluationResult",
7087
"CodeVulnerabilityEvaluator",
7188
"UngroundedAttributesEvaluator",
89+
"ToolCallAccuracyEvaluator",
7290
]
91+
92+
__all__.extend([p for p in _patch_all if p not in __all__])

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_common/constants.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -5,8 +5,8 @@
55

66
from azure.core import CaseInsensitiveEnumMeta
77

8-
9-
PROMPT_BASED_REASON_EVALUATORS = ["coherence", "relevance", "retrieval", "groundedness", "fluency"]
8+
PROMPT_BASED_REASON_EVALUATORS = ["coherence", "relevance", "retrieval", "groundedness", "fluency", "intent_resolution",
9+
"tool_call_accurate", "response_completeness", "task_adherence"]
1010

1111

1212
class CommonConstants:

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_common/utils.py

Lines changed: 22 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -274,8 +274,26 @@ def validate_annotation(v: object, annotation: Union[str, type, object]) -> bool
274274

275275
return cast(T_TypedDict, o)
276276

277+
def check_score_is_valid(score: Union[str, float], min_score = 1, max_score = 5) -> bool:
278+
"""Check if the score is valid, i.e. is convertable to number and is in the range [min_score, max_score].
279+
280+
:param score: The score to check.
281+
:type score: Union[str, float]
282+
:param min_score: The minimum score. Default is 1.
283+
:type min_score: int
284+
:param max_score: The maximum score. Default is 5.
285+
:type max_score: int
286+
:return: True if the score is valid, False otherwise.
287+
:rtype: bool
288+
"""
289+
try:
290+
numeric_score = float(score)
291+
except (ValueError, TypeError):
292+
return False
293+
294+
return min_score <= numeric_score <= max_score
277295

278-
def parse_quality_evaluator_reason_score(llm_output: str) -> Tuple[float, str]:
296+
def parse_quality_evaluator_reason_score(llm_output: str, valid_score_range: str = "[1-5]") -> Tuple[float, str]:
279297
"""Parse the output of prompt-based quality evaluators that return a score and reason.
280298
281299
Current supported evaluators:
@@ -284,6 +302,8 @@ def parse_quality_evaluator_reason_score(llm_output: str) -> Tuple[float, str]:
284302
- Retrieval
285303
- Groundedness
286304
- Coherence
305+
- ResponseCompleteness
306+
- TaskAdherence
287307
288308
:param llm_output: The output of the prompt-based quality evaluator.
289309
:type llm_output: str
@@ -294,7 +314,7 @@ def parse_quality_evaluator_reason_score(llm_output: str) -> Tuple[float, str]:
294314
reason = ""
295315
if llm_output:
296316
try:
297-
score_pattern = r"<S2>\D*?([1-5]).*?</S2>"
317+
score_pattern = rf"<S2>\D*?({valid_score_range}).*?</S2>"
298318
reason_pattern = r"<S1>(.*?)</S1>"
299319
score_match = re.findall(score_pattern, llm_output, re.DOTALL)
300320
reason_match = re.findall(reason_pattern, llm_output, re.DOTALL)
Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
# ---------------------------------------------------------
2+
# Copyright (c) Microsoft Corporation. All rights reserved.
3+
# ---------------------------------------------------------

0 commit comments

Comments
 (0)