Commit 1f1fc11

Authored by stevepon, singankit, thecsw, pdhotems, and JoseCSantos
Spon/update evals converter (#40215)
* Tool Call Accuracy Evaluator (#40068)
  * Tool Call Accuracy Evaluator
  * Review comments
  * Updating score key and output structure
  * Updating prompt
  * Renaming parameter
* Converter from AI Service threads/runs to evaluator-compatible schema (#40047)
  * WIP AIAgentConverter
  * Added the v1 of the converter
  * Updated the AIAgentConverter with different output schemas
  * Update the top schema to have: query, response, tool_definitions
  * "agentic" is not a recognized word, change the wording
  * System message always comes first in query with multiple runs
  * Add support for getting inputs from local files with run_ids
  * Export AIAgentConverter through azure.ai.evaluation, local read updates
  * Use `from ._models import *`; ruff format
  * Simplify the API by rolling up the static methods and hiding internals
  * Lock ._converters._ai_services behind an import error; print to install azure-ai-projects if we can't import AIAgentConverter
  * By default, include all previous runs' tool calls and results
  * Don't crash if there is no content in historical thread messages
  * Parallelize the calls to get step_details for each run_id
  * Use a single underscore to hide internal static members
* For ComputeInstance and AmlCompute update disableLocalAuth property based on ssh_public_access (#39934)
  * add disableLocalAuth for computeInstance
  * fix disableLocalAuth issue for amlCompute
  * update compute instance and recordings
  * Revert "temp changes" (reverts commit 64e3c38)
  * fix tests
* Adding intent_resolution_evaluator to prp/agent_evaluators branch (#40065)
  * Add intent resolution evaluator and update its logic
  * Remove spurious print statements; address reviewers' feedback
  * add threshold key, update result to pass/fail rather than True/False
  * Add example + remove repeated fields
  * Harden check_score_is_valid function
* Add Task Adherence and Completeness (#40098)
  * Agentic Evaluator - Response Completeness, with changelog entry
  * Task Adherence Agentic Evaluator, with changelog entry and samples
  * Fixing and enhancing contracts for Completeness and Task Adherence Evaluators
  * Update the completeness evaluator response to include threshold comparison
  * Updating the implementation, score type, LLM-output parsing logic, response dict, and exception handling for completeness
  * Use _result_key; changing docstring; add admonition
* Adding bug bash sample and instructions (#40125)
  * Update instructions.md
  * Adding instructions and evaluator to agent evaluation sample
* Add bug bash sample notebook for response completeness evaluator (#40139)
* Sample specific for tool call accuracy evaluator (#40135)
* Add IntentResolution evaluator bug bash notebook (#40144)
  * Sample notebook to demo intent_resolution evaluator
  * Add synthetic data and section on how to test data from disk
  * Update instructions.md and _tool_call_accuracy.py
* Improve task adherence prompt and add sample notebook for bugbash (#40146)
  * Add examples to task_adherence prompt; add Task Adherence sample notebook
  * Add resource prefix for safe secret standard alerts (#40028): add the prefix to identify RGs created in the TME tenant as potentially using local auth and violating our safe secret standards
  * Undo changes to New-TestResources.ps1
  * Add sample .env file
* [AIAgentConverter] Added support for converting entire threads (#40178)
  * Implemented prepare_evaluation_data
  * Add support for retrieving multiple threads into the same file
  * Parallelize thread preparing across threads; set the maximum number of workers in thread pools to 10
* Users/singankit/tool call accuracy evaluator tests (#40190)
  * Raising error when tool call not found
  * Adding unit tests for tool call accuracy evaluator; updating sample
  * update output of converter for tool calls
  * add built-ins; handle file search; fix built-in tool parsing bug; remove local debug files
  * Formatted and updated the converter to avoid built-in tool crashes
  * Added an experimental decorator to AIAgentConverter; update import path for experimental decorator

Co-authored-by: Ankit Singhal <30610298+singankit@users.noreply.github.com>
Co-authored-by: Sandy <16922860+thecsw@users.noreply.github.com>
Co-authored-by: Prashant Dhote <168401122+pdhotems@users.noreply.github.com>
Co-authored-by: Jose Santos <jcsantos@microsoft.com>
Co-authored-by: ghyadav <103428325+ghyadav@users.noreply.github.com>
Co-authored-by: Shiprajain01 <shiprajain01@microsoft.com>
Co-authored-by: ShipraJain01 <103409614+ShipraJain01@users.noreply.github.com>
Co-authored-by: Chandra Sekhar Gupta Aravapalli <caravapalli@microsoft.com>
Co-authored-by: Ankit Singhal <anksing@microsoft.com>
Co-authored-by: Chandra Sekhar Gupta <38103118+guptha23@users.noreply.github.com>
Co-authored-by: Azure SDK Bot <53356347+azure-sdk@users.noreply.github.com>
Co-authored-by: Wes Haggard <Wes.Haggard@microsoft.com>
Co-authored-by: spon <stevenpon@microsoft.com>
Co-authored-by: Sandy Urazayev <surazayev@microsoft.com>
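Several items above mention parallelizing the per-run `step_details` calls under a cap of 10 workers. A minimal stdlib sketch of that pattern, with a hypothetical stand-in for the per-run API call (not the SDK's actual implementation):

```python
from concurrent.futures import ThreadPoolExecutor

_MAX_WORKERS = 10  # cap on simultaneous API calls, mirroring the converter's limit


def fetch_step_details(run_id):
    # Hypothetical stand-in for a per-run service call.
    return {"run_id": run_id, "steps": []}


def fetch_all(run_ids):
    # One task per run_id, with at most _MAX_WORKERS in flight at once;
    # pool.map preserves the input order of run_ids in the results.
    with ThreadPoolExecutor(max_workers=_MAX_WORKERS) as pool:
        return list(pool.map(fetch_step_details, run_ids))


results = fetch_all([f"run_{i}" for i in range(25)])
print(len(results))  # → 25
```

Using `ThreadPoolExecutor` rather than raw threads keeps the concurrency cap declarative and makes the results come back in a deterministic order.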
1 parent 0d93869 commit 1f1fc11

File tree: 4 files changed, +336 −72 lines changed

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_ai_services.py

Lines changed: 46 additions & 49 deletions
@@ -12,6 +12,8 @@
 
 from typing import List, Union
 
+from azure.ai.evaluation._common._experimental import experimental
+
 # Constants.
 from ._models import _USER, _AGENT, _TOOL, _TOOL_CALL, _TOOL_CALLS, _FUNCTION
 
@@ -30,50 +32,47 @@
 # Maximum number of workers allowed to make API calls at the same time.
 _MAX_WORKERS = 10
 
+# Constants to only be used internally in this file for the built-in tools.
+_CODE_INTERPRETER = "code_interpreter"
+_BING_GROUNDING = "bing_grounding"
+_FILE_SEARCH = "file_search"
+
 # Built-in tool descriptions and parameters are hidden, but we include basic descriptions
-# for evaluation purposes
-_BUILT_IN_DESCRIPTIONS = {"code_interpreter": "Use code interpreter to read and interpret information from datasets, "
-                                              "generate code, and create graphs and charts using your data. Supports "
-                                              "up to 20 files.",
-                          "bing_grounding": "Enhance model output with web data.",
-                          "file_search": "Search for data across uploaded files.",
-                          }
-
-_BUILT_IN_PARAMS = {"code_interpreter": {"type": "object",
-                                         "properties": {
-                                             "input": {
-                                                 "type": "string",
-                                                 "description": "Generated code to be executed."
-                                             }
-                                         }
-                                         },
-                    "bing_grounding": {"type": "object",
-                                       "properties": {
-                                           "requesturl": {
-                                               "type": "string",
-                                               "description": "URL used in Bing Search API."
-                                           }
-                                       }
-                                       },
-                    "file_search": {"type": "object",
-                                    "properties": {
-                                        "ranking_options": {
-                                            "type": "object",
-                                            "properties": {
-                                                "ranker": {
-                                                    "type": "string",
-                                                    "description": "Ranking algorithm to use."
-                                                },
-                                                "score_threshold": {
-                                                    "type": "number",
-                                                    "description": "Threshold for search results."
-                                                }
-                                            },
-                                            "description": "Ranking options for search results."
-                                        }
-                                    }
-                                    },
-                    }
+# for evaluation purposes.
+_BUILT_IN_DESCRIPTIONS = {
+    _CODE_INTERPRETER: "Use code interpreter to read and interpret information from datasets, "
+    + "generate code, and create graphs and charts using your data. Supports "
+    + "up to 20 files.",
+    _BING_GROUNDING: "Enhance model output with web data.",
+    _FILE_SEARCH: "Search for data across uploaded files.",
+}
+
+# Built-in tool parameters are hidden, but we include basic parameters for evaluation purposes.
+_BUILT_IN_PARAMS = {
+    _CODE_INTERPRETER: {
+        "type": "object",
+        "properties": {"input": {"type": "string", "description": "Generated code to be executed."}},
+    },
+    _BING_GROUNDING: {
+        "type": "object",
+        "properties": {"requesturl": {"type": "string", "description": "URL used in Bing Search API."}},
+    },
+    _FILE_SEARCH: {
+        "type": "object",
+        "properties": {
+            "ranking_options": {
+                "type": "object",
+                "properties": {
+                    "ranker": {"type": "string", "description": "Ranking algorithm to use."},
+                    "score_threshold": {"type": "number", "description": "Threshold for search results."},
+                },
+                "description": "Ranking options for search results.",
+            }
+        },
+    },
+}
+
+@experimental
 class AIAgentConverter:
     """
     A converter for AI agent data.
@@ -192,6 +191,7 @@ def _extract_function_tool_definitions(thread_run: ThreadRun) -> List[ToolDefinition]:
     """
     final_tools: List[ToolDefinition] = []
     for tool in thread_run.tools:
+        # Here we handle the custom functions and create tool definitions out of them.
         if tool.type == _FUNCTION:
             tool_function: FunctionDefinition = tool.function
             parameters = tool_function.parameters
@@ -208,19 +208,16 @@ def _extract_function_tool_definitions(thread_run: ThreadRun) -> List[ToolDefinition]:
                 )
             )
         else:
-            # add limited support for built-in tools. Descriptions and parameters
+            # Add limited support for built-in tools. Descriptions and parameters
             # are not published, but we'll include placeholders.
-            try:
+            if tool.type in _BUILT_IN_DESCRIPTIONS and tool.type in _BUILT_IN_PARAMS:
                 final_tools.append(
                     ToolDefinition(
                         name=tool.type,
                         description=_BUILT_IN_DESCRIPTIONS[tool.type],
-                        parameters=_BUILT_IN_PARAMS[tool.type]
+                        parameters=_BUILT_IN_PARAMS[tool.type],
                     )
                 )
-            except:
-                # if we run into an unknown tool, don't fail
-                pass
     return final_tools
 
 @staticmethod
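The last hunk above replaces a bare `try/except: pass` with an explicit membership check, so unknown built-in tools are skipped deliberately rather than having all errors silently swallowed. A stdlib-only sketch of that guard pattern, using hypothetical tool names and plain dicts rather than the SDK's `ToolDefinition` model:

```python
# Placeholder metadata for known built-in tools (hypothetical examples).
_DESCRIPTIONS = {"code_interpreter": "Runs generated code.", "file_search": "Searches uploaded files."}
_PARAMS = {"code_interpreter": {"type": "object"}, "file_search": {"type": "object"}}


def extract_tool_definitions(tool_types):
    """Return placeholder definitions only for tools we have metadata for."""
    final_tools = []
    for tool_type in tool_types:
        # Unknown tool types are skipped explicitly; a bug in dict lookup can
        # no longer be masked the way a bare `except: pass` would mask it.
        if tool_type in _DESCRIPTIONS and tool_type in _PARAMS:
            final_tools.append({
                "name": tool_type,
                "description": _DESCRIPTIONS[tool_type],
                "parameters": _PARAMS[tool_type],
            })
    return final_tools


print(extract_tool_definitions(["code_interpreter", "mystery_tool"]))
```

The behavior is the same for known tools, but unexpected exceptions now surface instead of being suppressed.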

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_models.py

Lines changed: 18 additions & 8 deletions
@@ -182,13 +182,15 @@ def break_tool_call_into_messages(tool_call: ToolCall, run_id: str) -> List[Message]:
     # Treat built-in tools separately. Object models may be unique so handle each case separately
     # Just converting to dicts here rather than custom serializers for simplicity for now.
     # Don't fail if we run into a newly seen tool, just skip
-    if tool_call.details.type == "code_interpreter":
+    if tool_call.details["type"] == "code_interpreter":
         arguments = {"input": tool_call.details.code_interpreter.input}
-    elif tool_call.details.type == "bing_grounding":
-        arguments = {"requesturl": tool_call.details.bing_grounding.requesturl}
-    elif tool_call.details.type == "file_search":
-        options = tool_call.details.file_search.ranking_options
-        arguments = {"ranking_options": {"ranker": options.ranker, "score_threshold": options.score_threshold}}
+    elif tool_call.details["type"] == "bing_grounding":
+        arguments = {"requesturl": tool_call.details["bing_grounding"]["requesturl"]}
+    elif tool_call.details["type"] == "file_search":
+        options = tool_call.details["file_search"]["ranking_options"]
+        arguments = {
+            "ranking_options": {"ranker": options["ranker"], "score_threshold": options["score_threshold"]}
+        }
     else:
         # unsupported tool type, skip
         return messages
@@ -218,9 +220,17 @@ def break_tool_call_into_messages(tool_call: ToolCall, run_id: str) -> List[Message]:
         if tool_call.details.type == "code_interpreter":
             output = tool_call.details.code_interpreter.outputs
         elif tool_call.details.type == "bing_grounding":
-            return messages # not supported yet from bing grounding tool
+            return messages  # not supported yet from bing grounding tool
         elif tool_call.details.type == "file_search":
-            output = [{"file_id": result.file_id, "file_name": result.file_name, "score": result.score, "content": result.content} for result in tool_call.details.file_search.results]
+            output = [
+                {
+                    "file_id": result.file_id,
+                    "file_name": result.file_name,
+                    "score": result.score,
+                    "content": result.content,
+                }
+                for result in tool_call.details.file_search.results
+            ]
     except:
         return messages
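The first hunk in this file switches `details.type` to `details["type"]` (and similarly for nested fields), because some tool-call payloads arrive as plain dicts rather than model objects with attributes. A small stdlib illustration of why that matters, using a hypothetical payload rather than the SDK's model types:

```python
# A built-in tool-call payload represented as a plain dict (hypothetical data).
details = {"type": "bing_grounding", "bing_grounding": {"requesturl": "https://example.com"}}

# Attribute access, as the old code used, raises AttributeError on a dict:
try:
    details.type
except AttributeError as exc:
    print("attribute access fails:", exc)

# Key access, as in the fixed code, works for dict payloads:
if details["type"] == "bing_grounding":
    arguments = {"requesturl": details["bing_grounding"]["requesturl"]}
print(arguments)
```

This is why the fix is a behavioral bug fix rather than a pure style change: the old attribute-style lookups would raise on dict-shaped payloads.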

Lines changed: 110 additions & 0 deletions
New file (full contents):

# Serialization and deserialization helper functions to make test-writing easier

import json
from datetime import datetime

from azure.ai.evaluation._converters._models import ToolCall
from azure.ai.projects.models import RunStepCodeInterpreterToolCall, RunStepCodeInterpreterToolCallDetails, \
    RunStepFileSearchToolCall, RunStepFileSearchToolCallResults, RunStepFileSearchToolCallResult, \
    FileSearchRankingOptions, RunStepBingGroundingToolCall


class ToolDecoder(json.JSONDecoder):
    def __init__(self, *args, **kwargs):
        super().__init__(object_hook=self.object_hook, *args, **kwargs)

    def object_hook(self, obj):
        if 'completed' in obj and 'created' in obj and 'details' in obj:
            return ToolCall(
                created=datetime.fromisoformat(obj['created']),
                completed=datetime.fromisoformat(obj['completed']),
                details=self.decode_details(obj['details'])
            )
        return obj

    def decode_details(self, details):
        if 'id' in details and 'type' in details:
            if details['type'] == 'code_interpreter':
                return RunStepCodeInterpreterToolCall(
                    id=details['id'],
                    code_interpreter=RunStepCodeInterpreterToolCallDetails(
                        input=details['code_interpreter']['input'],
                        outputs=details['code_interpreter']['outputs']
                    )
                )
            elif details['type'] == 'file_search':
                return RunStepFileSearchToolCall(
                    id=details['id'],
                    file_search=RunStepFileSearchToolCallResults(
                        results=[
                            RunStepFileSearchToolCallResult(
                                file_name=result['file_name'],
                                file_id=result['file_id'],
                                score=result['score'],
                                content=result['content']
                            ) for result in details['file_search']['results']
                        ],
                        ranking_options=FileSearchRankingOptions(
                            ranker=details['file_search']['ranking_options']['ranker'],
                            score_threshold=details['file_search']['ranking_options']['score_threshold']
                        )
                    )
                )
            elif details['type'] == 'bing_grounding':
                return RunStepBingGroundingToolCall(
                    id=details['id'],
                    bing_grounding=details['bing_grounding']
                )
        return details


class ToolEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        if isinstance(obj, ToolCall):
            return {
                "completed": obj.completed,
                "created": obj.created,
                "details": obj.details
            }
        if isinstance(obj, RunStepCodeInterpreterToolCall):
            return {
                "id": obj.id,
                # "type": obj.type,
                "code_interpreter": obj.code_interpreter
            }
        if isinstance(obj, RunStepCodeInterpreterToolCallDetails):
            return {
                "input": obj.input,
                "outputs": obj.outputs
            }
        if isinstance(obj, RunStepFileSearchToolCall):
            return {
                "id": obj.id,
                # "type": obj.type,
                "file_search": obj.file_search
            }
        if isinstance(obj, RunStepFileSearchToolCallResults):
            return {
                "results": obj.results,
                "ranking_options": obj.ranking_options
            }
        if isinstance(obj, RunStepFileSearchToolCallResult):
            return {
                "file_name": obj.file_name,
                "file_id": obj.file_id,
                "score": obj.score,
                "content": obj.content
            }
        if isinstance(obj, FileSearchRankingOptions):
            return {
                "ranker": obj.ranker,
                "score_threshold": obj.score_threshold
            }
        if isinstance(obj, RunStepBingGroundingToolCall):
            return {
                "id": obj.id,
                # "type": obj.type,
                "bing_grounding": obj.bing_grounding
            }
        return super().default(obj)
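These helpers are meant to be passed to the standard `json` entry points as `json.dumps(..., cls=ToolEncoder)` and `json.loads(..., cls=ToolDecoder)`. A stdlib-only sketch of the same round-trip pattern, using a hypothetical stand-in record type instead of the SDK models so it runs without azure-ai-projects:

```python
import json
from datetime import datetime


class Record:
    """Stand-in for ToolCall: a created timestamp plus a details payload."""
    def __init__(self, created, details):
        self.created = created
        self.details = details


class RecordEncoder(json.JSONEncoder):
    def default(self, obj):
        # Serialize datetimes as ISO strings and Records as marker-keyed dicts.
        if isinstance(obj, datetime):
            return obj.isoformat()
        if isinstance(obj, Record):
            return {"created": obj.created, "details": obj.details}
        return super().default(obj)


class RecordDecoder(json.JSONDecoder):
    def __init__(self, *args, **kwargs):
        super().__init__(object_hook=self.object_hook, *args, **kwargs)

    def object_hook(self, obj):
        # Rebuild a Record whenever the marker keys are present.
        if "created" in obj and "details" in obj:
            return Record(datetime.fromisoformat(obj["created"]), obj["details"])
        return obj


original = Record(datetime(2025, 1, 1, 12, 0), {"type": "code_interpreter"})
payload = json.dumps(original, cls=RecordEncoder)
restored = json.loads(payload, cls=RecordDecoder)
print(restored.created, restored.details)
```

The `object_hook` is invoked on every decoded JSON object, innermost first, which is why the real `ToolDecoder` checks for its marker keys before reconstructing a `ToolCall`.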
