Commit 1f1fc11

Authored by stevepon, singankit, thecsw, pdhotems, and JoseCSantos
Spon/update evals converter (#40215)
* Tool Call Accuracy Evaluator (#40068)
  * Tool Call Accuracy Evaluator
  * Review comments
  * Updating score key and output structure
  * Updating prompt
  * Renaming parameter
* Converter from AI Service threads/runs to evaluator-compatible schema (#40047)
  * WIP AIAgentConverter
  * Added the v1 of the converter
  * Updated the AIAgentConverter with different output schemas
  * Update the top schema to have: query, response, tool_definitions
  * "agentic" is not a recognized word, change the wording
  * System message always comes first in query with multiple runs
  * Add support for getting inputs from local files with run_ids
  * Export AIAgentConverter through azure.ai.evaluation, local read updates
  * Use `from ._models import *`; ruff format
  * Simplify the API by rolling up the static methods and hiding internals
  * Lock ._converters._ai_services behind an import error; print to install azure-ai-projects if we can't import AIAgentConverter
  * By default, include all previous runs' tool calls and results
  * Don't crash if there is no content in historical thread messages
  * Parallelize the calls to get step_details for each run_id
  * Use a single underscore to hide internal static members
* For ComputeInstance and AmlCompute update disableLocalAuth property based on ssh_public_access (#39934)
  * add disableLocalAuth for computeInstance
  * fix disableLocalAuth issue for amlCompute
  * update compute instance and recordings
  * Revert "temp changes" (reverts commit 64e3c38)
  * fix tests
* Adding intent_resolution_evaluator to prp/agent_evaluators branch (#40065)
  * Add intent resolution evaluator and update its logic
  * Remove spurious print statements; address reviewers' feedback
  * add threshold key, update result to pass/fail rather than True/False
  * Add example + remove repeated fields
  * Harden check_score_is_valid function
* Add Task Adherence and Completeness (#40098)
  * Agentic Evaluator - Response Completeness, with changelog entry
  * Task Adherence Agentic Evaluator, with changelog entry and samples
  * Fixing and enhancing contracts for Completeness and Task Adherence Evaluators
  * Update the completeness evaluator response to include threshold comparison
  * Updating the implementation, score type, LLM-output parsing logic, response dict, and exception handling for completeness
  * Use _result_key; changing docstring; add admonition
* Adding bug bash sample and instructions (#40125)
  * Update instructions.md
  * Adding instructions and evaluator to agent evaluation sample
* Add bug bash sample notebook for response completeness evaluator (#40139)
* Sample specific for tool call accuracy evaluator (#40135)
* Add IntentResolution evaluator bug bash notebook (#40144)
  * Sample notebook to demo intent_resolution evaluator
  * Add synthetic data and section on how to test data from disk
  * Update instructions.md and _tool_call_accuracy.py
* Improve task adherence prompt and add sample notebook for bugbash (#40146)
  * Add examples to task_adherence prompt; add Task Adherence sample notebook
  * Add resource prefix for safe secret standard alerts (#40028): add the prefix to identify RGs created in the TME tenant as potentially using local auth and violating our safe secret standards
  * Undo changes to New-TestResources.ps1
  * Add sample .env file
* [AIAgentConverter] Added support for converting entire threads (#40178)
  * Implemented prepare_evaluation_data
  * Add support for retrieving multiple threads into the same file
  * Parallelize thread preparing across threads; set the maximum number of workers in thread pools to 10
* Users/singankit/tool call accuracy evaluator tests (#40190)
  * Raising error when tool call not found
  * Adding unit tests for tool call accuracy evaluator; updating sample
  * update output of converter for tool calls
  * add built-ins; handle file search; fix built-in tool parsing bug; remove local debug files
  * Formatted and updated the converter to avoid built-in tool crashes
  * Added an experimental decorator to AIAgentConverter; update import path for experimental decorator

Co-authored-by: Ankit Singhal <30610298+singankit@users.noreply.github.com>
Co-authored-by: Sandy <16922860+thecsw@users.noreply.github.com>
Co-authored-by: Prashant Dhote <168401122+pdhotems@users.noreply.github.com>
Co-authored-by: Jose Santos <jcsantos@microsoft.com>
Co-authored-by: ghyadav <103428325+ghyadav@users.noreply.github.com>
Co-authored-by: Shiprajain01 <shiprajain01@microsoft.com>
Co-authored-by: ShipraJain01 <103409614+ShipraJain01@users.noreply.github.com>
Co-authored-by: Chandra Sekhar Gupta Aravapalli <caravapalli@microsoft.com>
Co-authored-by: Ankit Singhal <anksing@microsoft.com>
Co-authored-by: Chandra Sekhar Gupta <38103118+guptha23@users.noreply.github.com>
Co-authored-by: Azure SDK Bot <53356347+azure-sdk@users.noreply.github.com>
Co-authored-by: Wes Haggard <Wes.Haggard@microsoft.com>
Co-authored-by: spon <stevenpon@microsoft.com>
Co-authored-by: Sandy Urazayev <surazayev@microsoft.com>
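Several items above mention parallelizing the per-run `step_details` calls under a cap of 10 workers. A minimal stdlib sketch of that pattern, with a hypothetical stand-in for the per-run API call (not the SDK's actual implementation):

```python
from concurrent.futures import ThreadPoolExecutor

_MAX_WORKERS = 10  # cap on simultaneous API calls, mirroring the converter's limit


def fetch_step_details(run_id):
    # Hypothetical stand-in for a per-run service call.
    return {"run_id": run_id, "steps": []}


def fetch_all(run_ids):
    # One task per run_id, with at most _MAX_WORKERS in flight at once;
    # pool.map preserves the input order of run_ids in the results.
    with ThreadPoolExecutor(max_workers=_MAX_WORKERS) as pool:
        return list(pool.map(fetch_step_details, run_ids))


results = fetch_all([f"run_{i}" for i in range(25)])
print(len(results))  # → 25
```

Using `ThreadPoolExecutor` rather than raw threads keeps the concurrency cap declarative and makes the results come back in a deterministic order.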
1 parent 0d93869 commit 1f1fc11

File tree: 4 files changed, +336 −72 lines changed

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_ai_services.py

Lines changed: 46 additions & 49 deletions
@@ -12,6 +12,8 @@
 
 from typing import List, Union
 
+from azure.ai.evaluation._common._experimental import experimental
+
 # Constants.
 from ._models import _USER, _AGENT, _TOOL, _TOOL_CALL, _TOOL_CALLS, _FUNCTION
 
@@ -30,50 +32,47 @@
 # Maximum number of workers allowed to make API calls at the same time.
 _MAX_WORKERS = 10
 
+# Constants to only be used internally in this file for the built-in tools.
+_CODE_INTERPRETER = "code_interpreter"
+_BING_GROUNDING = "bing_grounding"
+_FILE_SEARCH = "file_search"
+
 # Built-in tool descriptions and parameters are hidden, but we include basic descriptions
-# for evaluation purposes
-_BUILT_IN_DESCRIPTIONS = {"code_interpreter": "Use code interpreter to read and interpret information from datasets, "
-                                              "generate code, and create graphs and charts using your data. Supports "
-                                              "up to 20 files.",
-                          "bing_grounding": "Enhance model output with web data.",
-                          "file_search": "Search for data across uploaded files.",
-                          }
-
-_BUILT_IN_PARAMS = {"code_interpreter": {"type": "object",
-                                         "properties": {
-                                             "input": {
-                                                 "type": "string",
-                                                 "description": "Generated code to be executed."
-                                             }
-                                         }
-                                         },
-                    "bing_grounding": {"type": "object",
-                                       "properties": {
-                                           "requesturl": {
-                                               "type": "string",
-                                               "description": "URL used in Bing Search API."
-                                           }
-                                       }
-                                       },
-                    "file_search": {"type": "object",
-                                    "properties": {
-                                        "ranking_options": {
-                                            "type": "object",
-                                            "properties": {
-                                                "ranker": {
-                                                    "type": "string",
-                                                    "description": "Ranking algorithm to use."
-                                                },
-                                                "score_threshold": {
-                                                    "type": "number",
-                                                    "description": "Threshold for search results."
-                                                }
-                                            },
-                                            "description": "Ranking options for search results."
-                                        }
-                                    }
-                                    },
-                    }
+# for evaluation purposes.
+_BUILT_IN_DESCRIPTIONS = {
+    _CODE_INTERPRETER: "Use code interpreter to read and interpret information from datasets, "
+    + "generate code, and create graphs and charts using your data. Supports "
+    + "up to 20 files.",
+    _BING_GROUNDING: "Enhance model output with web data.",
+    _FILE_SEARCH: "Search for data across uploaded files.",
+}
+
+# Built-in tool parameters are hidden, but we include basic parameters for evaluation purposes.
+_BUILT_IN_PARAMS = {
+    _CODE_INTERPRETER: {
+        "type": "object",
+        "properties": {"input": {"type": "string", "description": "Generated code to be executed."}},
+    },
+    _BING_GROUNDING: {
+        "type": "object",
+        "properties": {"requesturl": {"type": "string", "description": "URL used in Bing Search API."}},
+    },
+    _FILE_SEARCH: {
+        "type": "object",
+        "properties": {
+            "ranking_options": {
+                "type": "object",
+                "properties": {
+                    "ranker": {"type": "string", "description": "Ranking algorithm to use."},
+                    "score_threshold": {"type": "number", "description": "Threshold for search results."},
+                },
+                "description": "Ranking options for search results.",
+            }
+        },
+    },
+}
+
+@experimental
 class AIAgentConverter:
     """
     A converter for AI agent data.
@@ -192,6 +191,7 @@ def _extract_function_tool_definitions(thread_run: ThreadRun) -> List[ToolDefinition]:
     """
     final_tools: List[ToolDefinition] = []
     for tool in thread_run.tools:
+        # Here we handle the custom functions and create tool definitions out of them.
         if tool.type == _FUNCTION:
             tool_function: FunctionDefinition = tool.function
             parameters = tool_function.parameters
@@ -208,19 +208,16 @@ def _extract_function_tool_definitions(thread_run: ThreadRun) -> List[ToolDefinition]:
                 )
             )
         else:
-            # add limited support for built-in tools. Descriptions and parameters
+            # Add limited support for built-in tools. Descriptions and parameters
             # are not published, but we'll include placeholders.
-            try:
+            if tool.type in _BUILT_IN_DESCRIPTIONS and tool.type in _BUILT_IN_PARAMS:
                 final_tools.append(
                     ToolDefinition(
                         name=tool.type,
                         description=_BUILT_IN_DESCRIPTIONS[tool.type],
-                        parameters=_BUILT_IN_PARAMS[tool.type]
+                        parameters=_BUILT_IN_PARAMS[tool.type],
                     )
                 )
-            except:
-                # if we run into an unknown tool, don't fail
-                pass
     return final_tools
 
 @staticmethod
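The last hunk above replaces a bare `try/except: pass` with an explicit membership check, so unknown built-in tools are skipped deliberately rather than having all errors silently swallowed. A stdlib-only sketch of that guard pattern, using hypothetical tool names and plain dicts rather than the SDK's `ToolDefinition` model:

```python
# Placeholder metadata for known built-in tools (hypothetical examples).
_DESCRIPTIONS = {"code_interpreter": "Runs generated code.", "file_search": "Searches uploaded files."}
_PARAMS = {"code_interpreter": {"type": "object"}, "file_search": {"type": "object"}}


def extract_tool_definitions(tool_types):
    """Return placeholder definitions only for tools we have metadata for."""
    final_tools = []
    for tool_type in tool_types:
        # Unknown tool types are skipped explicitly; a bug in dict lookup can
        # no longer be masked the way a bare `except: pass` would mask it.
        if tool_type in _DESCRIPTIONS and tool_type in _PARAMS:
            final_tools.append({
                "name": tool_type,
                "description": _DESCRIPTIONS[tool_type],
                "parameters": _PARAMS[tool_type],
            })
    return final_tools


print(extract_tool_definitions(["code_interpreter", "mystery_tool"]))
```

The behavior is the same for known tools, but unexpected exceptions now surface instead of being suppressed.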

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_converters/_models.py

Lines changed: 18 additions & 8 deletions
@@ -182,13 +182,15 @@ def break_tool_call_into_messages(tool_call: ToolCall, run_id: str) -> List[Message]:
     # Treat built-in tools separately. Object models may be unique so handle each case separately
     # Just converting to dicts here rather than custom serializers for simplicity for now.
     # Don't fail if we run into a newly seen tool, just skip
-    if tool_call.details.type == "code_interpreter":
+    if tool_call.details["type"] == "code_interpreter":
         arguments = {"input": tool_call.details.code_interpreter.input}
-    elif tool_call.details.type == "bing_grounding":
-        arguments = {"requesturl": tool_call.details.bing_grounding.requesturl}
-    elif tool_call.details.type == "file_search":
-        options = tool_call.details.file_search.ranking_options
-        arguments = {"ranking_options": {"ranker": options.ranker, "score_threshold": options.score_threshold}}
+    elif tool_call.details["type"] == "bing_grounding":
+        arguments = {"requesturl": tool_call.details["bing_grounding"]["requesturl"]}
+    elif tool_call.details["type"] == "file_search":
+        options = tool_call.details["file_search"]["ranking_options"]
+        arguments = {
+            "ranking_options": {"ranker": options["ranker"], "score_threshold": options["score_threshold"]}
+        }
     else:
         # unsupported tool type, skip
         return messages
@@ -218,9 +220,17 @@ def break_tool_call_into_messages(tool_call: ToolCall, run_id: str) -> List[Message]:
         if tool_call.details.type == "code_interpreter":
             output = tool_call.details.code_interpreter.outputs
         elif tool_call.details.type == "bing_grounding":
-            return messages # not supported yet from bing grounding tool
+            return messages  # not supported yet from bing grounding tool
         elif tool_call.details.type == "file_search":
-            output = [{"file_id": result.file_id, "file_name": result.file_name, "score": result.score, "content": result.content} for result in tool_call.details.file_search.results]
+            output = [
+                {
+                    "file_id": result.file_id,
+                    "file_name": result.file_name,
+                    "score": result.score,
+                    "content": result.content,
+                }
+                for result in tool_call.details.file_search.results
+            ]
     except:
         return messages
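The first hunk in this file switches `details.type` to `details["type"]` (and similarly for nested fields), because some tool-call payloads arrive as plain dicts rather than model objects with attributes. A small stdlib illustration of why that matters, using a hypothetical payload rather than the SDK's model types:

```python
# A built-in tool-call payload represented as a plain dict (hypothetical data).
details = {"type": "bing_grounding", "bing_grounding": {"requesturl": "https://example.com"}}

# Attribute access, as the old code used, raises AttributeError on a dict:
try:
    details.type
except AttributeError as exc:
    print("attribute access fails:", exc)

# Key access, as in the fixed code, works for dict payloads:
if details["type"] == "bing_grounding":
    arguments = {"requesturl": details["bing_grounding"]["requesturl"]}
print(arguments)
```

This is why the fix is a behavioral bug fix rather than a pure style change: the old attribute-style lookups would raise on dict-shaped payloads.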

Lines changed: 110 additions & 0 deletions
New file (full contents):

# Serialization and deserialization helper functions to make test-writing easier

import json
from datetime import datetime

from azure.ai.evaluation._converters._models import ToolCall
from azure.ai.projects.models import RunStepCodeInterpreterToolCall, RunStepCodeInterpreterToolCallDetails, \
    RunStepFileSearchToolCall, RunStepFileSearchToolCallResults, RunStepFileSearchToolCallResult, \
    FileSearchRankingOptions, RunStepBingGroundingToolCall


class ToolDecoder(json.JSONDecoder):
    def __init__(self, *args, **kwargs):
        super().__init__(object_hook=self.object_hook, *args, **kwargs)

    def object_hook(self, obj):
        if 'completed' in obj and 'created' in obj and 'details' in obj:
            return ToolCall(
                created=datetime.fromisoformat(obj['created']),
                completed=datetime.fromisoformat(obj['completed']),
                details=self.decode_details(obj['details'])
            )
        return obj

    def decode_details(self, details):
        if 'id' in details and 'type' in details:
            if details['type'] == 'code_interpreter':
                return RunStepCodeInterpreterToolCall(
                    id=details['id'],
                    code_interpreter=RunStepCodeInterpreterToolCallDetails(
                        input=details['code_interpreter']['input'],
                        outputs=details['code_interpreter']['outputs']
                    )
                )
            elif details['type'] == 'file_search':
                return RunStepFileSearchToolCall(
                    id=details['id'],
                    file_search=RunStepFileSearchToolCallResults(
                        results=[
                            RunStepFileSearchToolCallResult(
                                file_name=result['file_name'],
                                file_id=result['file_id'],
                                score=result['score'],
                                content=result['content']
                            ) for result in details['file_search']['results']
                        ],
                        ranking_options=FileSearchRankingOptions(
                            ranker=details['file_search']['ranking_options']['ranker'],
                            score_threshold=details['file_search']['ranking_options']['score_threshold']
                        )
                    )
                )
            elif details['type'] == 'bing_grounding':
                return RunStepBingGroundingToolCall(
                    id=details['id'],
                    bing_grounding=details['bing_grounding']
                )
        return details


class ToolEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        if isinstance(obj, ToolCall):
            return {
                "completed": obj.completed,
                "created": obj.created,
                "details": obj.details
            }
        if isinstance(obj, RunStepCodeInterpreterToolCall):
            return {
                "id": obj.id,
                # "type": obj.type,
                "code_interpreter": obj.code_interpreter
            }
        if isinstance(obj, RunStepCodeInterpreterToolCallDetails):
            return {
                "input": obj.input,
                "outputs": obj.outputs
            }
        if isinstance(obj, RunStepFileSearchToolCall):
            return {
                "id": obj.id,
                # "type": obj.type,
                "file_search": obj.file_search
            }
        if isinstance(obj, RunStepFileSearchToolCallResults):
            return {
                "results": obj.results,
                "ranking_options": obj.ranking_options
            }
        if isinstance(obj, RunStepFileSearchToolCallResult):
            return {
                "file_name": obj.file_name,
                "file_id": obj.file_id,
                "score": obj.score,
                "content": obj.content
            }
        if isinstance(obj, FileSearchRankingOptions):
            return {
                "ranker": obj.ranker,
                "score_threshold": obj.score_threshold
            }
        if isinstance(obj, RunStepBingGroundingToolCall):
            return {
                "id": obj.id,
                # "type": obj.type,
                "bing_grounding": obj.bing_grounding
            }
        return super().default(obj)
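These helpers are meant to be passed to the standard `json` entry points as `json.dumps(..., cls=ToolEncoder)` and `json.loads(..., cls=ToolDecoder)`. A stdlib-only sketch of the same round-trip pattern, using a hypothetical stand-in record type instead of the SDK models so it runs without azure-ai-projects:

```python
import json
from datetime import datetime


class Record:
    """Stand-in for ToolCall: a created timestamp plus a details payload."""
    def __init__(self, created, details):
        self.created = created
        self.details = details


class RecordEncoder(json.JSONEncoder):
    def default(self, obj):
        # Serialize datetimes as ISO strings and Records as marker-keyed dicts.
        if isinstance(obj, datetime):
            return obj.isoformat()
        if isinstance(obj, Record):
            return {"created": obj.created, "details": obj.details}
        return super().default(obj)


class RecordDecoder(json.JSONDecoder):
    def __init__(self, *args, **kwargs):
        super().__init__(object_hook=self.object_hook, *args, **kwargs)

    def object_hook(self, obj):
        # Rebuild a Record whenever the marker keys are present.
        if "created" in obj and "details" in obj:
            return Record(datetime.fromisoformat(obj["created"]), obj["details"])
        return obj


original = Record(datetime(2025, 1, 1, 12, 0), {"type": "code_interpreter"})
payload = json.dumps(original, cls=RecordEncoder)
restored = json.loads(payload, cls=RecordDecoder)
print(restored.created, restored.details)
```

The `object_hook` is invoked on every decoded JSON object, innermost first, which is why the real `ToolDecoder` checks for its marker keys before reconstructing a `ToolCall`.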
