
Conversation

@dwnoble (Contributor) commented Oct 17, 2025

This PR refactors our agent evaluations to provide persistent reports for all eval runs.

Previously, the AgentEvaluator would run evaluations and immediately fail the pytest run if any evaluation failed. Results were only printed to stdout, making it difficult to debug or see a consolidated view of all test outcomes.

New approach:

  1. New evaluation tool based on ADK's AgentEvaluator: A local version of the AgentEvaluator is added that runs tests and returns a pd.DataFrame of the results rather than failing the test run directly.
  2. Decouples Testing from Reporting: The pytest test (test_tools.py) now:
    • Receives the DataFrame from the evaluator.
    • Asserts against the DataFrame and uses pytest.fail() to fail the test (if necessary).
    • Collects the DataFrame from all parameterized test runs into a class variable.
  3. Generates Full Reports: A teardown_class method in pytest combines all collected DataFrames into a single, comprehensive report, saved as both a styled HTML file and a CSV file (points 2 and 3 are sketched in the example below).
  4. Uploads Artifacts: The CI workflow (evals.yaml) is updated to upload these reports as a build artifact on every run (even on failure) using if: always().

This gives us visibility into all evaluation results, making it possible to track pass/fail rates, see average scores, and debug specific failures by inspecting the full context.
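
To make points 2 and 3 concrete, here is a minimal, self-contained sketch of the collect-then-report pattern. The run_evaluation helper, column names, and output paths are illustrative assumptions, not the exact code in this PR:

# Minimal sketch of the collect-then-report pattern. run_evaluation, the column
# names, and the output paths are illustrative assumptions, not the PR's exact code.
from pathlib import Path

import pandas as pd
import pytest


def run_evaluation(eval_file: str) -> pd.DataFrame:
    """Stand-in for the local AgentEvaluator: returns per-case results as a DataFrame."""
    return pd.DataFrame(
        [{"eval_file": eval_file, "overall_eval_status": "PASSED", "average_tool_call_score": 1.0}]
    )


class TestToolCalls:
    # Shared across all parameterized runs in this class.
    collected_results: list[pd.DataFrame] = []

    @pytest.mark.parametrize("eval_file", ["place_params.test.json"])
    def test_tool_usage(self, eval_file: str) -> None:
        results_df = run_evaluation(eval_file)
        self.collected_results.append(results_df)

        # Fail the individual test, but only after the results are collected.
        failures = results_df[results_df["overall_eval_status"] == "FAILED"]
        if not failures.empty:
            pytest.fail(f"{len(failures)} evaluation(s) failed for {eval_file}")

    @classmethod
    def teardown_class(cls) -> None:
        # Runs once after all parameterized tests: combine results and persist the report.
        if not cls.collected_results:
            return
        report = pd.concat(cls.collected_results, ignore_index=True)
        out_dir = Path("eval_reports")
        out_dir.mkdir(exist_ok=True)
        report.to_csv(out_dir / "evaluation-report.csv", index=False)
        report.style.to_html(out_dir / "evaluation-report.html")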

Unrelated changes

This PR also adds an agent instruction for lowercasing indicator names in tool calls. This improves overall agent evaluation scores, since some evaluations were failing due to capitalized indicator names.

1. **Indicator Name Lowercased**: Ensure that indicator related arguments like `indicator_name` are never capitalized in tool calls. For example, use "query": "population" instead of "query": "Population".

📊 How to View the Report

  1. Wait for the "Run Agent Evals" CI check to complete.
  2. Go to the "View details" page for the workflow run.
  3. Scroll down to the "Upload Evaluation Reports" section.
  4. Use the Artifact Download URL to download the agent-eval-reports.
  5. Unzip the file and open the evaluation-report-....html file in your browser.

[Screenshot: View job details (open-eval-details)]

[Screenshot: Download report artifacts (download-evals)]

[Screenshot: View report (evals)]

Example artifacts: https://github.com/dwnoble/agent-toolkit/actions/runs/19450027468/artifacts/4595914364

@dwnoble dwnoble marked this pull request as ready for review October 17, 2025 21:32
@dwnoble dwnoble requested review from clincoln8 and keyurva October 17, 2025 21:32
@dwnoble (Contributor, Author) commented Nov 27, 2025

> Thanks Dan!!
>
> The folder structure is a little unintuitive for me. Here's one suggestion for reorganizing; this could be a follow-up or ignored if it doesn't sound good.

evals/
 |--evaluator_framework/                       # Core infrastructure for running agents and validations
 |   |--runner.py               # Handles the lifecycle: spins up the agent, mocks the MCP server, and captures tool calls
 |   |--evaluator.py            # Logic to compare the agent's actual tool calls against the expected schema/behavior
 |   |--base_test.py            # Base class containing common setup/teardown logic (e.g., initializing the runner)
 |   |--types.py                # Pydantic data models
 |
 |--tool_call_evals/            # Implementation of specific test suites for tool usage
     |--data/
     |   |--get_observations/
     |   |   |--place_params.test.json
     |   |
     |   |--orchestration/
     |       |--search_then_fetch.test.json
     |
     |--prompts.py              # Stores system instructions
     |--test_tool_usage.py      # The entry point that creates the agent, iterates through the data files and runs each test; inherits from base_test class

Agreed, my structure is a little unclear. I played around with it and, using your suggestion as a guide, came up with:

evals/
 |--evaluator_framework/  
 |   |--runner.py
 |   |--evaluator.py
 |   |--types.py
 |
 |--tool_call_evals/
     |--agent.py
     |--instructions.py
     |--test_tool_usage.py
     |
     |--data/
         |--get_observations/
         |   |--date_params.test.json
         |   |--place_params.test.json
         |   |--source_params.test.json
         |
         |--orchestration/
             |--search_then_fetch.test.json

Notes:

  • I ended up leaving out the base_test class for now since there's only a single test file
  • I kept an agent.py file and the instructions.py file, but I'm happy to move those around if you feel strongly

Wdyt?

@dwnoble (Contributor, Author) commented Nov 27, 2025

Thanks @clincoln8 !

I made those changes, and also:

  • Fixed a bug where multiple runs were re-using the same agent session
  • Updated the agent evaluator to run repeated agent runs in parallel (roughly the pattern sketched below)
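
For illustration, the fresh-session-per-run and parallel repetition changes follow the pattern below; run_single_eval, the session handling, and the return shape are assumptions, not the evaluator's actual code.

# Illustrative sketch: each repetition gets its own session, and repetitions run
# concurrently. Function names and fields are assumptions, not the evaluator's actual code.
import asyncio
import uuid


async def run_single_eval(prompt: str) -> dict:
    """One agent run; a fresh session id per run avoids state leaking between runs."""
    session_id = str(uuid.uuid4())
    await asyncio.sleep(0)  # placeholder for the actual agent invocation
    return {"session_id": session_id, "prompt": prompt, "tool_calls": []}


async def run_repeated_evals(prompt: str, num_runs: int = 4) -> list[dict]:
    """Run the same eval case several times in parallel and collect all results."""
    return await asyncio.gather(*(run_single_eval(prompt) for _ in range(num_runs)))


if __name__ == "__main__":
    results = asyncio.run(run_repeated_evals("what was the population of canada in 2020?"))
    print(f"{len(results)} runs completed")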

@clincoln8 clincoln8 self-requested a review December 1, 2025 17:45
@clincoln8 (Contributor) left a comment

Woohoo, thank you!

evaluation_score: EvaluationScore


class EvaluationDataFrameRow(BaseModel):
@clincoln8 (Contributor):

To reduce the amount of hardcoded field names, you could do something like the snippet below, where you label each field with json_schema_extra metadata and add a class method to retrieve those field names.

But that might be over complicating this and leaving as hardcoded strings might be the simpler option?

from typing import Annotated, Any, Dict, List
from pydantic import BaseModel, Field

# 1. Define Reusable Tagged Types
StatusStr = Annotated[str | None, Field(json_schema_extra={"style": "status"})]
ScoreFloat = Annotated[float | None, Field(json_schema_extra={"style": "score", "format_str": "{:.3f}"})]
ThresholdFloat = Annotated[float | None, Field(json_schema_extra={"format_str": "{:.3f}"})]

class EvaluationDataFrameRow(BaseModel):
    overall_eval_status: StatusStr
    # ...
    average_tool_call_score: ScoreFloat
    # ...
    tool_call_score_threshold: ThresholdFloat
    
    # 2. Define the helper as a class method
    @classmethod
    def get_field_names(cls, tag_key: str, tag_value: str) -> List[str]:
        """Returns list of column names matching a metadata key/value pair."""
        return [
            name for name, field_info in cls.model_fields.items()
            if field_info.json_schema_extra and field_info.json_schema_extra.get(tag_key) == tag_value
        ]

    @classmethod
    def get_metadata_map(cls, key: str) -> Dict[str, Any]:
        """
        Generic extractor: Returns a dict of {field_name: tag_value} 
        for any field that has the specified metadata key.
        """
        result = {}
        for name, field in cls.model_fields.items():
            if field.json_schema_extra:
                val = field.json_schema_extra.get(key)
                if val is not None:
                    result[name] = val
        return result

# --- Usage ---

format_dict = EvaluationDataFrameRow.get_metadata_map("format_str")
# ...
df.style.apply(
    style_status,
    subset=EvaluationDataFrameRow.get_field_names("style", "status"),
)
# ...

@dwnoble (Contributor, Author):

I took this as inspiration and added some enums to define the "style", "status", and "format_str" strings as well. Let me know what you think. I also tacked on a sticky header to the report (the header scrolling out of view had been bothering me before).
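
For reference, the enum approach could look roughly like the sketch below; the enum and field names here are guesses rather than the exact ones committed in the PR.

# Rough sketch of replacing hardcoded metadata strings with enums.
# Enum and field names are illustrative guesses, not the PR's exact definitions.
from enum import Enum

from pydantic import BaseModel, Field


class MetadataKey(str, Enum):
    STYLE = "style"
    FORMAT_STR = "format_str"


class CellStyle(str, Enum):
    STATUS = "status"
    SCORE = "score"


class EvaluationDataFrameRow(BaseModel):
    overall_eval_status: str | None = Field(
        default=None, json_schema_extra={MetadataKey.STYLE.value: CellStyle.STATUS.value}
    )
    average_tool_call_score: float | None = Field(
        default=None,
        json_schema_extra={
            MetadataKey.STYLE.value: CellStyle.SCORE.value,
            MetadataKey.FORMAT_STR.value: "{:.3f}",
        },
    )

The get_field_names/get_metadata_map helpers from the suggestion above could then accept MetadataKey members instead of raw strings.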

pyproject.toml Outdated
]
"TC002", # Ignore unused "if TYPE_CHECKING" imports in test files
]
"packages/*/evals/**/*.py" = [
@clincoln8 (Contributor):

nit

Suggested change
- "packages/*/evals/**/*.py" = [
+ "packages/*/evals/**/test_*.py" = [

If these exceptions are needed for the eval framework files, then update the comments to reflect why we should ignore these rules for those files.

@dwnoble (Contributor, Author):

Good catch. These exceptions are only required for test cases. Updated the type checking rule along with some fixes in the eval framework that were previously skipped by ruff.

pyproject.toml Outdated
"ANN001", # Ignore missing type annotations in test files
"ANN201", # Ignore missing return types in public test functions
"ANN202", # Ignore missing return types in private test functions,
"TC002", # Ignore unused "if TYPE_CHECKING" imports in test files
@clincoln8 (Contributor):

Out of curiosity, why ignore TC002 in test files? Was there a particular example that was annoying?

@dwnoble (Contributor, Author):

I removed this line and the unit tests & evals still ran, so I'll tack that onto this PR.

Co-authored-by: Christie Ellks <calinc@google.com>
@clincoln8 (Contributor) left a comment

Looks great, thank you!!

@dwnoble dwnoble merged commit d624acc into datacommonsorg:main Dec 8, 2025
6 of 7 checks passed