
Conversation

@dwnoble (Contributor) commented Oct 17, 2025

This PR refactors our agent evaluations to provide persistent reports for all eval runs.

Previously, the AgentEvaluator would run evaluations and immediately fail the pytest run if any evaluation failed. Results were only printed to stdout, making it difficult to debug or see a consolidated view of all test outcomes.

New approach:

  1. New evaluation tool based on ADK's AgentEvaluator: A local version of the AgentEvaluator is added that runs tests and returns a pd.DataFrame of the results rather than failing the test run directly.
  2. Decouples Testing from Reporting: The pytest test (test_tools.py) now:
    • Receives the DataFrame from the evaluator.
    • Asserts against the DataFrame and uses pytest.fail() to fail the test (if necessary).
    • Collects the DataFrame from all parameterized test runs into a class variable.
  3. Generates Full Reports: A teardown_class method in pytest combines all collected DataFrames into a single, comprehensive report, saved as both a styled HTML file and a CSV file (points 2 and 3 are sketched in the example below).
  4. Uploads Artifacts: The CI workflow (evals.yaml) is updated to upload these reports as a build artifact on every run (even on failure) using if: always().

This gives us visibility into all evaluation results, making it possible to track pass/fail rates, see average scores, and debug specific failures by inspecting the full context.
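
To make points 2 and 3 concrete, here is a minimal, self-contained sketch of the collect-then-report pattern. The run_evaluation helper, column names, and output paths are illustrative assumptions, not the exact code in this PR:

# Minimal sketch of the collect-then-report pattern. run_evaluation, the column
# names, and the output paths are illustrative assumptions, not the PR's exact code.
from pathlib import Path

import pandas as pd
import pytest


def run_evaluation(eval_file: str) -> pd.DataFrame:
    """Stand-in for the local AgentEvaluator: returns per-case results as a DataFrame."""
    return pd.DataFrame(
        [{"eval_file": eval_file, "overall_eval_status": "PASSED", "average_tool_call_score": 1.0}]
    )


class TestToolCalls:
    # Shared across all parameterized runs in this class.
    collected_results: list[pd.DataFrame] = []

    @pytest.mark.parametrize("eval_file", ["place_params.test.json"])
    def test_tool_usage(self, eval_file: str) -> None:
        results_df = run_evaluation(eval_file)
        self.collected_results.append(results_df)

        # Fail the individual test, but only after the results are collected.
        failures = results_df[results_df["overall_eval_status"] == "FAILED"]
        if not failures.empty:
            pytest.fail(f"{len(failures)} evaluation(s) failed for {eval_file}")

    @classmethod
    def teardown_class(cls) -> None:
        # Runs once after all parameterized tests: combine results and persist the report.
        if not cls.collected_results:
            return
        report = pd.concat(cls.collected_results, ignore_index=True)
        out_dir = Path("eval_reports")
        out_dir.mkdir(exist_ok=True)
        report.to_csv(out_dir / "evaluation-report.csv", index=False)
        report.style.to_html(out_dir / "evaluation-report.html")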

Unrelated changes

This PR also adds an agent instruction for lowercasing indicator names in tool calls. This improves overall agent evaluation scores, since some evaluations were failing due to capitalized indicator names.

1. **Indicator Name Lowercased**: Ensure that indicator related arguments like `indicator_name` are never capitalized in tool calls. For example, use "query": "population" instead of "query": "Population".

📊 How to View the Report

  1. Wait for the "Run Agent Evals" CI check to complete.
  2. Go to the "View details" page for the workflow run.
  3. Scroll down to the "Upload Evaluation Reports" section.
  4. Use the Artifact Download URL to download the agent-eval-reports.
  5. Unzip the file and open the evaluation-report-....html file in your browser.

[Screenshot: View job details (open-eval-details)]

[Screenshot: Download report artifacts (download-evals)]

[Screenshot: View report (evals)]

Example artifacts: https://github.com/dwnoble/agent-toolkit/actions/runs/19450027468/artifacts/4595914364

@dwnoble dwnoble marked this pull request as ready for review October 17, 2025 21:32
@dwnoble dwnoble requested review from clincoln8 and keyurva October 17, 2025 21:32
@dwnoble (Contributor, Author) commented Nov 27, 2025

> Thanks Dan!!
>
> The folder structure is a little unintuitive for me. Here's one suggestion for reorganizing; this could be a follow-up or ignored if it doesn't sound good.

evals/
 |--evaluator_framework/                       # Core infrastructure for running agents and validations
 |   |--runner.py               # Handles the lifecycle: spins up the agent, mocks the MCP server, and captures tool calls
 |   |--evaluator.py            # Logic to compare the agent's actual tool calls against the expected schema/behavior
 |   |--base_test.py            # Base class containing common setup/teardown logic (e.g., initializing the runner)
 |   |--types.py                # Pydantic data models
 |
 |--tool_call_evals/            # Implementation of specific test suites for tool usage
     |--data/
     |   |--get_observations/
     |   |   |--place_params.test.json
     |   |
     |   |--orchestration/
     |       |--search_then_fetch.test.json
     |
     |--prompts.py              # Stores system instructions
     |--test_tool_usage.py      # The entry point that creates the agent, iterates through the data files and runs each test; inherits from base_test class

Agreed, my structure is a little unclear. I played around with it and, using your suggestion as a guide, came up with:

evals/
 |--evaluator_framework/  
 |   |--runner.py
 |   |--evaluator.py
 |   |--types.py
 |
 |--tool_call_evals/
     |--agent.py
     |--instructions.py
     |--test_tool_usage.py
     |
     |--data/
         |--get_observations/
         |   |--date_params.test.json
         |   |--place_params.test.json
         |   |--source_params.test.json
         |
         |--orchestration/
             |--search_then_fetch.test.json

Notes:

  • I ended up leaving out the base_test class for now since there's only a single test file
  • I kept an agent.py file and the instructions.py file, but I'm happy to move those around if you feel strongly

Wdyt?

@dwnoble (Contributor, Author) commented Nov 27, 2025

Thanks @clincoln8 !

I made those changes, and also:

  • Fixed a bug where multiple runs were re-using the same agent session
  • Updated the agent evaluator to run repeated agent runs in parallel (roughly the pattern sketched below)
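
For illustration, the fresh-session-per-run and parallel repetition changes follow the pattern below; run_single_eval, the session handling, and the return shape are assumptions, not the evaluator's actual code.

# Illustrative sketch: each repetition gets its own session, and repetitions run
# concurrently. Function names and fields are assumptions, not the evaluator's actual code.
import asyncio
import uuid


async def run_single_eval(prompt: str) -> dict:
    """One agent run; a fresh session id per run avoids state leaking between runs."""
    session_id = str(uuid.uuid4())
    await asyncio.sleep(0)  # placeholder for the actual agent invocation
    return {"session_id": session_id, "prompt": prompt, "tool_calls": []}


async def run_repeated_evals(prompt: str, num_runs: int = 4) -> list[dict]:
    """Run the same eval case several times in parallel and collect all results."""
    return await asyncio.gather(*(run_single_eval(prompt) for _ in range(num_runs)))


if __name__ == "__main__":
    results = asyncio.run(run_repeated_evals("what was the population of canada in 2020?"))
    print(f"{len(results)} runs completed")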

@clincoln8 clincoln8 self-requested a review December 1, 2025 17:45
@clincoln8 (Contributor) left a comment

Woohoo, thank you!

evaluation_score: EvaluationScore


class EvaluationDataFrameRow(BaseModel):
@clincoln8 (Contributor):

To reduce the amount of hardcoded field names, you could do something like the snippet below, where you label each field with json_schema_extra metadata and add a class method to retrieve those field names.

But that might be over complicating this and leaving as hardcoded strings might be the simpler option?

from typing import Annotated, Any, Dict, List
from pydantic import BaseModel, Field

# 1. Define Reusable Tagged Types
StatusStr = Annotated[str | None, Field(json_schema_extra={"style": "status"})]
ScoreFloat = Annotated[float | None, Field(json_schema_extra={"style": "score", "format_str": "{:.3f}"})]
ThresholdFloat = Annotated[float | None, Field(json_schema_extra={"format_str": "{:.3f}"})]

class EvaluationDataFrameRow(BaseModel):
    overall_eval_status: StatusStr
    # ...
    average_tool_call_score: ScoreFloat
    # ...
    tool_call_score_threshold: ThresholdFloat
    
    # 2. Define the helper as a class method
    @classmethod
    def get_field_names(cls, tag_key: str, tag_value: str) -> List[str]:
        """Returns list of column names matching a metadata key/value pair."""
        return [
            name for name, field_info in cls.model_fields.items()
            if field_info.json_schema_extra and field_info.json_schema_extra.get(tag_key) == tag_value
        ]

    @classmethod
    def get_metadata_map(cls, key: str) -> Dict[str, Any]:
        """
        Generic extractor: Returns a dict of {field_name: tag_value} 
        for any field that has the specified metadata key.
        """
        result = {}
        for name, field in cls.model_fields.items():
            if field.json_schema_extra:
                val = field.json_schema_extra.get(key)
                if val is not None:
                    result[name] = val
        return result

# --- Usage ---

format_dict = EvaluationDataFrameRow.get_metadata_map("format_str")
# ...
df.style.apply(
    style_status,
    subset=EvaluationDataFrameRow.get_field_names("style", "status"),
)
# ...

@dwnoble (Contributor, Author):

I took this as inspiration and added some enums to define the "style", "status", and "format_str" strings as well. Let me know what you think. I also tacked on a sticky header to the report (the header scrolling out of view had been bothering me before).
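
For reference, the enum approach could look roughly like the sketch below; the enum and field names here are guesses rather than the exact ones committed in the PR.

# Rough sketch of replacing hardcoded metadata strings with enums.
# Enum and field names are illustrative guesses, not the PR's exact definitions.
from enum import Enum

from pydantic import BaseModel, Field


class MetadataKey(str, Enum):
    STYLE = "style"
    FORMAT_STR = "format_str"


class CellStyle(str, Enum):
    STATUS = "status"
    SCORE = "score"


class EvaluationDataFrameRow(BaseModel):
    overall_eval_status: str | None = Field(
        default=None, json_schema_extra={MetadataKey.STYLE.value: CellStyle.STATUS.value}
    )
    average_tool_call_score: float | None = Field(
        default=None,
        json_schema_extra={
            MetadataKey.STYLE.value: CellStyle.SCORE.value,
            MetadataKey.FORMAT_STR.value: "{:.3f}",
        },
    )

The get_field_names/get_metadata_map helpers from the suggestion above could then accept MetadataKey members instead of raw strings.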

pyproject.toml Outdated
]
"TC002", # Ignore unused "if TYPE_CHECKING" imports in test files
]
"packages/*/evals/**/*.py" = [
@clincoln8 (Contributor):

nit

Suggested change
- "packages/*/evals/**/*.py" = [
+ "packages/*/evals/**/test_*.py" = [

If these exceptions are needed for the eval framework files, then update the comments to reflect why we should ignore these rules for those files.

@dwnoble (Contributor, Author):

Good catch. These exceptions are only required for test cases. Updated the type checking rule along with some fixes in the eval framework that were previously skipped by ruff.

pyproject.toml Outdated
"ANN001", # Ignore missing type annotations in test files
"ANN201", # Ignore missing return types in public test functions
"ANN202", # Ignore missing return types in private test functions,
"TC002", # Ignore unused "if TYPE_CHECKING" imports in test files
@clincoln8 (Contributor):

Out of curiosity, why ignore TC002 in test files? Was there a particular example that was annoying?

@dwnoble (Contributor, Author):

I removed this line and the unit tests & evals still ran, so I'll tack that onto this PR.

Co-authored-by: Christie Ellks <calinc@google.com>
@clincoln8 (Contributor) left a comment

Looks great, thank you!!

@dwnoble dwnoble merged commit d624acc into datacommonsorg:main Dec 8, 2025
6 of 7 checks passed