Generate and upload HTML/CSV agent evaluation reports #109
Conversation
…taframes rather than just passing/failing the tests. Added a hook to write test results to CSV and HTML.
Agreed, my structure is a little unclear. I played around with it and, using your suggestion as a guide, came up with the following. Notes:
Wdyt?
Thanks @clincoln8! I made those changes, and also:
clincoln8
left a comment
Woohoo, thank you!
| evaluation_score: EvaluationScore |
| class EvaluationDataFrameRow(BaseModel): |
To reduce the number of hardcoded field names, you could do something like the snippet below, where you label each field with extra JSON metadata and add a class method to retrieve those field names.
But that might be overcomplicating this, and leaving them as hardcoded strings might be the simpler option?
from typing import Annotated, Any, Dict, List

from pydantic import BaseModel, Field

# 1. Define Reusable Tagged Types
StatusStr = Annotated[str | None, Field(json_schema_extra={"style": "status"})]
ScoreFloat = Annotated[float | None, Field(json_schema_extra={"style": "score", "format_str": "{:.3f}"})]
ThresholdFloat = Annotated[float | None, Field(json_schema_extra={"format_str": "{:.3f}"})]


class EvaluationDataFrameRow(BaseModel):
    overall_eval_status: StatusStr
    # ...
    average_tool_call_score: ScoreFloat
    # ...
    tool_call_score_threshold: ThresholdFloat

    # 2. Define the helpers as class methods
    @classmethod
    def get_field_names(cls, tag_key: str, tag_value: str) -> List[str]:
        """Returns a list of column names matching a metadata key/value pair."""
        return [
            name
            for name, field_info in cls.model_fields.items()
            if field_info.json_schema_extra
            and field_info.json_schema_extra.get(tag_key) == tag_value
        ]

    @classmethod
    def get_metadata_map(cls, key: str) -> Dict[str, Any]:
        """Generic extractor: returns a dict of {field_name: tag_value}
        for any field that has the specified metadata key.
        """
        result = {}
        for name, field in cls.model_fields.items():
            if field.json_schema_extra:
                val = field.json_schema_extra.get(key)
                if val is not None:
                    result[name] = val
        return result


# --- Usage ---
# (`df` and `style_status` are assumed to be defined elsewhere.)
format_dict = EvaluationDataFrameRow.get_metadata_map("format_str")
# ...
df.style.apply(
    style_status,
    subset=EvaluationDataFrameRow.get_field_names("style", "status"),
)
# ...
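The format map can then be fed to the pandas Styler, whose format method accepts a dict of column name to format string (df and format_dict as in the snippet above):

# Apply the per-column format strings collected from the model metadata.
styled = df.style.format(format_dict)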
I took this as inspiration and added some enums to define the "style", "status", and "format_str" strings as well. Let me know what you think. I also tacked on a sticky header to the report (the header scrolling out of view had been bothering me before).
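For reference, a minimal sketch of what those enums could look like; the names TagKey and StyleValue are illustrative assumptions, not necessarily what the PR uses:

from enum import Enum
from typing import Annotated

from pydantic import Field


class TagKey(str, Enum):
    """json_schema_extra keys used to tag report columns (illustrative)."""
    STYLE = "style"
    FORMAT_STR = "format_str"


class StyleValue(str, Enum):
    """Values for the "style" tag (illustrative)."""
    STATUS = "status"
    SCORE = "score"


# Tagged types reference the enum values instead of repeating raw strings.
StatusStr = Annotated[
    str | None,
    Field(json_schema_extra={TagKey.STYLE.value: StyleValue.STATUS.value}),
]
ScoreFloat = Annotated[
    float | None,
    Field(
        json_schema_extra={
            TagKey.STYLE.value: StyleValue.SCORE.value,
            TagKey.FORMAT_STR.value: "{:.3f}",
        }
    ),
]

The helpers from the earlier snippet can then be called with these enum values, e.g. get_field_names(TagKey.STYLE.value, StyleValue.STATUS.value), instead of string literals.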
packages/datacommons-mcp/evals/tool_call_evals/test_tool_usage.py
Outdated
pyproject.toml
Outdated
| ] |
| "TC002", # Ignore unused "if TYPE_CHECKING" imports in test files |
| ] |
| "packages/*/evals/**/*.py" = [ |
nit
| "packages/*/evals/**/*.py" = [ | |
| "packages/*/evals/**/test_*.py" = [ |
If these exceptions are needed for the eval framework files, then update the comments to reflect why we should ignore these rules for those files.
Good catch. These exceptions are only required for test cases. Updated the type checking rule along with some fixes in the eval framework that were previously skipped by ruff.
pyproject.toml
Outdated
| "ANN001", # Ignore missing type annotations in test files | ||
| "ANN201", # Ignore missing return types in public test functions | ||
| "ANN202", # Ignore missing return types in private test functions, | ||
| "TC002", # Ignore unused "if TYPE_CHECKING" imports in test files |
ooc, why ignore TC002 in test files? was there a particular example that was annoying?
I removed this line and the unit tests & evals still ran, so I'll tack that onto this PR.
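For context, ruff's TC002 rule flags third-party imports that are only used in type annotations and suggests moving them into an if TYPE_CHECKING: block. A minimal illustration of what it complains about (not taken from this repo):

# `pandas` is imported at runtime but only used in a type annotation, so
# TC002 suggests moving the import under `if TYPE_CHECKING:`, e.g.
#
#     from typing import TYPE_CHECKING
#     if TYPE_CHECKING:
#         import pandas as pd
import pandas as pd


def summarize(df: pd.DataFrame) -> int:
    """Return the number of rows in an evaluation results frame."""
    return len(df)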
Co-authored-by: Christie Ellks <calinc@google.com>
clincoln8
left a comment
Looks great, thank you!!
This PR refactors our agent evaluations to provide persistent reports for all eval runs.
Previously, the AgentEvaluator would run evaluations and immediately fail the pytest run if any evaluation failed. Results were only printed to stdout, making it difficult to debug or see a consolidated view of all test outcomes.

New approach:
- AgentEvaluator forked from ADK: a new, local version of the AgentEvaluator is added. This evaluator is updated to run tests and return a pd.DataFrame of the results, rather than failing the test run directly.
- The pytest test (test_tools.py) now:
  - Gets the results DataFrame from the evaluator.
  - Calls pytest.fail() to fail the test (if necessary).
  - Collects the DataFrame from all parameterized test runs into a class variable.
- A teardown_class method in pytest combines all collected DataFrames into a single, comprehensive report. This report is saved as both a styled HTML file and a CSV file.
- The GitHub Actions workflow (evals.yaml) is updated to upload these reports as a build artifact on every run (even on failure) using if: always().

This gives us visibility into all evaluation results, making it possible to track pass/fail rates, see average scores, and debug specific failures by inspecting the full context.
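A rough sketch of the collection/teardown pattern described above; run_agent_evaluation, the test and case names, and the column/file names are hypothetical stand-ins rather than the exact code in this PR:

import pandas as pd
import pytest


def run_agent_evaluation(eval_case: str) -> pd.DataFrame:
    """Hypothetical stand-in for the forked AgentEvaluator; returns per-case results."""
    return pd.DataFrame([{"eval_case": eval_case, "overall_eval_status": "PASSED"}])


class TestToolUsage:
    # DataFrames collected across all parameterized eval runs.
    results: list[pd.DataFrame] = []

    @pytest.mark.parametrize("eval_case", ["case_a", "case_b"])
    def test_eval_case(self, eval_case):
        df = run_agent_evaluation(eval_case)
        # Collect results before (possibly) failing so they still reach the report.
        type(self).results.append(df)
        if (df["overall_eval_status"] == "FAILED").any():
            pytest.fail(f"Evaluation failed for {eval_case}")

    @classmethod
    def teardown_class(cls):
        if not cls.results:
            return
        report = pd.concat(cls.results, ignore_index=True)
        report.to_csv("evaluation-report.csv", index=False)
        # Styler.to_html writes the styled HTML version of the same report.
        report.style.to_html("evaluation-report.html")

Appending each DataFrame before calling pytest.fail() is what lets failed cases still appear in the combined report, and the workflow's if: always() ensures the artifact is uploaded even when the test job fails.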
Unrelated changes
This PR also adds an agent instruction for lowercasing indicator names in tool calls. This improves overall agent evaluation scores, as we were finding some failed evaluations due to capitalized indicator names.
📊 How to View the Report
- From the workflow run, use the Artifact Download URL to download the agent-eval-reports artifact.
- Open the evaluation-report-....html file in your browser.

(Screenshots: View job details · Download report artifacts · View report)
Example artifacts: https://github.com/dwnoble/agent-toolkit/actions/runs/19450027468/artifacts/4595914364