Conversation

@benomahony

Adds a convenience function that simplifies generating evaluation datasets by automatically extracting types from an existing agent.

Example Usage:

from pydantic import BaseModel
from pydantic_ai import Agent
from pydantic_evals.generation import generate_evals_from_agent

class AnswerOutput(BaseModel):
    answer: str
    confidence: float

agent = Agent(
    'openai:gpt-4o',
    output_type=AnswerOutput,
    system_prompt='You answer questions about world geography.',
)

# Types are automatically extracted from the agent
dataset = await generate_evals_from_agent(
    agent=agent,
    n_examples=5,
    model='openai:gpt-4o',
    path='test_cases.json'
)

Key benefit: there is no need to manually specify dataset_type=Dataset[InputsT, OutputT, MetadataT]; the function infers the types from your agent. It generates diverse test inputs with an LLM, then runs them through your actual agent to capture real outputs as expected results.
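For comparison, here is roughly what the same bootstrap looks like with the existing generate_dataset helper, where the generic parameters have to be spelled out by hand. This is a hedged sketch: only dataset_type, n_examples, model, and path come from this PR's description, and the concrete Dataset[str, AnswerOutput, dict] parametrization is an assumption about the example agent above.

from pydantic_evals import Dataset
from pydantic_evals.generation import generate_dataset

# Without the new helper, InputsT/OutputT/MetadataT must be written out manually.
dataset = await generate_dataset(
    dataset_type=Dataset[str, AnswerOutput, dict],
    n_examples=5,
    model='openai:gpt-4o',
    path='test_cases.json',
)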

    Raises:
        ValidationError: If the LLM's response cannot be parsed.
    """
    # Get output schema with proper type handling
Collaborator

I don't think this is going to cover a lot of real-world cases. We'll need a way to actually get an agent's complete output schema: #3076 (comment).
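For illustration (not code from this PR): output_type is not always a single model. pydantic_ai allows a list of possible output types, for example, so reading a single model off the agent would describe only part of the real schema.

from pydantic_ai import Agent

# An agent whose output schema is a union of the AnswerOutput model from the
# example above and plain text; extracting one model misses the str branch.
agent = Agent('openai:gpt-4o', output_type=[AnswerOutput, str])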

Author

Yes, this would be way better! Are you working on that, or open to a contribution?

Collaborator

@benomahony I just created an issue for it: #3225. Definitely open to a contribution!

generation_prompt = (
    f'Generate {n_examples} test case inputs for an agent.\n\n'
    f'The agent accepts inputs of type: {inputs_schema}\n'
    f'The agent produces outputs of type: {output_schema}\n\n'
Collaborator

Is this needed at all?

if not output:
    raise ValueError('Empty output after stripping markdown fences')

# Additional cleanup in case strip_markdown_fences didn't catch everything
Collaborator

Should we extend strip_markdown_fences?

Author

@DouweM I've moved that into a separate PR: #3222
This now works with newlines and whitespace
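A minimal sketch of what an extended strip_markdown_fences could look like, tolerating surrounding whitespace and newlines; the actual helper in pydantic_evals (and the changes in #3222) may differ.

import re

_FENCE_RE = re.compile(r'^\s*```[\w+-]*\s*\n(.*?)\n?\s*```\s*$', re.DOTALL)

def strip_markdown_fences(text: str) -> str:
    # If the whole string is one fenced block, return its body; otherwise
    # just strip surrounding whitespace and return the text unchanged.
    match = _FENCE_RE.match(text)
    if match:
        return match.group(1).strip()
    return text.strip()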

    f'Generate {n_examples} test case inputs for an agent.\n\n'
    f'The agent accepts inputs of type: {inputs_schema}\n'
    f'The agent produces outputs of type: {output_schema}\n\n'
    f'Return a JSON array of objects with "name" (optional string), "inputs" (matching the input type), '
Collaborator

Can we use a structured output_type built using inputs_type instead?
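A hedged sketch of that suggestion: parametrize a small generic wrapper with the concrete inputs type at runtime and hand it to the generator agent as output_type, so the response is parsed for us instead of coming back as raw JSON. GeneratedCase, GeneratedCases, and build_generator_agent are illustrative names, not code from this PR.

from typing import Any, Generic, TypeVar

from pydantic import BaseModel
from pydantic_ai import Agent

InputsT = TypeVar('InputsT')

class GeneratedCase(BaseModel, Generic[InputsT]):
    name: str | None = None
    inputs: InputsT
    metadata: dict[str, Any] | None = None

class GeneratedCases(BaseModel, Generic[InputsT]):
    cases: list[GeneratedCase[InputsT]]

def build_generator_agent(model: str, inputs_type: type) -> Agent:
    # Parametrize the generic wrapper with the extracted inputs type at runtime.
    return Agent(model, output_type=GeneratedCases[inputs_type])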

    f'The agent accepts inputs of type: {inputs_schema}\n'
    f'The agent produces outputs of type: {output_schema}\n\n'
    f'Return a JSON array of objects with "name" (optional string), "inputs" (matching the input type), '
    f'and "metadata" (optional, any additional context).\n'
Collaborator

It's unclear to the model what it should use metadata for.

Author

Yeah, I didn't know how to make this work generically. Maybe we can discard it for now?

Collaborator

Ok by me

Case(
    name=item.get('name', f'case-{i}'),
    inputs=cast(InputsT, item['inputs']),
    expected_output=agent_result.output,
Collaborator

This seems odd, as the goal is typically to define expected outputs manually and then evaluate whether the agent run matches them. Now we're just assuming the agent's current responses are the desired responses. I could see that being useful to get a dataset going, but then we'd want to document that you should verify and modify the expected outputs yourself.

Author

So my use case is to bootstrap a dataset for a very complex model (https://github.com/bitol-io/open-data-contract-standard/blob/main/schema/odcs-json-schema-latest.json -> 600 lines of Pydantic models).

Creating the structure by hand is painful, so I thought a helper method would be nice!

Collaborator

I like the idea of a helper method for creating cases, we just can't assume the agent's current output will actually be the expected output, so we should make it clear the user is meant to review this manually.
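A hedged sketch of the workflow that review note implies: generate a dataset once, inspect every captured expected_output by hand, fix the ones the agent got wrong, and only then rely on the file for evaluation. Dataset.from_file/to_file are the existing pydantic_evals serialization helpers; the rest is illustrative.

from pydantic_evals import Dataset

dataset = Dataset.from_file('test_cases.json')
for case in dataset.cases:
    # The captured outputs reflect the agent's current behaviour, not ground
    # truth; review each one and overwrite expected_output where it is wrong.
    print(case.name, case.inputs, case.expected_output)

dataset.to_file('test_cases.json')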

def test_import_generate_dataset_from_agent():
    # this function is tough to test in an interesting way outside an example...
    # this at least ensures importing it doesn't fail.
    # TODO: Add an "example" that actually makes use of this functionality
Collaborator

We do need a real test. What makes it tough to test?

Author

Honestly, I saw the test above and copied it ;)

I'll take a look at tests for both.
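One possible shape for a real test, as a hedged sketch: it assumes the signature shown in the PR description, uses pydantic_ai's TestModel for the target agent, and fakes the generator with a FunctionModel returning canned JSON, so nothing hits a real LLM. The imported name follows the PR title and the anyio marker mirrors the repo's test setup; both are assumptions to adjust against the final code.

import json

import pytest
from pydantic_ai import Agent
from pydantic_ai.messages import ModelMessage, ModelResponse, TextPart
from pydantic_ai.models.function import AgentInfo, FunctionModel
from pydantic_ai.models.test import TestModel

from pydantic_evals.generation import generate_dataset_from_agent


def fake_generator(messages: list[ModelMessage], info: AgentInfo) -> ModelResponse:
    # Canned generator response: one case in the JSON shape the helper asks for.
    cases = [{'name': 'capital-question', 'inputs': 'What is the capital of France?'}]
    return ModelResponse(parts=[TextPart(content=json.dumps(cases))])


@pytest.mark.anyio
async def test_generate_dataset_from_agent(tmp_path):
    agent = Agent(TestModel(), system_prompt='You answer questions about world geography.')
    dataset = await generate_dataset_from_agent(
        agent=agent,
        n_examples=1,
        model=FunctionModel(fake_generator),
        path=tmp_path / 'cases.json',
    )
    assert len(dataset.cases) == 1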

@DouweM self-assigned this Oct 21, 2025
@benomahony changed the title from "Add generate_dataset_from_agent to Pydantic Evals" to "DRAFT: Add generate_dataset_from_agent to Pydantic Evals" Oct 22, 2025