DRAFT: Add generate_dataset_from_agent to Pydantic Evals #3187
Conversation
    Raises:
        ValidationError: If the LLM's response cannot be parsed.
    """
    # Get output schema with proper type handling
Reviewer: I don't think this is going to cover a lot of real-world cases. We'll need a way to actually get an agent's complete output schema: #3076 (comment).
Author: Yes, this would be way better! Are you on that, or open to a contribution?
Reviewer: @benomahony I just created an issue for it: #3225. Definitely open to a contribution!
    generation_prompt = (
        f'Generate {n_examples} test case inputs for an agent.\n\n'
        f'The agent accepts inputs of type: {inputs_schema}\n'
        f'The agent produces outputs of type: {output_schema}\n\n'
Reviewer: Is this needed at all?
    if not output:
        raise ValueError('Empty output after stripping markdown fences')

    # Additional cleanup in case strip_markdown_fences didn't catch everything
Reviewer: Should we extend strip_markdown_fences?
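For illustration, one way to extend it: search for a fenced block anywhere in the response instead of only stripping an exact wrapper. This is a sketch only; apart from the name strip_markdown_fences (used in the diff above), the behavior shown here is an assumption, not the library's actual implementation.

```python
import re

def strip_markdown_fences(text: str) -> str:
    # Tolerate prose around the fence and an optional language tag
    # (```json, ```python, ...), returning the fenced body if found.
    match = re.search(r'```(?:[\w-]+)?\s*\n(.*?)```', text, re.DOTALL)
    if match:
        return match.group(1).strip()
    return text.strip()

assert strip_markdown_fences('Here you go:\n```json\n[1, 2]\n```\nEnjoy!') == '[1, 2]'
```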
    f'Generate {n_examples} test case inputs for an agent.\n\n'
    f'The agent accepts inputs of type: {inputs_schema}\n'
    f'The agent produces outputs of type: {output_schema}\n\n'
    f'Return a JSON array of objects with "name" (optional string), "inputs" (matching the input type), '
Reviewer: Can we use a structured output_type built using inputs_type instead?
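For reference, a sketch of what that could look like, using pydantic.create_model to parameterize the case spec with the agent's input type. The names GeneratedCase and build_case_spec_type are hypothetical, not part of the PR:

```python
from pydantic import BaseModel, create_model

def build_case_spec_type(inputs_type: type) -> type[BaseModel]:
    # A case spec whose `inputs` field uses the target agent's input type,
    # so the provider's structured output enforces the schema during
    # generation instead of us parsing a JSON array out of free text.
    return create_model(
        'GeneratedCase',
        name=(str | None, None),
        inputs=(inputs_type, ...),
    )

# The generator agent could then return cases directly, e.g.:
#   Agent(model, output_type=list[build_case_spec_type(inputs_type)])
```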
    f'The agent accepts inputs of type: {inputs_schema}\n'
    f'The agent produces outputs of type: {output_schema}\n\n'
    f'Return a JSON array of objects with "name" (optional string), "inputs" (matching the input type), '
    f'and "metadata" (optional, any additional context).\n'
Reviewer: It's unclear to the model what it should use metadata for.
Author: Yeah, I didn't know how to make this work generically. Maybe we can discard it for now?
Reviewer: OK by me.
    Case(
        name=item.get('name', f'case-{i}'),
        inputs=cast(InputsT, item['inputs']),
        expected_output=agent_result.output,
Reviewer: This seems odd, as the goal is typically to define expected outputs manually and then evaluate whether the agent run matches them. Now we're just assuming the agent's current responses are the desired responses. I could see that being useful to get a dataset going, but then we'd want to document that you should verify and modify the expected outputs yourself.
Author: My use case is to bootstrap a dataset for a very complex model (https://github.com/bitol-io/open-data-contract-standard/blob/main/schema/odcs-json-schema-latest.json, which comes out to roughly 600 lines of Pydantic). Creating the structure by hand is painful, so I thought a nice helper method would help!
Reviewer: I like the idea of a helper method for creating cases; we just can't assume the agent's current output will actually be the expected output, so we should make it clear the user is meant to review it manually.
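For instance, the docs could show a review round-trip along these lines. A sketch: dataset and my_task are assumed to exist from earlier context, and to_file, from_file, and evaluate_sync are the existing pydantic_evals Dataset helpers:

```python
from typing import Any

from pydantic_evals import Dataset

# The generated expected outputs are the agent's *current* behaviour,
# not ground truth. Persist them, correct them by hand, then reload
# the curated file for real evaluations.
dataset.to_file('cases.yaml')
reviewed = Dataset[str, str, Any].from_file('cases.yaml')
report = reviewed.evaluate_sync(my_task)
```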
    def test_import_generate_dataset_from_agent():
        # this function is tough to test in an interesting way outside an example...
        # this at least ensures importing it doesn't fail.
        # TODO: Add an "example" that actually makes use of this functionality
Reviewer: We do need a real test. What makes it tough to test?
Author: Honestly, I saw the test above and copied it ;) I'll take a look at tests for both.
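One option for a real test that stays offline is pydantic_ai's TestModel. This is a sketch only: it assumes the function accepts a model override like the existing generate_dataset, and the module path, custom_output_text value, and exact signature are assumptions:

```python
import json

import pytest
from pydantic_ai import Agent
from pydantic_ai.models.test import TestModel

from pydantic_evals.generation import generate_dataset_from_agent  # assumed location

@pytest.mark.anyio
async def test_generate_dataset_from_agent():
    # Canned generator output: one case spec as a JSON array.
    cases_json = json.dumps([{'name': 'case-0', 'inputs': 'hello'}])
    # The target agent also runs on TestModel, so nothing hits a real provider.
    agent = Agent(TestModel(custom_output_text='ok'), output_type=str)
    dataset = await generate_dataset_from_agent(
        agent,
        model=TestModel(custom_output_text=cases_json),
        n_examples=1,
    )
    assert len(dataset.cases) == 1
    assert dataset.cases[0].expected_output == 'ok'
```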
PR description:

Adds a convenience function that simplifies generating evaluation datasets by automatically extracting types from an existing agent.

Key benefit: there's no need to manually specify dataset_type=Dataset[InputsT, OutputT, MetadataT], because the function infers the types from your agent. It generates diverse test inputs with an LLM, then runs them through your actual agent to capture real outputs as expected results.

Example usage:
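(A sketch of the intended call; only generate_dataset_from_agent and n_examples appear in the diff, so the module path, model name, output type, and to_file step are assumptions.)

```python
import asyncio

from pydantic import BaseModel
from pydantic_ai import Agent

from pydantic_evals.generation import generate_dataset_from_agent  # assumed location

class Contract(BaseModel):
    name: str
    version: str

agent = Agent('openai:gpt-4o', output_type=Contract)

async def main():
    # Dataset[InputsT, OutputT, MetadataT] is inferred from the agent,
    # so no dataset_type needs to be spelled out.
    dataset = await generate_dataset_from_agent(agent, n_examples=5)
    # Review before use: expected outputs are the agent's current responses.
    dataset.to_file('contract_cases.yaml')

asyncio.run(main())
```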