DRAFT: Add generate_dataset_from_agent to Pydantic Evals #3187
Raises:
    ValidationError: If the LLM's response cannot be parsed.
"""
# Get output schema with proper type handling
I don't think this is going to cover a lot of real-world cases. We'll need a way to actually get an agent's complete output schema: #3076 (comment).
Yes, this would be way better! Are you working on that, or are you open to a contribution?
@benomahony I just created an issue for it: #3225. Definitely open to a contribution!
generation_prompt = (
    f'Generate {n_examples} test case inputs for an agent.\n\n'
    f'The agent accepts inputs of type: {inputs_schema}\n'
    f'The agent produces outputs of type: {output_schema}\n\n'
Is this needed at all?
if not output:
    raise ValueError('Empty output after stripping markdown fences')

# Additional cleanup in case strip_markdown_fences didn't catch everything
Should we extend `strip_markdown_fences`?
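One possible shape for a more tolerant fence stripper, as a sketch only. The helper name `strip_fences_sketch` is hypothetical and is not the existing `strip_markdown_fences`; it assumes the model may wrap its JSON in a fenced block, with or without a language tag, and may add prose around it.

```python
import re

# Hypothetical helper for illustration; not the library's strip_markdown_fences.
FENCE = "`" * 3  # three backticks
_FENCE_RE = re.compile(FENCE + r"[\w-]*[ \t]*\n(.*?)\n?" + FENCE, re.DOTALL)


def strip_fences_sketch(text: str) -> str:
    """Return the body of the first fenced block, or the stripped text if none."""
    match = _FENCE_RE.search(text)
    if match:
        return match.group(1).strip()
    return text.strip()
```

An empty fenced block yields an empty string, so a caller can keep a single "empty output" check after this step instead of a second cleanup pass.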
f'Generate {n_examples} test case inputs for an agent.\n\n'
f'The agent accepts inputs of type: {inputs_schema}\n'
f'The agent produces outputs of type: {output_schema}\n\n'
f'Return a JSON array of objects with "name" (optional string), "inputs" (matching the input type), '
Can we use a structured `output_type` built using `inputs_type` instead?
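A rough illustration of what "structured" could mean here, under stated assumptions. In the PR this type would presumably be handed to the generating agent as its `output_type` so the framework does the validation; the sketch below uses plain dataclasses as a stand-in, and `GeneratedCase`, `parse_cases`, and `Question` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Generic, Optional, TypeVar

InputsT = TypeVar("InputsT")


@dataclass
class GeneratedCase(Generic[InputsT]):
    """Hypothetical container: one generated case with typed inputs."""
    inputs: InputsT
    name: Optional[str] = None


def parse_cases(raw: list[dict[str, Any]], inputs_type: type[InputsT]) -> list[GeneratedCase[InputsT]]:
    """Build typed cases from decoded JSON instead of trusting free-form dicts."""
    cases = []
    for i, item in enumerate(raw):
        # Raises TypeError if the fields do not match inputs_type.
        inputs = inputs_type(**item["inputs"])
        cases.append(GeneratedCase(inputs=inputs, name=item.get("name") or f"case-{i}"))
    return cases


@dataclass
class Question:  # example inputs type, assumed for this sketch
    text: str
    language: str = "en"
```

With a typed container like this, the prompt no longer needs to describe the JSON layout by hand, and a mismatch surfaces as an error rather than a `cast`.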
f'The agent accepts inputs of type: {inputs_schema}\n'
f'The agent produces outputs of type: {output_schema}\n\n'
f'Return a JSON array of objects with "name" (optional string), "inputs" (matching the input type), '
f'and "metadata" (optional, any additional context).\n'
It's unclear to the model what it should use metadata for.
Yeah, I didn't know how to make this work generically. Maybe we can drop it for now?
Ok by me
Case(
    name=item.get('name', f'case-{i}'),
    inputs=cast(InputsT, item['inputs']),
    expected_output=agent_result.output,
This seems odd, as the goal is typically to define expected outputs manually and then evaluate whether the agent run matches them. Now we're just assuming the agent's current responses are the desired responses. I could see that being useful to get a dataset going, but then we'd want to document that you should verify and modify the expected outputs yourself.
My use case is to bootstrap a dataset for a very complex model (https://github.com/bitol-io/open-data-contract-standard/blob/main/schema/odcs-json-schema-latest.json, which comes to 600 lines of Pydantic).
Creating the structure by hand is painful, so I thought a helper method would help!
I like the idea of a helper method for creating cases, but we can't assume the agent's current output is actually the expected output, so we should make it clear that the user is meant to review it manually.
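One way the "review manually" expectation could be made visible in the data itself, sketched with hypothetical names (`DraftCase`, `bootstrap_case`, `unreviewed`) and a `reviewed` flag that is an assumption of this example, not part of the PR:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class DraftCase:
    """Hypothetical stand-in for a dataset case bootstrapped from an agent run."""
    name: str
    inputs: Any
    expected_output: Any
    reviewed: bool = False  # flipped by a human once the expected output is confirmed


def bootstrap_case(name: str, inputs: Any, agent_output: Any) -> DraftCase:
    # The agent's current output is only a draft of the expected output.
    return DraftCase(name=name, inputs=inputs, expected_output=agent_output)


def unreviewed(cases: list[DraftCase]) -> list[str]:
    """Names of cases whose expected output has not been confirmed yet."""
    return [case.name for case in cases if not case.reviewed]
```

A check like `unreviewed(cases)` could then gate evaluation runs, so that unverified expected outputs are not silently treated as ground truth.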
def test_import_generate_dataset_from_agent():
    # this function is tough to test in an interesting way outside an example...
    # this at least ensures importing it doesn't fail.
    # TODO: Add an "example" that actually makes use of this functionality
We do need a real test. What makes it tough to test?
Honestly I saw the test above and copied it ;)
Will take a look at tests for both
Adds a convenience function that simplifies generating evaluation datasets by automatically extracting types from an existing agent.
Example Usage:
Key benefit: No need to manually specify dataset_type=Dataset[InputsT, OutputT, MetadataT] - the function automatically infers types from your agent. It generates diverse test inputs with an LLM, then runs them through your actual agent to capture real outputs as expected results.
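The two-phase flow described above can be sketched without any framework. This is not the PR's implementation: `bootstrap_dataset_sketch` is a hypothetical name, and the generator and agent are plain callables standing in for the LLM and the real agent.

```python
from typing import Any, Callable


def bootstrap_dataset_sketch(
    generate_inputs: Callable[[int], list[Any]],
    run_agent: Callable[[Any], Any],
    n_examples: int,
) -> list[dict[str, Any]]:
    """Phase 1: generate inputs. Phase 2: record the agent's output for each."""
    cases = []
    for i, inputs in enumerate(generate_inputs(n_examples)):
        cases.append({
            "name": f"case-{i}",
            "inputs": inputs,
            # Draft only: this is whatever the agent returns today.
            "expected_output": run_agent(inputs),
        })
    return cases


# Stand-ins for the LLM generator and the agent under test.
def fake_generator(n: int) -> list[str]:
    return [f"question {i}" for i in range(n)]


def fake_agent(text: str) -> str:
    return text.upper()


dataset = bootstrap_dataset_sketch(fake_generator, fake_agent, 2)
```

Passing the generator and agent in as callables also keeps the flow easy to exercise with deterministic stubs.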