Add pydantic-evals package #935
I especially like the output you are working on. I would like a clear default OTLP export of the traces made by the evaluator (forgive me if this is already done and I just don't see it).

Here's a toy I am using in deepeval, which isn't as pretty as yours, but it exports traces. Ignore that, for convenience, I'm testing `main` instead of a function; this is a toy. One thing I would like in any eval is pytest integration, either native or via manual stitching like I'm doing. None of this is required for you, just some feedback from an outside brain.

```python
import os

import openai

CHAT_MODEL = os.environ.get("CHAT_MODEL", "gpt-4o-mini")
INPUT = "Answer in up to 3 words: Which ocean contains Bouvet Island?"


def main():
    client = openai.Client()
    messages = [{"role": "user", "content": INPUT}]
    chat_completion = client.chat.completions.create(model=CHAT_MODEL, messages=messages)
    print(chat_completion.choices[0].message.content)


if __name__ == "__main__":
    main()
```

Then eval like this:

```python
import pytest
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, HallucinationMetric

# Note: We don't opt out of telemetry like below because we override the otel
# config to send to our own collector.
# os.environ["DEEPEVAL_TELEMETRY_OPT_OUT"] = "YES"


def test_evals(capsys):
    from chat import main

    main()
    actual_output = capsys.readouterr().out.strip()

    from chat import INPUT

    test_case = LLMTestCase(
        input=INPUT,
        actual_output=actual_output,
        context=["Atlantic Ocean"],
    )

    metrics = [
        AnswerRelevancyMetric(threshold=0.7),
        HallucinationMetric(threshold=0.8),
    ]
    for metric in metrics:
        metric.measure(test_case, False)
        if not metric.success:
            pytest.fail(
                f"{type(metric).__name__} scored the following output {metric.score:.1f}: {actual_output}"
            )
```

While I'm not a fan of the sentry callback, the neat thing is that this ends up exported to OTLP via this, if you run pytest with https://github.com/confident-ai/deepeval/blob/main/deepeval/telemetry.py
Fix #915.
There's a lot more to polish/add before merging this but it shows the API I had in mind for benchmark-style / "offline" evals, and an initial stab at an API for (flexibly) producing reports.
The report stuff is probably more configurable than it should/needs to be, but it wasn't too hard to implement so I did. Happy to change how that works.
At least as of now, you can see an example run/output by running
uv run pydantic_ai_slim/pydantic_ai/evals/__init__.py
on this branch (that file has aif __name__ == '__main__'
that produces an example report).As of when I last updated this, in my terminal the report looks like this:

Note that if there are no scores / labels / metrics present in the cases, those columns will be excluded from the report. (So you don't have to pay the visual price unless you make use of those.) You also have the option to include case inputs and/or outputs in the generated reports, and can override most of the value- and diff-rendering logic on a per-score/label/metric basis.
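To make the column-dropping behavior described above concrete, here is a small illustrative sketch, not the API introduced in this PR: the `cases` data shape and the use of Rich for the terminal table are assumptions, chosen only to show a report that omits the scores/metrics columns when no case provides them.

```python
# Illustrative sketch only: build a terminal report table that skips empty columns.
# The `cases` structure and the use of Rich are assumptions, not this PR's API.
from rich.console import Console
from rich.table import Table

cases = [
    {"name": "bouvet-island", "output": "Atlantic Ocean", "scores": {}, "metrics": {}},
    {"name": "capital-of-fr", "output": "Paris", "scores": {}, "metrics": {}},
]

include_scores = any(case["scores"] for case in cases)
include_metrics = any(case["metrics"] for case in cases)

table = Table(title="Eval report (sketch)")
table.add_column("Case")
table.add_column("Output")
if include_scores:   # only show the column if at least one case has scores
    table.add_column("Scores")
if include_metrics:  # likewise for metrics
    table.add_column("Metrics")

for case in cases:
    row = [case["name"], case["output"]]
    if include_scores:
        row.append(", ".join(f"{k}={v}" for k, v in case["scores"].items()))
    if include_metrics:
        row.append(", ".join(f"{k}={v}" for k, v in case["metrics"].items()))
    table.add_row(*row)

Console().print(table)
```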