
Add pydantic-evals package #935


Merged: 124 commits from dmontagu/evals into main, Mar 28, 2025

Conversation

@dmontagu (Contributor) commented Feb 16, 2025

Fix #915.

There's a lot more to polish/add before merging this, but it shows the API I had in mind for benchmark-style / "offline" evals, and an initial stab at an API for (flexibly) producing reports.

The report stuff is probably more configurable than it should/needs to be, but it wasn't too hard to implement so I did. Happy to change how that works.

At least as of now, you can see an example run/output by running uv run pydantic_ai_slim/pydantic_ai/evals/__init__.py on this branch (that file has an if __name__ == '__main__' block that produces an example report).

As of when I last updated this, in my terminal the report looks like this:
[screenshot: example report rendered in the terminal]

Note that if there are no scores / labels / metrics present in the cases, those columns will be excluded from the report. (So you don't have to pay the visual price unless you make use of those.) You also have the option to include case inputs and/or outputs in the generated reports, and can override most of the value- and diff-rendering logic on a per-score/label/metric basis.
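
For context, here is a rough sketch of what a benchmark-style eval and report might look like. This is based on the shape of the published pydantic-evals package rather than the exact code on this branch, so the names and import paths are assumptions and may differ:

from pydantic_evals import Case, Dataset

# A dataset is a collection of cases; each case pairs inputs with an optional expected output.
dataset = Dataset(
    cases=[
        Case(
            name='capital_of_france',
            inputs='What is the capital of France?',
            expected_output='Paris',
        ),
    ],
)


async def answer(question: str) -> str:
    # The task under evaluation; in practice this would call an agent or model.
    return 'Paris'


report = dataset.evaluate_sync(answer)
# Score/label/metric columns only appear if the cases actually produce them;
# inputs and outputs are opt-in.
report.print(include_input=True, include_output=True)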

@hyperlint-ai bot left a comment

The style guide flagged several spelling errors that seemed like false positives. We skipped posting inline suggestions for the following words:

  • [Ee]vals
  • Evals
  • Pydantic


github-actions bot commented Feb 28, 2025

Docs Preview

commit: e603ee6
Preview URL: https://97272468-pydantic-ai-previews.pydantic.workers.dev


hyperlint-ai bot commented Mar 10, 2025

PR Change Summary

Introduced initial work on evals API and report generation functionality.

  • Implemented benchmark-style offline evals API
  • Added initial report generation capabilities
  • Configured report options for case inputs and outputs
  • Ensured report columns are dynamically included based on available metrics

Added Files

  • pydantic_evals/README.md

How can I customize these reviews?

Check out the Hyperlint AI Reviewer docs for more information on how to customize the review.

If you just want to ignore it on this PR, you can add the hyperlint-ignore label to the PR. Future changes won't trigger a Hyperlint review.

Note that for link checks we only check the first 30 links in a file and cache the results for several hours (so if you just added a page, you might see stale results). Our recommendation is to add hyperlint-ignore to the PR to skip the link check for this PR.

@codefromthecrypt commented:

I especially like the output you are working on. I would like a clear default OTLP export of the traces made by the evaluator (forgive me if that's already there and I just don't see it).

Here's a toy I am using with deepeval; it isn't as pretty as yours, but it exports traces. Ignore the fact that, for convenience, I'm testing main instead of a function; this is just a toy. One thing I would like in any eval framework is pytest integration, either native or via manual stitching like I'm doing here. None of this is required for you, just some feedback from an outside brain.

import os

import openai

CHAT_MODEL = os.environ.get("CHAT_MODEL", "gpt-4o-mini")
INPUT = "Answer in up to 3 words: Which ocean contains Bouvet Island?"


def main():
    # Send a single chat completion request and print the model's answer.
    client = openai.Client()

    messages = [{"role": "user", "content": INPUT}]
    chat_completion = client.chat.completions.create(model=CHAT_MODEL, messages=messages)
    print(chat_completion.choices[0].message.content)


if __name__ == "__main__":
    main()

Then eval it like this:

import pytest
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, HallucinationMetric

# Note: We don't opt out of telemetry like below because we override the otel
# config to send to our own collector.
# os.environ["DEEPEVAL_TELEMETRY_OPT_OUT"] = "YES"


def test_evals(capsys):
    from chat import main

    main()
    actual_output = capsys.readouterr().out.strip()

    from chat import INPUT

    test_case = LLMTestCase(
        input=INPUT,
        actual_output=actual_output,
        context=["Atlantic Ocean"],
    )

    metrics = [
        AnswerRelevancyMetric(threshold=0.7),
        HallucinationMetric(threshold=0.8),
    ]
    # Fail the test if any metric's score falls below its threshold.
    for metric in metrics:
        metric.measure(test_case, False)

        if not metric.success:
            pytest.fail(f"{type(metric).__name__} scored the following output {metric.score:.1f}: {actual_output}")

While I'm not a fan of the Sentry callback, the neat thing is that all of this ends up exported over OTLP, via the module below, if you run pytest with opentelemetry-instrument:

https://github.com/confident-ai/deepeval/blob/main/deepeval/telemetry.py
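
As a point of reference (not something this PR needs), here is a minimal sketch of routing those traces to your own collector without the opentelemetry-instrument wrapper. It assumes the opentelemetry-sdk and opentelemetry-exporter-otlp-proto-http packages are installed, and the endpoint URL is a placeholder:

# conftest.py (sketch): send any OpenTelemetry spans produced during the test run
# to your own OTLP collector. The endpoint below is a placeholder.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "evals"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)
trace.set_tracer_provider(provider)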

@samuelcolvin samuelcolvin enabled auto-merge (squash) March 28, 2025 14:54
@samuelcolvin samuelcolvin merged commit 51d642b into main Mar 28, 2025
14 checks passed
@samuelcolvin samuelcolvin deleted the dmontagu/evals branch March 28, 2025 14:55