(Or "Evils", as I'm coming to think of them)
We want to build a sane, open-source way to score the performance of LLM calls that is:
- local-first - so you don't need to use a hosted service
- flexible enough to work with whatever best practice emerges - ideally usable for any code that is stochastic enough to require scoring beyond passed/failed (that includes LLM SDKs used directly, and even other agent frameworks)
- usable both for "offline evals" (unit-test-style checks on performance; see the sketch after this list) and "online evals" (measuring performance in production or an equivalent environment, presumably via an observability platform like Pydantic Logfire)
- usable with Pydantic Logfire when and where that actually helps
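
To make the "offline evals" point concrete, here is a minimal sketch of what a unit-test-style check with graded scoring might look like. Everything in it (the `call_llm` stub, the `Case` dataclass, the `exact_match` and `contains_keywords` scorers, and the `score_case` helper) is hypothetical illustration, not a proposed API; it just shows what scoring beyond passed/failed could mean in practice.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

# Hypothetical stand-in for a real LLM call (via an LLM SDK or agent framework).
def call_llm(prompt: str) -> str:
    return "Paris is the capital of France."

@dataclass
class Case:
    prompt: str
    expected: str
    keywords: list[str]

# Scorers return a float in [0, 1] rather than a bare pass/fail,
# since stochastic outputs rarely match an expectation exactly.
def exact_match(output: str, case: Case) -> float:
    return 1.0 if output.strip() == case.expected else 0.0

def contains_keywords(output: str, case: Case) -> float:
    hits = sum(kw.lower() in output.lower() for kw in case.keywords)
    return hits / len(case.keywords) if case.keywords else 1.0

def score_case(case: Case, scorers: list[Callable[[str, Case], float]]) -> float:
    output = call_llm(case.prompt)
    return mean(scorer(output, case) for scorer in scorers)

def test_capital_question() -> None:
    case = Case(
        prompt="What is the capital of France?",
        expected="Paris is the capital of France.",
        keywords=["Paris", "France"],
    )
    # Assert on an aggregate score threshold instead of strict equality.
    assert score_case(case, [exact_match, contains_keywords]) >= 0.75
```

In principle the same scorers could also be run against traces pulled from an observability platform, which is what would make them usable for online evals as well.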
I believe @dmontagu has a plan.