
Evals #915

Closed

Description

@samuelcolvin

(Or Evils, as I'm coming to think of them)

We want to build an open-source, sane way to score the performance of LLM calls that is:

  • local-first, so you don't need to use a service
  • flexible enough to work with whatever best practices emerge, and ideally usable for any code that is stochastic enough to require scoring beyond passed/failed (that means LLM SDKs used directly, or even other agent frameworks)
  • usable both for "offline evals" (unit-test style checks on performance) and for "online evals" that measure performance in production or an equivalent environment (presumably using an observability platform like Pydantic Logfire)
  • usable with Pydantic Logfire when and where that actually helps

I believe @dmontagu has a plan.
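
To make "scoring beyond passed/failed" and the unit-test style of offline evals a bit more concrete, here's a minimal sketch in plain Python with pytest. Everything in it (`answer_question`, `keyword_score`, the 0.8 threshold) is hypothetical and only illustrates the shape of the idea, not a proposed API.

```python
# Hypothetical sketch only: a "unit-test style" offline eval where the result
# is a score rather than a strict pass/fail per case. None of these names are
# a proposed API; they just illustrate the shape of the idea.
from statistics import mean


def answer_question(question: str) -> str:
    # Stand-in for the stochastic system under test (an LLM call, an agent
    # run, etc.). A canned reply keeps the sketch self-contained.
    if "France" in question:
        return "The capital of France is Paris."
    return "Red and blue are two of the primary colors."


def keyword_score(answer: str, expected_keywords: list[str]) -> float:
    # Toy scorer: fraction of expected keywords present in the answer (0.0-1.0).
    hits = sum(1 for kw in expected_keywords if kw.lower() in answer.lower())
    return hits / len(expected_keywords)


CASES = [
    ("What is the capital of France?", ["paris"]),
    ("Name two primary colors.", ["red", "blue"]),
]


def test_answer_quality():
    # Rather than asserting that every case passes, assert that the aggregate
    # score clears a threshold -- stochastic output only needs to be
    # "good enough" across the dataset.
    scores = [keyword_score(answer_question(q), kws) for q, kws in CASES]
    assert mean(scores) >= 0.8
```

The same scorer could in principle be reused for online evals by attaching the score to production traces (e.g. in Logfire) instead of asserting on it in a test.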
