(Or "Evils", as I'm coming to think of them)
We want to build a sane, open-source way to score the performance of LLM calls that is:
- local-first - so you don't need to use a hosted service
- flexible enough to work with whatever best practice emerges - ideally usable for any code that is stochastic enough to require scoring beyond passed/failed (that includes LLM SDKs used directly, and even other agent frameworks)
- usable both for "offline evals" (unit-test-style checks on performance; see the sketch after this list) and "online evals" (measuring performance in production or an equivalent environment, presumably via an observability platform like Pydantic Logfire)
- usable with Pydantic Logfire when and where that actually helps
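
To make the "offline evals" point concrete, here is a minimal sketch of what a unit-test-style check with graded scoring might look like. Everything in it (the `call_llm` stub, the `Case` dataclass, the `exact_match` and `contains_keywords` scorers, and the `score_case` helper) is hypothetical illustration, not a proposed API; it just shows what scoring beyond passed/failed could mean in practice.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

# Hypothetical stand-in for a real LLM call (via an LLM SDK or agent framework).
def call_llm(prompt: str) -> str:
    return "Paris is the capital of France."

@dataclass
class Case:
    prompt: str
    expected: str
    keywords: list[str]

# Scorers return a float in [0, 1] rather than a bare pass/fail,
# since stochastic outputs rarely match an expectation exactly.
def exact_match(output: str, case: Case) -> float:
    return 1.0 if output.strip() == case.expected else 0.0

def contains_keywords(output: str, case: Case) -> float:
    hits = sum(kw.lower() in output.lower() for kw in case.keywords)
    return hits / len(case.keywords) if case.keywords else 1.0

def score_case(case: Case, scorers: list[Callable[[str, Case], float]]) -> float:
    output = call_llm(case.prompt)
    return mean(scorer(output, case) for scorer in scorers)

def test_capital_question() -> None:
    case = Case(
        prompt="What is the capital of France?",
        expected="Paris is the capital of France.",
        keywords=["Paris", "France"],
    )
    # Assert on an aggregate score threshold instead of strict equality.
    assert score_case(case, [exact_match, contains_keywords]) >= 0.75
```

In principle the same scorers could also be run against traces pulled from an observability platform, which is what would make them usable for online evals as well.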
I believe @dmontagu has a plan.