The REST API already has eval-related endpoints (/apps/{app_name}/eval_sets, /apps/{app_name}/eval_results) but they're all stubbed with Unimplemented handlers. There's currently no way to systematically evaluate agent performance.
Without this, we're stuck manually testing agents every time we make changes. We can't catch regressions, compare different prompts objectively, or tell whether an agent is actually getting better or worse over time.
I suggest implementing an evaluation framework that lets us define test cases, run them through an agent, and get performance metrics. For example (a rough scoring sketch follows the list):
Response QA
- RESPONSE_MATCH_SCORE - ROUGE-1 comparison against expected responses (algorithmic, no LLM needed)
- SEMANTIC_RESPONSE_MATCH - LLM-as-Judge to check if the response is semantically correct even when phrasing differs.
- RUBRIC_BASED_RESPONSE_QUALITY - Rubric-based evaluation against quality criteria defined by the user per test case, e.g. "is concise", "addresses the question directly".
Tools
- TOOL_TRAJECTORY_AVG_SCORE - Checks if the agent called the right tools in the right order with correct arguments.
- RUBRIC_BASED_TOOL_USE_QUALITY - Rubric-based custom criteria for how well tools are used (e.g., "searches before answering", "provides all required parameters").
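To make the first metric concrete, here's a minimal sketch (in Go, like the rest of the project) of how RESPONSE_MATCH_SCORE could be computed purely algorithmically via unigram (ROUGE-1) overlap. The `rouge1F1` helper and its placement are hypothetical; the real evaluator would live under `evaluators/` and implement whatever Evaluator interface we agree on.

```go
package evaluation

import "strings"

// rouge1F1 is a hypothetical helper illustrating how RESPONSE_MATCH_SCORE
// could be scored: unigram overlap (ROUGE-1 F1) between the agent's response
// and the expected response, with no LLM involved.
func rouge1F1(expected, actual string) float64 {
	expTokens := strings.Fields(strings.ToLower(expected))
	actTokens := strings.Fields(strings.ToLower(actual))
	if len(expTokens) == 0 || len(actTokens) == 0 {
		return 0
	}

	// Count unigrams in the expected response.
	expCounts := make(map[string]int, len(expTokens))
	for _, t := range expTokens {
		expCounts[t]++
	}

	// Count overlapping unigrams, consuming expected counts so repeated
	// tokens are not over-credited.
	overlap := 0
	for _, t := range actTokens {
		if expCounts[t] > 0 {
			expCounts[t]--
			overlap++
		}
	}

	precision := float64(overlap) / float64(len(actTokens))
	recall := float64(overlap) / float64(len(expTokens))
	if precision+recall == 0 {
		return 0
	}
	return 2 * precision * recall / (precision + recall)
}
```

This keeps the baseline response metric deterministic and cheap enough to run in CI.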
Possible architecture:
```
evaluation/
├── types.go      # EvalSet, EvalCase, EvalConfig
├── result.go     # Detailed result structures
├── evaluator.go  # Evaluator interface
├── metrics.go    # Metric type constants
├── registry.go   # Evaluator registry
├── runner.go     # Evaluation orchestration
├── storage/      # Storage implementations
├── llmjudge/     # LLM-as-Judge utilities
└── evaluators/   # Built-in evaluators (8 metrics)
```
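To give a feel for what types.go, metrics.go, evaluator.go, and registry.go might contain, here's a rough sketch. Everything below is a proposal, not a final API; in particular the shape of EvalCase, Invocation, and Result is just an assumption about what the built-in evaluators would need.

```go
package evaluation

import "context"

// Metric identifies a built-in evaluation metric (metrics.go).
type Metric string

const (
	MetricResponseMatchScore         Metric = "RESPONSE_MATCH_SCORE"
	MetricSemanticResponseMatch      Metric = "SEMANTIC_RESPONSE_MATCH"
	MetricRubricBasedResponseQuality Metric = "RUBRIC_BASED_RESPONSE_QUALITY"
	MetricToolTrajectoryAvgScore     Metric = "TOOL_TRAJECTORY_AVG_SCORE"
	MetricRubricBasedToolUseQuality  Metric = "RUBRIC_BASED_TOOL_USE_QUALITY"
)

// EvalCase is one test case: the input sent to the agent plus the
// expectations that evaluators score against (types.go).
type EvalCase struct {
	ID                string
	Input             string
	ExpectedResponse  string
	ExpectedToolCalls []ToolCall
	Rubrics           []string // user-defined criteria for rubric-based metrics
}

// ToolCall describes a tool invocation (expected or actual).
type ToolCall struct {
	Name string
	Args map[string]any
}

// Invocation captures what the agent actually produced for a case.
type Invocation struct {
	Response  string
	ToolCalls []ToolCall
}

// Result is the per-case, per-metric outcome (result.go).
type Result struct {
	Metric Metric
	Score  float64 // normalized to [0, 1]
	Reason string  // optional explanation, e.g. from the LLM judge
}

// Evaluator scores one invocation against one case (evaluator.go).
type Evaluator interface {
	Metric() Metric
	Evaluate(ctx context.Context, c EvalCase, inv Invocation) (Result, error)
}

// Registry maps metric names to evaluators (registry.go).
type Registry struct {
	evaluators map[Metric]Evaluator
}

func (r *Registry) Register(e Evaluator) {
	if r.evaluators == nil {
		r.evaluators = make(map[Metric]Evaluator)
	}
	r.evaluators[e.Metric()] = e
}

func (r *Registry) Get(m Metric) (Evaluator, bool) {
	e, ok := r.evaluators[m]
	return e, ok
}
```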
```mermaid
graph TB
    User[User/API] --> Runner[Evaluation Runner]
    Runner --> Storage[Storage Interface]
    Storage --> MemStore[In-Memory Storage]
    Storage --> FileStore[File-based Storage]
    Runner --> Registry[Evaluator Registry]
    Registry --> Eval1[Response Match]
    Registry --> Eval2[Semantic Match]
    Registry --> Eval3[Tool Trajectory]
    Registry --> Eval4[Rubric Response Quality]
    Registry --> Eval5[Rubric Tool Use Quality]
    Registry --> EvalN[...]
    Eval2 --> Judge[LLM Judge]
    Eval4 --> Judge
    Eval5 --> Judge
    Judge --> LLM[User's LLM Instance]
    Runner --> Agent[Agent Runner]
    Runner --> Results[Evaluation Results]
    Results --> Storage
```
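And a rough sketch of how runner.go could wire the diagram together, reusing the types from the previous snippet (EvalCase, Invocation, Metric, Registry, Result). The Storage and AgentRunner interfaces and the result structs are placeholders for illustration; Storage is what the in-memory and file-based implementations would satisfy, and it would also back the existing /apps/{app_name}/eval_sets and /apps/{app_name}/eval_results endpoints.

```go
package evaluation

import "context"

// EvalSet groups related cases (types.go).
type EvalSet struct {
	ID    string
	Cases []EvalCase
}

// CaseResult / EvalSetResult stand in for the "detailed result structures"
// in result.go.
type CaseResult struct {
	CaseID string
	Result Result
}

type EvalSetResult struct {
	EvalSetID string
	Results   []CaseResult
}

// Storage persists eval sets and results (storage/).
type Storage interface {
	GetEvalSet(ctx context.Context, appName, evalSetID string) (*EvalSet, error)
	SaveResult(ctx context.Context, appName string, res *EvalSetResult) error
}

// AgentRunner abstracts the existing agent execution layer: the evaluation
// runner only needs a way to turn a case input into an Invocation.
type AgentRunner interface {
	Run(ctx context.Context, input string) (Invocation, error)
}

// Runner orchestrates an evaluation run (runner.go).
type Runner struct {
	Storage  Storage
	Registry *Registry
	Agent    AgentRunner
}

// RunEvalSet executes every case in the set, applies the requested metrics,
// and persists the aggregated results.
func (r *Runner) RunEvalSet(ctx context.Context, appName, evalSetID string, metrics []Metric) (*EvalSetResult, error) {
	set, err := r.Storage.GetEvalSet(ctx, appName, evalSetID)
	if err != nil {
		return nil, err
	}

	out := &EvalSetResult{EvalSetID: evalSetID}
	for _, c := range set.Cases {
		inv, err := r.Agent.Run(ctx, c.Input)
		if err != nil {
			return nil, err
		}
		for _, m := range metrics {
			ev, ok := r.Registry.Get(m)
			if !ok {
				continue // unknown metric; could also surface an error
			}
			res, err := ev.Evaluate(ctx, c, inv)
			if err != nil {
				return nil, err
			}
			out.Results = append(out.Results, CaseResult{CaseID: c.ID, Result: res})
		}
	}

	if err := r.Storage.SaveResult(ctx, appName, out); err != nil {
		return nil, err
	}
	return out, nil
}
```

Keeping the LLM judge behind the Evaluator interface means only the semantic and rubric-based metrics need an LLM instance; RESPONSE_MATCH_SCORE and TOOL_TRAJECTORY_AVG_SCORE stay fully deterministic.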
I'm experimenting on a feature branch and would be happy to contribute/discuss the approach and adjust based on feedback!