
[Feature] Evaluations #240

@stefanoamorelli

Description


The REST API already has eval-related endpoints (/apps/{app_name}/eval_sets, /apps/{app_name}/eval_results) but they're all stubbed with Unimplemented handlers. There's currently no way to systematically evaluate agent performance.

Without this, we're stuck manually testing agents every time we make changes: we can't catch regressions, compare prompts objectively, or tell whether an agent is actually getting better or worse over time.

I suggest implementing an evaluation framework that lets us define test cases, run them through an agent, and get performance metrics back. For example:

Response QA

  • RESPONSE_MATCH_SCORE - ROUGE-1 comparison against expected responses (algorithmic, no LLM needed); a rough scoring sketch follows this list.
  • SEMANTIC_RESPONSE_MATCH - LLM-as-Judge to check if the response is semantically correct even when phrasing differs.
  • RUBRIC_BASED_RESPONSE_QUALITY - Rubric-based evaluation with quality criteria defined by the user per test case, e.g. "is concise" or "addresses the question directly".
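
For illustration, here's a minimal sketch of how RESPONSE_MATCH_SCORE could be computed without an LLM. The `rouge1F1` helper is a hypothetical name, and plain unigram-overlap F1 is just one ROUGE-1 variant; recall-only scoring or a per-case threshold would work too:

```go
package evaluation

import "strings"

// rouge1F1 computes a simple ROUGE-1 F1: unigram overlap between the agent's
// response and the expected response. Hypothetical helper, not an existing API.
func rouge1F1(candidate, reference string) float64 {
	candTokens := strings.Fields(strings.ToLower(candidate))
	refTokens := strings.Fields(strings.ToLower(reference))
	if len(candTokens) == 0 || len(refTokens) == 0 {
		return 0
	}

	// Count reference unigrams, then count how many candidate tokens they cover.
	refCounts := make(map[string]int, len(refTokens))
	for _, t := range refTokens {
		refCounts[t]++
	}
	overlap := 0
	for _, t := range candTokens {
		if refCounts[t] > 0 {
			refCounts[t]--
			overlap++
		}
	}

	precision := float64(overlap) / float64(len(candTokens))
	recall := float64(overlap) / float64(len(refTokens))
	if precision+recall == 0 {
		return 0
	}
	return 2 * precision * recall / (precision + recall)
}
```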

Tools

  • TOOL_TRAJECTORY_AVG_SCORE - Checks whether the agent called the right tools in the right order with the correct arguments; see the sketch after this list.
  • RUBRIC_BASED_TOOL_USE_QUALITY - Rubric-based custom criteria for how well tools are used (e.g., "searches before answering", "provides all required parameters").
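
A rough sketch of what the trajectory check could look like. `ToolCall` and `trajectoryScore` are placeholder names, and strict positional, exact-argument matching is only one possible policy (partial credit or argument-subset matching are alternatives worth discussing):

```go
package evaluation

import "reflect"

// ToolCall is a hypothetical record of a single tool invocation by the agent.
type ToolCall struct {
	Name string
	Args map[string]any
}

// trajectoryScore compares actual tool calls to expected ones position by
// position and returns the fraction that match exactly (same name, same
// arguments, same order).
func trajectoryScore(expected, actual []ToolCall) float64 {
	if len(expected) == 0 {
		if len(actual) == 0 {
			return 1
		}
		return 0
	}
	matches := 0
	for i, want := range expected {
		if i >= len(actual) {
			break
		}
		if actual[i].Name == want.Name && reflect.DeepEqual(actual[i].Args, want.Args) {
			matches++
		}
	}
	return float64(matches) / float64(len(expected))
}
```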

Possible architecture:

```
evaluation/
├── types.go              # EvalSet, EvalCase, EvalConfig
├── result.go             # Detailed result structures
├── evaluator.go          # Evaluator interface
├── metrics.go            # Metric type constants
├── registry.go           # Evaluator registry
├── runner.go             # Evaluation orchestration
├── storage/              # Storage implementations
├── llmjudge/             # LLM-as-Judge utilities
└── evaluators/           # Built-in evaluators (8 metrics)
```
```mermaid
graph TB
    User[User/API] --> Runner[Evaluation Runner]

    Runner --> Storage[Storage Interface]
    Storage --> MemStore[In-Memory Storage]
    Storage --> FileStore[File-based Storage]

    Runner --> Registry[Evaluator Registry]
    Registry --> Eval1[Response Match]
    Registry --> Eval2[Semantic Match]
    Registry --> Eval3[Tool Trajectory]
    Registry --> EvalN[...]

    Eval2 --> Judge[LLM Judge]
    Eval4 --> Judge
    Eval5 --> Judge

    Judge --> LLM[User's LLM Instance]

    Runner --> Agent[Agent Runner]

    Runner --> Results[Evaluation Results]
    Results --> Storage
```

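To make the layout above more concrete, here is a minimal sketch of what `types.go`, `evaluator.go`, and `registry.go` might contain. Every name here (EvalCase, EvalSet, Evaluator, Registry, AgentOutput, the metric constants) is a placeholder for discussion, not a settled API; `ToolCall` is the hypothetical type from the trajectory sketch above:

```go
package evaluation

import "context"

// Metric identifies a built-in or custom evaluator.
type Metric string

const (
	ResponseMatchScore     Metric = "RESPONSE_MATCH_SCORE"
	SemanticResponseMatch  Metric = "SEMANTIC_RESPONSE_MATCH"
	ToolTrajectoryAvgScore Metric = "TOOL_TRAJECTORY_AVG_SCORE"
)

// EvalCase is one test case: an input, the expected outcome, and optional
// per-case rubric criteria.
type EvalCase struct {
	ID               string
	Input            string
	ExpectedResponse string
	ExpectedTools    []ToolCall // ToolCall as sketched in the trajectory example
	Rubric           []string
}

// EvalSet groups cases with the metrics to run against them.
type EvalSet struct {
	Name    string
	Cases   []EvalCase
	Metrics []Metric
}

// AgentOutput is what the runner captured from executing the agent on a case.
type AgentOutput struct {
	Response  string
	ToolCalls []ToolCall
}

// Result holds the score and any per-metric details for one case.
type Result struct {
	CaseID  string
	Metric  Metric
	Score   float64
	Details string
}

// Evaluator scores a single case given the agent's actual output.
type Evaluator interface {
	Metric() Metric
	Evaluate(ctx context.Context, c EvalCase, actual AgentOutput) (Result, error)
}

// Registry maps metrics to evaluator implementations so the runner can look
// them up per EvalSet.
type Registry struct {
	evaluators map[Metric]Evaluator
}

func (r *Registry) Register(e Evaluator) {
	if r.evaluators == nil {
		r.evaluators = make(map[Metric]Evaluator)
	}
	r.evaluators[e.Metric()] = e
}

func (r *Registry) Get(m Metric) (Evaluator, bool) {
	e, ok := r.evaluators[m]
	return e, ok
}
```

The LLM-as-Judge evaluators (semantic match and the rubric-based metrics) would implement the same Evaluator interface and delegate to the user's own LLM instance via the llmjudge package, which keeps the framework model-agnostic.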

I'm experimenting on a feature branch and would be happy to contribute/discuss the approach and adjust based on feedback!
