
[Feature] Evaluations #240

@stefanoamorelli

Description


The REST API already has eval-related endpoints (/apps/{app_name}/eval_sets, /apps/{app_name}/eval_results) but they're all stubbed with Unimplemented handlers. There's currently no way to systematically evaluate agent performance.

Without this, we're stuck manually testing agents every time we make changes: we can't catch regressions, compare prompts objectively, or tell whether an agent is actually getting better or worse over time.

I suggest implementing an evaluation framework that lets us define test cases, run them through an agent, and get performance metrics back. For example:

Response QA

  • RESPONSE_MATCH_SCORE - ROUGE-1 comparison against expected responses (algorithmic, no LLM needed); a rough scoring sketch follows this list.
  • SEMANTIC_RESPONSE_MATCH - LLM-as-Judge to check if the response is semantically correct even when phrasing differs.
  • RUBRIC_BASED_RESPONSE_QUALITY - Rubric-based evaluation with quality criteria defined by the user per test case, e.g. "is concise" or "addresses the question directly".
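
For illustration, here's a minimal sketch of how RESPONSE_MATCH_SCORE could be computed without an LLM. The `rouge1F1` helper is a hypothetical name, and plain unigram-overlap F1 is just one ROUGE-1 variant; recall-only scoring or a per-case threshold would work too:

```go
package evaluation

import "strings"

// rouge1F1 computes a simple ROUGE-1 F1: unigram overlap between the agent's
// response and the expected response. Hypothetical helper, not an existing API.
func rouge1F1(candidate, reference string) float64 {
	candTokens := strings.Fields(strings.ToLower(candidate))
	refTokens := strings.Fields(strings.ToLower(reference))
	if len(candTokens) == 0 || len(refTokens) == 0 {
		return 0
	}

	// Count reference unigrams, then count how many candidate tokens they cover.
	refCounts := make(map[string]int, len(refTokens))
	for _, t := range refTokens {
		refCounts[t]++
	}
	overlap := 0
	for _, t := range candTokens {
		if refCounts[t] > 0 {
			refCounts[t]--
			overlap++
		}
	}

	precision := float64(overlap) / float64(len(candTokens))
	recall := float64(overlap) / float64(len(refTokens))
	if precision+recall == 0 {
		return 0
	}
	return 2 * precision * recall / (precision + recall)
}
```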

Tools

  • TOOL_TRAJECTORY_AVG_SCORE - Checks whether the agent called the right tools in the right order with the correct arguments; see the sketch after this list.
  • RUBRIC_BASED_TOOL_USE_QUALITY - Rubric-based custom criteria for how well tools are used (e.g., "searches before answering", "provides all required parameters").
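
A rough sketch of what the trajectory check could look like. `ToolCall` and `trajectoryScore` are placeholder names, and strict positional, exact-argument matching is only one possible policy (partial credit or argument-subset matching are alternatives worth discussing):

```go
package evaluation

import "reflect"

// ToolCall is a hypothetical record of a single tool invocation by the agent.
type ToolCall struct {
	Name string
	Args map[string]any
}

// trajectoryScore compares actual tool calls to expected ones position by
// position and returns the fraction that match exactly (same name, same
// arguments, same order).
func trajectoryScore(expected, actual []ToolCall) float64 {
	if len(expected) == 0 {
		if len(actual) == 0 {
			return 1
		}
		return 0
	}
	matches := 0
	for i, want := range expected {
		if i >= len(actual) {
			break
		}
		if actual[i].Name == want.Name && reflect.DeepEqual(actual[i].Args, want.Args) {
			matches++
		}
	}
	return float64(matches) / float64(len(expected))
}
```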

Possible architecture:

```
evaluation/
├── types.go              # EvalSet, EvalCase, EvalConfig
├── result.go             # Detailed result structures
├── evaluator.go          # Evaluator interface
├── metrics.go            # Metric type constants
├── registry.go           # Evaluator registry
├── runner.go             # Evaluation orchestration
├── storage/              # Storage implementations
├── llmjudge/             # LLM-as-Judge utilities
└── evaluators/           # Built-in evaluators (8 metrics)
```
```mermaid
graph TB
    User[User/API] --> Runner[Evaluation Runner]

    Runner --> Storage[Storage Interface]
    Storage --> MemStore[In-Memory Storage]
    Storage --> FileStore[File-based Storage]

    Runner --> Registry[Evaluator Registry]
    Registry --> Eval1[Response Match]
    Registry --> Eval2[Semantic Match]
    Registry --> Eval3[Tool Trajectory]
    Registry --> EvalN[...]

    Eval2 --> Judge[LLM Judge]
    Eval4 --> Judge
    Eval5 --> Judge

    Judge --> LLM[User's LLM Instance]

    Runner --> Agent[Agent Runner]

    Runner --> Results[Evaluation Results]
    Results --> Storage
```

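To make the layout above more concrete, here is a minimal sketch of what `types.go`, `evaluator.go`, and `registry.go` might contain. Every name here (EvalCase, EvalSet, Evaluator, Registry, AgentOutput, the metric constants) is a placeholder for discussion, not a settled API; `ToolCall` is the hypothetical type from the trajectory sketch above:

```go
package evaluation

import "context"

// Metric identifies a built-in or custom evaluator.
type Metric string

const (
	ResponseMatchScore     Metric = "RESPONSE_MATCH_SCORE"
	SemanticResponseMatch  Metric = "SEMANTIC_RESPONSE_MATCH"
	ToolTrajectoryAvgScore Metric = "TOOL_TRAJECTORY_AVG_SCORE"
)

// EvalCase is one test case: an input, the expected outcome, and optional
// per-case rubric criteria.
type EvalCase struct {
	ID               string
	Input            string
	ExpectedResponse string
	ExpectedTools    []ToolCall // ToolCall as sketched in the trajectory example
	Rubric           []string
}

// EvalSet groups cases with the metrics to run against them.
type EvalSet struct {
	Name    string
	Cases   []EvalCase
	Metrics []Metric
}

// AgentOutput is what the runner captured from executing the agent on a case.
type AgentOutput struct {
	Response  string
	ToolCalls []ToolCall
}

// Result holds the score and any per-metric details for one case.
type Result struct {
	CaseID  string
	Metric  Metric
	Score   float64
	Details string
}

// Evaluator scores a single case given the agent's actual output.
type Evaluator interface {
	Metric() Metric
	Evaluate(ctx context.Context, c EvalCase, actual AgentOutput) (Result, error)
}

// Registry maps metrics to evaluator implementations so the runner can look
// them up per EvalSet.
type Registry struct {
	evaluators map[Metric]Evaluator
}

func (r *Registry) Register(e Evaluator) {
	if r.evaluators == nil {
		r.evaluators = make(map[Metric]Evaluator)
	}
	r.evaluators[e.Metric()] = e
}

func (r *Registry) Get(m Metric) (Evaluator, bool) {
	e, ok := r.evaluators[m]
	return e, ok
}
```

The LLM-as-Judge evaluators (semantic match and the rubric-based metrics) would implement the same Evaluator interface and delegate to the user's own LLM instance via the llmjudge package, which keeps the framework model-agnostic.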

I'm experimenting on a feature branch and would be happy to contribute/discuss the approach and adjust based on feedback!
