
Conversation


@stefanoamorelli commented Nov 10, 2025

This PR introduces an evaluation framework for testing and measuring AI agent performance. It supports both algorithmic and LLM-as-Judge evaluation methods, with built-in metrics for response quality, tool usage, safety, and hallucination detection.

Tip

This PR uses atomic commits organized by feature. For the best review experience, I suggest reviewing commit-by-commit to follow the logical progression of the implementation.

Note

I follow the Conventional Commits specification for a structured commit history.


Features:

  • Evaluation Methods
    • Algorithmic evaluators (ROUGE-1 scoring, exact matching; see the ROUGE-1 sketch after this list);
    • LLM-as-Judge with customizable rubrics;
    • Multi-sample evaluation.
  • 8 Metrics
    • Response quality: match score, semantic matching, coherence, rubric-based;
    • Tool usage: trajectory scoring, rubric-based quality;
    • Safety & quality: harmlessness, hallucination detection.
  • Flexible Storage
    • In-memory storage for development/testing;
    • File-based storage with JSON persistence for CI/CD.
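
To make the algorithmic path concrete, below is a minimal, self-contained sketch of ROUGE-1 recall (clipped unigram overlap between a candidate response and a reference). It illustrates the metric itself; the function name is invented for the example and is not part of this PR's API.

  package main

  import (
      "fmt"
      "strings"
  )

  // rouge1Recall is a hypothetical helper (not part of this PR): the fraction of
  // reference unigrams that also appear in the candidate, with counts clipped so
  // repeated words are not over-credited.
  func rouge1Recall(candidate, reference string) float64 {
      refCounts := map[string]int{}
      for _, tok := range strings.Fields(strings.ToLower(reference)) {
          refCounts[tok]++
      }
      candCounts := map[string]int{}
      for _, tok := range strings.Fields(strings.ToLower(candidate)) {
          candCounts[tok]++
      }
      total, overlap := 0, 0
      for tok, n := range refCounts {
          total += n
          if c := candCounts[tok]; c < n {
              overlap += c
          } else {
              overlap += n
          }
      }
      if total == 0 {
          return 0
      }
      return float64(overlap) / float64(total)
  }

  func main() {
      fmt.Printf("ROUGE-1 recall: %.2f\n", rouge1Recall("the cat sat on the mat", "the cat is on the mat"))
  }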

Usage

  // Create evaluation runner
  evalRunner := evaluation.NewRunner(evaluation.RunnerConfig{
      AgentRunner:        agentRunner,
      Storage:            evalStorage,
      SessionService:     sessionService,
      AppName:            "my-app",
      RateLimitDelay:     6 * time.Second,
      MaxConcurrentEvals: 10,
  })

  // Define evaluation criteria
  config := &evaluation.EvalConfig{
      JudgeLLM:   judgeLLM,
      JudgeModel: "gemini-2.5-flash",
      Criteria: []evaluation.Criterion{
          &evaluation.Threshold{
              MinScore:   0.8,
              MetricType: evaluation.MetricResponseMatch,
          },
          &evaluation.LLMAsJudgeCriterion{
              Threshold: &evaluation.Threshold{
                  MinScore:   0.9,
                  MetricType: evaluation.MetricSafety,
              },
              MetricType: evaluation.MetricSafety,
              JudgeModel: "gemini-2.5-flash",
          },
      },
  }

  // Run evaluation
  result, err := evalRunner.RunEvalSet(ctx, evalSet, config)
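
As a hedged follow-up to the call above: the exact shape of the returned result is defined in this PR and not shown in the description, so the sketch below only checks the error and dumps the whole struct with %+v rather than assuming any particular field names. A caller such as a CI step would typically fail fast here.

  // Hypothetical follow-up (field names of result intentionally not assumed).
  if err != nil {
      log.Fatalf("evaluation run failed: %v", err)
  }
  log.Printf("evaluation finished: %+v", result)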

Testing

Two examples demonstrate the features:

  • examples/evaluation/basic/ - Simple introduction with 2 evaluators;
  • examples/evaluation/comprehensive/ - Full example covering all 8 metrics.

Run examples:

export GOOGLE_API_KEY=your_key
cd examples/evaluation/basic
go run main.go

cd ../comprehensive
go run main.go

stefanoamorelli force-pushed the feature/evaluation-framework branch 6 times, most recently from 2c120c1 to 9aba0da, on November 12, 2025 at 21:45
stefanoamorelli marked this pull request as ready for review on November 12, 2025 at 21:45
stefanoamorelli force-pushed the feature/evaluation-framework branch from 9aba0da to b454fdd on November 16, 2025 at 18:14
stefanoamorelli force-pushed the feature/evaluation-framework branch from b454fdd to f99730b on November 16, 2025 at 18:18
@ivanmkc (Collaborator) commented Nov 19, 2025

I saw your question in the community call. I'll have to defer to @mazas-google on roadmap questions regarding Eval.

I imagine the API will closely follow adk-python's implementation.
