
Agent Evaluations Workshop

A hands-on workshop for learning how to evaluate AI agents using .NET, Microsoft.Extensions.AI, and Azure AI Foundry. This project demonstrates best practices for testing and evaluating AI agent behavior, including retrieval accuracy, tool calling, task adherence, intent resolution, and prompt engineering.

🎯 Overview

This workshop teaches you how to build reliable AI agents by implementing structured evaluation patterns. You'll learn to:

  • Evaluate retrieval accuracy - Ensure your agent retrieves the correct documents from a vector database
  • Validate tool calling - Verify that agents call the right tools with correct arguments
  • Measure task adherence - Confirm agents follow instructions and constraints
  • Assess intent resolution - Test disambiguation of user queries
  • Iterate on prompts - Use meta-prompt evaluation loops to improve agent behavior

πŸ—οΈ Architecture

The solution is built using .NET Aspire for distributed application orchestration:

flowchart TB
    subgraph AppHost["AppHost (Aspire)"]
        direction LR
        Agent["Agent Service<br/>(ASP.NET Core)"]
        Postgres[("Azure Postgres<br/>(pgvector)")]
        Foundry["Azure AI Foundry<br/>(GPT-4o)"]
        
        Agent <--> Postgres
        Agent <--> Foundry
    end
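
The AppHost models these resources and their connections in code. Below is a minimal sketch of what that wiring could look like, assuming the standard Aspire PostgreSQL and Azure OpenAI hosting integrations; the resource names, the pgvector image tag, and the project reference are illustrative rather than taken from this repository.

var builder = DistributedApplication.CreateBuilder(args);

// PostgreSQL container; swapping in the pgvector image is an assumption about
// how vector search is enabled locally.
var postgres = builder.AddPostgres("postgres")
    .WithImage("pgvector/pgvector", "pg17");
var vectorDb = postgres.AddDatabase("agentdb");

// Azure AI Foundry / Azure OpenAI resource that hosts the GPT-4o deployment.
var foundry = builder.AddAzureOpenAI("foundry");

// The agent service references both resources so Aspire injects connection details.
builder.AddProject<Projects.AgentEvalsWorkshop>("agent-service")
    .WithReference(vectorDb)
    .WithReference(foundry);

builder.Build().Run();

Running the AppHost brings up these dependencies and launches the agent service with the configuration it needs.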

Projects

Project | Description
------- | -----------
AgentEvalsWorkshop | Main agent service with retrieval, tools, and agent logic
AgentEvalsWorkshop.AppHost | .NET Aspire orchestrator for local development
AgentEvalsWorkshop.ServiceDefaults | Shared service configuration and extensions
AgentEvalsWorkshop.Tests | Evaluation tests using Microsoft.Extensions.AI.Evaluation

πŸ“‹ Prerequisites

  • .NET 10 SDK or later
  • Docker Desktop (for PostgreSQL with pgvector)
  • Azure CLI (for Azure resources)
  • An Azure subscription with access to Azure AI Foundry (optional; recordings are supported for offline use)

πŸš€ Getting Started

1. Clone the Repository

git clone https://github.com/seiggy/agent-unit-testing.git
cd agent-unit-testing

2. Login to Azure CLI

az login

3. Start the Aspire Project

# Start the Aspire orchestrator
dotnet run --project src/AgentEvalsWorkshop.AppHost

When prompted, select your Azure subscription and enter a resource group name of your choice (or accept the default).

Once the AppHost has started successfully, stop it; it isn't needed for now.

4. Run the Tests

# Run all evaluation tests
dotnet test tests/AgentEvalsWorkshop.Tests
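
While working through a single exercise, you can scope the run with the standard dotnet test filter (the test name below is only an example):

# Run only tests whose names contain "TaskAdherence"
dotnet test tests/AgentEvalsWorkshop.Tests --filter "FullyQualifiedName~TaskAdherence"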

πŸ“š Workshop Exercises

The workshop is structured into five progressive exercises:

US0: Introduction & Environment Setup

Goal: Set up your development environment and configure Azure AI Foundry connectivity

  • Clone and open the workshop repository
  • Understand the solution structure
  • Configure Azure AI Foundry credentials
  • Verify the Aspire AppHost starts successfully

πŸ“„ Full Instructions

US1: TaskAdherenceEvaluator

Goal: Learn to use the TaskAdherenceEvaluator to evaluate agent tool usage

  • Configure AI evaluation reporting
  • Use TaskAdherenceEvaluator to measure agent performance
  • Write integration tests for AI agents
  • Interpret evaluation metrics and assertions

πŸ“„ Full Instructions

US2: Retrieval Evaluation with Built-in Evaluators

Goal: Use multiple built-in evaluators (Relevance, Coherence, Groundedness) together

  • Use data-driven tests to evaluate multiple scenarios
  • Work with GroundednessEvaluatorContext for knowledge base validation
  • Interpret evaluation metrics from multiple evaluators simultaneously

πŸ“„ Full Instructions

US3: Creating a Custom Evaluator

Goal: Build a custom AnswerScoringEvaluator using the LLM-as-Judge pattern

  • Implement the IEvaluator interface
  • Create custom EvaluationContext classes
  • Use structured output from LLMs with GetResponseAsync<T>()
  • Integrate custom evaluators with built-in evaluators

πŸ“„ Full Instructions

US4: Meta-Prompt Improvement Loop

Goal: Build a PromptImprovementGenerator for evaluation-driven development

  • Iterate on prompt structure using AI-generated improvements
  • Analyze test failures to automatically suggest improved prompts
  • Track improvement trajectory across iterations
  • Document prompt engineering decisions

πŸ“„ Full Instructions

πŸ§ͺ Evaluation Framework

This workshop uses Microsoft.Extensions.AI.Evaluation for testing agent behavior:

// Example evaluators
var relevanceEvaluator = new RelevanceEvaluator();
var coherenceEvaluator = new CoherenceEvaluator();
var wordCountEvaluator = new WordCountEvaluator();
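
In the tests, evaluators are usually run through a reporting configuration so results are cached and written to TestResults/ for the report tooling. A rough sketch, assuming the disk-based reporting setup from Microsoft.Extensions.AI.Evaluation.Reporting; the storage path, execution name, and scenario name are illustrative.

// Bundle evaluators, the judge model, and on-disk storage for result reports.
ReportingConfiguration reportingConfiguration = DiskBasedReportingConfiguration.Create(
    storageRootPath: "TestResults",
    evaluators: [new RelevanceEvaluator(), new CoherenceEvaluator()],
    chatConfiguration: new ChatConfiguration(chatClient),
    executionName: "workshop-run");

// Each test creates a scenario run and evaluates the agent's output through it;
// messages and response are as in the US1 sketch above.
await using ScenarioRun scenario =
    await reportingConfiguration.CreateScenarioRunAsync("US1.PasswordReset");
EvaluationResult result = await scenario.EvaluateAsync(messages, response);

The stored results can then be turned into an HTML report with the aieval dotnet tool from Microsoft.Extensions.AI.Evaluation.Console.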

Available Evaluators

Evaluator | Purpose
--------- | -------
RelevanceEvaluator | Measures response relevance to the query
CoherenceEvaluator | Assesses logical flow and clarity
ToolCallAccuracyEvaluator | Validates correct tool invocations
TaskAdherenceEvaluator | Checks compliance with task instructions
IntentResolutionEvaluator | Measures disambiguation accuracy

πŸ“ Project Structure

agent-unit-testing/
β”œβ”€β”€ exercises/                    # Workshop exercise instructions
β”‚   β”œβ”€β”€ US0-intro.md              # Introduction & Environment Setup
β”‚   β”œβ”€β”€ US1-taskadheranceeval.md  # TaskAdherenceEvaluator
β”‚   β”œβ”€β”€ US2-retrievalevaluator.md # Retrieval Evaluation with Built-in Evaluators
β”‚   β”œβ”€β”€ US3-customevaluator.md    # Creating a Custom Evaluator
β”‚   └── US4-meta-prompt.md        # Meta-Prompt Improvement Loop
β”œβ”€β”€ infra/
β”‚   β”œβ”€β”€ scripts/                  # Infrastructure scripts
β”‚   └── seed/                     # Seed data for PostgreSQL
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ AgentEvalsWorkshop/       # Main agent service
β”‚   β”‚   β”œβ”€β”€ Agents/               # Agent implementations
β”‚   β”‚   β”œβ”€β”€ Retrieval/            # Vector retrieval logic
β”‚   β”‚   └── Tools/                # Agent tools
β”‚   β”œβ”€β”€ AgentEvalsWorkshop.AppHost/        # Aspire orchestrator
β”‚   └── AgentEvalsWorkshop.ServiceDefaults/ # Shared configuration
β”œβ”€β”€ tests/
β”‚   └── AgentEvalsWorkshop.Tests/ # Evaluation tests
└── TestResults/                  # Test output and reports

πŸ”§ Configuration

appsettings.json

The application uses standard ASP.NET Core configuration. Key settings:

{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.AspNetCore": "Warning"
    }
  }
}

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

πŸ“– Resources

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.
