A hands-on workshop for learning how to evaluate AI agents using .NET, Microsoft.Extensions.AI, and Azure AI Foundry. This project demonstrates best practices for testing and evaluating AI agent behavior, including retrieval accuracy, tool calling, task adherence, intent resolution, and prompt engineering.
This workshop teaches you how to build reliable AI agents by implementing structured evaluation patterns. You'll learn to:
- Evaluate retrieval accuracy - Ensure your agent retrieves the correct documents from a vector database
- Validate tool calling - Verify that agents call the right tools with correct arguments
- Measure task adherence - Confirm agents follow instructions and constraints
- Assess intent resolution - Test disambiguation of user queries
- Iterate on prompts - Use meta-prompt evaluation loops to improve agent behavior
The solution is built using .NET Aspire for distributed application orchestration:
```mermaid
flowchart TB
    subgraph AppHost["AppHost (Aspire)"]
        direction LR
        Agent["Agent Service<br/>(ASP.NET Core)"]
        Postgres[("Azure Postgres<br/>(pgvector)")]
        Foundry["Azure AI Foundry<br/>(GPT-4o)"]
        Agent <--> Postgres
        Agent <--> Foundry
    end
```
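For orientation, here is a minimal sketch of what the AppHost wiring for this topology could look like with Aspire's hosting APIs. The resource names and the use of a plain connection string for Azure AI Foundry are assumptions for illustration, not a copy of the workshop's `Program.cs`.

```csharp
// A minimal sketch of Aspire AppHost wiring for the topology above.
// Resource names ("postgres", "vectordb", "foundry", "agent") are illustrative;
// the workshop's actual Program.cs may differ.
var builder = DistributedApplication.CreateBuilder(args);

// Local PostgreSQL container with a database for pgvector-backed retrieval.
var vectorDb = builder.AddPostgres("postgres")
                      .AddDatabase("vectordb");

// Azure AI Foundry (GPT-4o) endpoint supplied as a connection string.
var foundry = builder.AddConnectionString("foundry");

// The agent service references both resources so connection details flow in
// through configuration at runtime.
builder.AddProject<Projects.AgentEvalsWorkshop>("agent")
       .WithReference(vectorDb)
       .WithReference(foundry);

builder.Build().Run();
```

`WithReference` is what lets the agent service pick up the Postgres and Foundry connection information without hard-coding it.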
| Project | Description |
|---|---|
| `AgentEvalsWorkshop` | Main agent service with retrieval, tools, and agent logic |
| `AgentEvalsWorkshop.AppHost` | .NET Aspire orchestrator for local development |
| `AgentEvalsWorkshop.ServiceDefaults` | Shared service configuration and extensions |
| `AgentEvalsWorkshop.Tests` | Evaluation tests using Microsoft.Extensions.AI.Evaluation |
- .NET 10 SDK or later
- Docker Desktop (for PostgreSQL with pgvector)
- Azure CLI (for Azure resources)
- An Azure subscription with access to Azure AI Foundry (optional - supports recordings for offline use)
```bash
git clone https://github.com/seiggy/agent-unit-testing.git
cd agent-unit-testing
az login

# Start the Aspire orchestrator
dotnet run --project src/AgentEvalsWorkshop.AppHost
```

At this point, you'll be asked to select a subscription and resource group name. Find your associated subscription, and create a resource group name of your choice (or accept the default).

Stop the server. We won't need it for now.

```bash
# Run all evaluation tests
dotnet test tests/AgentEvalsWorkshop.Tests
```

The workshop is structured into five progressive exercises:
Goal: Set up your development environment and configure Azure AI Foundry connectivity
- Clone and open the workshop repository
- Understand the solution structure
- Configure Azure AI Foundry credentials
- Verify the Aspire AppHost starts successfully
📖 [Full Instructions](exercises/US0-intro.md)
Goal: Learn to use the TaskAdherenceEvaluator to evaluate agent tool usage
- Configure AI evaluation reporting
- Use TaskAdherenceEvaluator to measure agent performance
- Write integration tests for AI agents
- Interpret evaluation metrics and assertions
📖 [Full Instructions](exercises/US1-taskadheranceeval.md)
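To give a feel for where Exercise 1 lands, below is a sketch of an evaluation test built on `TaskAdherenceEvaluator`. It assumes xUnit and an already-configured `IChatClient` (the `TestSetup.CreateChatClient` helper is hypothetical), and the metric-name constant and pass threshold are illustrative rather than prescribed by the workshop.

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;
using Xunit;

public class TaskAdherenceTests
{
    [Fact]
    public async Task Agent_follows_task_instructions()
    {
        // Hypothetical helper that returns an IChatClient pointed at the
        // workshop's Azure AI Foundry deployment.
        IChatClient chatClient = TestSetup.CreateChatClient();

        var messages = new List<ChatMessage>
        {
            new(ChatRole.System, "You are a support agent. Always answer in one short paragraph."),
            new(ChatRole.User, "How do I reset my password?")
        };

        ChatResponse response = await chatClient.GetResponseAsync(messages);

        // TaskAdherenceEvaluator is LLM-based, so it needs a ChatConfiguration
        // wrapping the judge model.
        var evaluator = new TaskAdherenceEvaluator();
        EvaluationResult result = await evaluator.EvaluateAsync(
            messages, response, new ChatConfiguration(chatClient));

        // Metric name constant and pass threshold shown here are illustrative.
        NumericMetric adherence =
            result.Get<NumericMetric>(TaskAdherenceEvaluator.TaskAdherenceMetricName);
        Assert.True(adherence.Value >= 4, $"Task adherence too low: {adherence.Value}");
    }
}
```

The exercise itself also walks through configuring evaluation reporting so results like these are captured for the report tooling.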
Goal: Use multiple built-in evaluators (Relevance, Coherence, Groundedness) together
- Use data-driven tests to evaluate multiple scenarios
- Work with GroundednessEvaluatorContext for knowledge base validation
- Interpret evaluation metrics from multiple evaluators simultaneously
📖 [Full Instructions](exercises/US2-retrievalevaluator.md)
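As a sketch of the pattern Exercise 2 builds toward, the snippet below runs Relevance, Coherence, and Groundedness over a single exchange, passing the retrieved passage through `GroundednessEvaluatorContext`. The helper method, the sample grounding text, and the exact constructor shape of the context class are assumptions for illustration.

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;

public static class MultiEvaluatorExample
{
    // Evaluates one exchange with Relevance, Coherence, and Groundedness together.
    // In the workshop, retrievedPassage would come from the pgvector retrieval step.
    public static async Task EvaluateExchangeAsync(
        IChatClient judgeClient,
        IList<ChatMessage> messages,
        ChatResponse response,
        string retrievedPassage)
    {
        var chatConfig = new ChatConfiguration(judgeClient);

        // Only GroundednessEvaluator consumes this context; the others ignore it.
        var grounding = new GroundednessEvaluatorContext(retrievedPassage);

        IEvaluator[] evaluators =
        [
            new RelevanceEvaluator(),
            new CoherenceEvaluator(),
            new GroundednessEvaluator()
        ];

        foreach (var evaluator in evaluators)
        {
            EvaluationResult result = await evaluator.EvaluateAsync(
                messages, response, chatConfig, additionalContext: [grounding]);

            foreach (var metric in result.Metrics.Values)
            {
                Console.WriteLine($"{metric.Name}: {(metric as NumericMetric)?.Value}");
            }
        }
    }
}
```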
Goal: Build a custom AnswerScoringEvaluator using the LLM-as-Judge pattern
- Implement the `IEvaluator` interface
- Create custom EvaluationContext classes
- Use structured output from LLMs with `GetResponseAsync<T>()`
- Integrate custom evaluators with built-in evaluators
📖 [Full Instructions](exercises/US3-customevaluator.md)
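One possible shape for the Exercise 3 evaluator is sketched below as an LLM-as-Judge: it pulls the expected answer from a custom `EvaluationContext`, asks the judge model for a structured verdict via `GetResponseAsync<T>()`, and reports a `NumericMetric`. The rubric, metric name, and the `EvaluationContext` base-constructor arguments are assumptions, not the workshop's reference implementation.

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;

// Illustrative context carrying the expected ("ground truth") answer.
// The base-constructor arguments are an assumption about the EvaluationContext
// shape; adjust to the package version you're using.
public sealed class AnswerScoringContext(string expectedAnswer)
    : EvaluationContext("Expected Answer", new TextContent(expectedAnswer))
{
    public string ExpectedAnswer { get; } = expectedAnswer;
}

// LLM-as-Judge evaluator: asks the judge model for a structured 1-5 verdict.
public sealed class AnswerScoringEvaluator : IEvaluator
{
    public const string MetricName = "Answer Score";

    public IReadOnlyCollection<string> EvaluationMetricNames { get; } = [MetricName];

    private sealed record ScoringVerdict(int Score, string Reasoning);

    public async ValueTask<EvaluationResult> EvaluateAsync(
        IEnumerable<ChatMessage> messages,
        ChatResponse modelResponse,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<EvaluationContext>? additionalContext = null,
        CancellationToken cancellationToken = default)
    {
        var expected = additionalContext?.OfType<AnswerScoringContext>().FirstOrDefault();

        string judgePrompt =
            $"""
            Score the ANSWER from 1 (wrong) to 5 (fully correct) against the EXPECTED answer.
            EXPECTED: {expected?.ExpectedAnswer}
            ANSWER: {modelResponse.Text}
            """;

        // GetResponseAsync<T> requests structured (JSON) output from the judge model.
        // A real implementation should validate chatConfiguration instead of using '!'.
        var verdict = await chatConfiguration!.ChatClient.GetResponseAsync<ScoringVerdict>(
            judgePrompt, cancellationToken: cancellationToken);

        var metric = new NumericMetric(MetricName, verdict.Result.Score)
        {
            Reason = verdict.Result.Reasoning
        };

        return new EvaluationResult(metric);
    }
}
```

In a test, you would pass `new AnswerScoringContext(expectedAnswer)` through `additionalContext`, side by side with the built-in evaluators.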
Goal: Build a PromptImprovementGenerator for evaluation-driven development
- Iterate on prompt structure using AI-generated improvements
- Analyze test failures to automatically suggest improved prompts
- Track improvement trajectory across iterations
- Document prompt engineering decisions
📖 [Full Instructions](exercises/US4-meta-prompt.md)
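A rough sketch of the idea behind Exercise 4, assuming nothing about the workshop's actual `PromptImprovementGenerator` beyond its name: summarize the failing evaluation metrics, hand them to a model together with the current system prompt, and request a revised prompt as structured output.

```csharp
using Microsoft.Extensions.AI;

// Sketch of a meta-prompt improvement loop. The class shape is illustrative;
// the workshop's PromptImprovementGenerator may look quite different.
public sealed class PromptImprovementGenerator(IChatClient chatClient)
{
    private sealed record PromptSuggestion(string ImprovedPrompt, string Rationale);

    // failureSummary: a human-readable digest of failing evaluation metrics,
    // e.g. "Task adherence 2/5: ignored the one-paragraph constraint."
    public async Task<string> SuggestImprovedPromptAsync(
        string currentPrompt, string failureSummary, CancellationToken ct = default)
    {
        string metaPrompt =
            $"""
            You are a prompt engineer. The system prompt below produced agent responses
            that failed evaluation. Rewrite the prompt to address the failures while
            preserving its original intent and constraints.

            CURRENT PROMPT:
            {currentPrompt}

            EVALUATION FAILURES:
            {failureSummary}
            """;

        // Structured output keeps the rationale separate so it can be logged
        // alongside the improvement trajectory across iterations.
        var suggestion = await chatClient.GetResponseAsync<PromptSuggestion>(
            metaPrompt, cancellationToken: ct);

        return suggestion.Result.ImprovedPrompt;
    }
}
```

Each iteration re-runs the evaluation suite with the suggested prompt and records the scores, which is what makes the improvement trajectory trackable.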
This workshop uses Microsoft.Extensions.AI.Evaluation for testing agent behavior:
```csharp
// Example evaluators
var relevanceEvaluator = new RelevanceEvaluator();
var coherenceEvaluator = new CoherenceEvaluator();
var wordCountEvaluator = new WordCountEvaluator();
```

| Evaluator | Purpose |
|---|---|
| `RelevanceEvaluator` | Measures response relevance to the query |
| `CoherenceEvaluator` | Assesses logical flow and clarity |
| `ToolCallAccuracyEvaluator` | Validates correct tool invocations |
| `TaskAdherenceEvaluator` | Checks compliance with task instructions |
| `IntentResolutionEvaluator` | Measures disambiguation accuracy |
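To persist results where the report tooling (and the `TestResults/` folder shown below) can pick them up, evaluators are typically run through the reporting layer. The sketch below assumes the Microsoft.Extensions.AI.Evaluation.Reporting packages; the method names follow the published samples, but treat the exact `Create` parameters and the scenario name as assumptions.

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;
using Microsoft.Extensions.AI.Evaluation.Reporting;
using Microsoft.Extensions.AI.Evaluation.Reporting.Storage;

public static class ReportingExample
{
    // Runs a named scenario through a disk-based reporting configuration so the
    // evaluation results are written to disk for the report tooling to aggregate.
    public static async Task<EvaluationResult> RunScenarioAsync(
        IChatClient judgeClient, IList<ChatMessage> messages, ChatResponse response)
    {
        ReportingConfiguration reporting = DiskBasedReportingConfiguration.Create(
            "TestResults",                                    // storage root (illustrative path)
            [new RelevanceEvaluator(), new CoherenceEvaluator(), new TaskAdherenceEvaluator()],
            new ChatConfiguration(judgeClient));              // judge model for LLM-based evaluators

        await using ScenarioRun run = await reporting.CreateScenarioRunAsync("PasswordResetScenario");
        return await run.EvaluateAsync(messages, response);
    }
}
```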
```
agent-unit-testing/
├── exercises/                              # Workshop exercise instructions
│   ├── US0-intro.md                        # Introduction & Environment Setup
│   ├── US1-taskadheranceeval.md            # TaskAdherenceEvaluator
│   ├── US2-retrievalevaluator.md           # Retrieval Evaluation with Built-in Evaluators
│   ├── US3-customevaluator.md              # Creating a Custom Evaluator
│   └── US4-meta-prompt.md                  # Meta-Prompt Improvement Loop
├── infra/
│   ├── scripts/                            # Infrastructure scripts
│   └── seed/                               # Seed data for PostgreSQL
├── src/
│   ├── AgentEvalsWorkshop/                 # Main agent service
│   │   ├── Agents/                         # Agent implementations
│   │   ├── Retrieval/                      # Vector retrieval logic
│   │   └── Tools/                          # Agent tools
│   ├── AgentEvalsWorkshop.AppHost/         # Aspire orchestrator
│   └── AgentEvalsWorkshop.ServiceDefaults/ # Shared configuration
├── tests/
│   └── AgentEvalsWorkshop.Tests/           # Evaluation tests
└── TestResults/                            # Test output and reports
```
The application uses standard ASP.NET Core configuration. Key settings:
```json
{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.AspNetCore": "Warning"
    }
  }
}
```

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.