A hands-on workshop for learning how to evaluate AI agents using .NET, Microsoft.Extensions.AI, and Azure AI Foundry. This project demonstrates best practices for testing and evaluating AI agent behavior, including retrieval accuracy, tool calling, task adherence, intent resolution, and prompt engineering.
This workshop teaches you how to build reliable AI agents by implementing structured evaluation patterns. You'll learn to:
- Evaluate retrieval accuracy - Ensure your agent retrieves the correct documents from a vector database
- Validate tool calling - Verify that agents call the right tools with correct arguments
- Measure task adherence - Confirm agents follow instructions and constraints
- Assess intent resolution - Test disambiguation of user queries
- Iterate on prompts - Use meta-prompt evaluation loops to improve agent behavior
The solution is built using .NET Aspire for distributed application orchestration:
```mermaid
flowchart TB
    subgraph AppHost["AppHost (Aspire)"]
        direction LR
        Agent["Agent Service<br/>(ASP.NET Core)"]
        Postgres[("Azure Postgres<br/>(pgvector)")]
        Foundry["Azure AI Foundry<br/>(GPT-4o)"]
        Agent <--> Postgres
        Agent <--> Foundry
    end
```
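For orientation, here is a minimal sketch of what the AppHost wiring for this topology could look like with Aspire's hosting APIs. The resource names and the use of a plain connection string for Azure AI Foundry are assumptions for illustration, not a copy of the workshop's `Program.cs`.

```csharp
// A minimal sketch of Aspire AppHost wiring for the topology above.
// Resource names ("postgres", "vectordb", "foundry", "agent") are illustrative;
// the workshop's actual Program.cs may differ.
var builder = DistributedApplication.CreateBuilder(args);

// Local PostgreSQL container with a database for pgvector-backed retrieval.
var vectorDb = builder.AddPostgres("postgres")
                      .AddDatabase("vectordb");

// Azure AI Foundry (GPT-4o) endpoint supplied as a connection string.
var foundry = builder.AddConnectionString("foundry");

// The agent service references both resources so connection details flow in
// through configuration at runtime.
builder.AddProject<Projects.AgentEvalsWorkshop>("agent")
       .WithReference(vectorDb)
       .WithReference(foundry);

builder.Build().Run();
```

`WithReference` is what lets the agent service pick up the Postgres and Foundry connection information without hard-coding it.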
| Project | Description |
|---|---|
| `AgentEvalsWorkshop` | Main agent service with retrieval, tools, and agent logic |
| `AgentEvalsWorkshop.AppHost` | .NET Aspire orchestrator for local development |
| `AgentEvalsWorkshop.ServiceDefaults` | Shared service configuration and extensions |
| `AgentEvalsWorkshop.Tests` | Evaluation tests using Microsoft.Extensions.AI.Evaluation |
- .NET 10 SDK or later
- Docker Desktop (for PostgreSQL with pgvector)
- Azure CLI (for Azure resources)
- An Azure subscription with access to Azure AI Foundry (optional - supports recordings for offline use)
```bash
git clone https://github.com/seiggy/agent-unit-testing.git
cd agent-unit-testing
az login

# Start the Aspire orchestrator
dotnet run --project src/AgentEvalsWorkshop.AppHost
```

At this point, you'll be asked to select a subscription and resource group name. Find your associated subscription, and create a resource group name of your choice (or accept the default).

Stop the server. We won't need it for now.

```bash
# Run all evaluation tests
dotnet test tests/AgentEvalsWorkshop.Tests
```

The workshop is structured into five progressive exercises:
Goal: Set up your development environment and configure Azure AI Foundry connectivity
- Clone and open the workshop repository
- Understand the solution structure
- Configure Azure AI Foundry credentials
- Verify the Aspire AppHost starts successfully
📖 [Full Instructions](exercises/US0-intro.md)
Goal: Learn to use the TaskAdherenceEvaluator to evaluate agent tool usage
- Configure AI evaluation reporting
- Use TaskAdherenceEvaluator to measure agent performance
- Write integration tests for AI agents
- Interpret evaluation metrics and assertions
📖 [Full Instructions](exercises/US1-taskadheranceeval.md)
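To give a feel for where Exercise 1 lands, below is a sketch of an evaluation test built on `TaskAdherenceEvaluator`. It assumes xUnit and an already-configured `IChatClient` (the `TestSetup.CreateChatClient` helper is hypothetical), and the metric-name constant and pass threshold are illustrative rather than prescribed by the workshop.

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;
using Xunit;

public class TaskAdherenceTests
{
    [Fact]
    public async Task Agent_follows_task_instructions()
    {
        // Hypothetical helper that returns an IChatClient pointed at the
        // workshop's Azure AI Foundry deployment.
        IChatClient chatClient = TestSetup.CreateChatClient();

        var messages = new List<ChatMessage>
        {
            new(ChatRole.System, "You are a support agent. Always answer in one short paragraph."),
            new(ChatRole.User, "How do I reset my password?")
        };

        ChatResponse response = await chatClient.GetResponseAsync(messages);

        // TaskAdherenceEvaluator is LLM-based, so it needs a ChatConfiguration
        // wrapping the judge model.
        var evaluator = new TaskAdherenceEvaluator();
        EvaluationResult result = await evaluator.EvaluateAsync(
            messages, response, new ChatConfiguration(chatClient));

        // Metric name constant and pass threshold shown here are illustrative.
        NumericMetric adherence =
            result.Get<NumericMetric>(TaskAdherenceEvaluator.TaskAdherenceMetricName);
        Assert.True(adherence.Value >= 4, $"Task adherence too low: {adherence.Value}");
    }
}
```

The exercise itself also walks through configuring evaluation reporting so results like these are captured for the report tooling.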
Goal: Use multiple built-in evaluators (Relevance, Coherence, Groundedness) together
- Use data-driven tests to evaluate multiple scenarios
- Work with GroundednessEvaluatorContext for knowledge base validation
- Interpret evaluation metrics from multiple evaluators simultaneously
📖 [Full Instructions](exercises/US2-retrievalevaluator.md)
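As a sketch of the pattern Exercise 2 builds toward, the snippet below runs Relevance, Coherence, and Groundedness over a single exchange, passing the retrieved passage through `GroundednessEvaluatorContext`. The helper method, the sample grounding text, and the exact constructor shape of the context class are assumptions for illustration.

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;

public static class MultiEvaluatorExample
{
    // Evaluates one exchange with Relevance, Coherence, and Groundedness together.
    // In the workshop, retrievedPassage would come from the pgvector retrieval step.
    public static async Task EvaluateExchangeAsync(
        IChatClient judgeClient,
        IList<ChatMessage> messages,
        ChatResponse response,
        string retrievedPassage)
    {
        var chatConfig = new ChatConfiguration(judgeClient);

        // Only GroundednessEvaluator consumes this context; the others ignore it.
        var grounding = new GroundednessEvaluatorContext(retrievedPassage);

        IEvaluator[] evaluators =
        [
            new RelevanceEvaluator(),
            new CoherenceEvaluator(),
            new GroundednessEvaluator()
        ];

        foreach (var evaluator in evaluators)
        {
            EvaluationResult result = await evaluator.EvaluateAsync(
                messages, response, chatConfig, additionalContext: [grounding]);

            foreach (var metric in result.Metrics.Values)
            {
                Console.WriteLine($"{metric.Name}: {(metric as NumericMetric)?.Value}");
            }
        }
    }
}
```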
Goal: Build a custom AnswerScoringEvaluator using the LLM-as-Judge pattern
- Implement the `IEvaluator` interface
- Create custom EvaluationContext classes
- Use structured output from LLMs with `GetResponseAsync<T>()`
- Integrate custom evaluators with built-in evaluators
📖 [Full Instructions](exercises/US3-customevaluator.md)
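One possible shape for the Exercise 3 evaluator is sketched below as an LLM-as-Judge: it pulls the expected answer from a custom `EvaluationContext`, asks the judge model for a structured verdict via `GetResponseAsync<T>()`, and reports a `NumericMetric`. The rubric, metric name, and the `EvaluationContext` base-constructor arguments are assumptions, not the workshop's reference implementation.

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;

// Illustrative context carrying the expected ("ground truth") answer.
// The base-constructor arguments are an assumption about the EvaluationContext
// shape; adjust to the package version you're using.
public sealed class AnswerScoringContext(string expectedAnswer)
    : EvaluationContext("Expected Answer", new TextContent(expectedAnswer))
{
    public string ExpectedAnswer { get; } = expectedAnswer;
}

// LLM-as-Judge evaluator: asks the judge model for a structured 1-5 verdict.
public sealed class AnswerScoringEvaluator : IEvaluator
{
    public const string MetricName = "Answer Score";

    public IReadOnlyCollection<string> EvaluationMetricNames { get; } = [MetricName];

    private sealed record ScoringVerdict(int Score, string Reasoning);

    public async ValueTask<EvaluationResult> EvaluateAsync(
        IEnumerable<ChatMessage> messages,
        ChatResponse modelResponse,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<EvaluationContext>? additionalContext = null,
        CancellationToken cancellationToken = default)
    {
        var expected = additionalContext?.OfType<AnswerScoringContext>().FirstOrDefault();

        string judgePrompt =
            $"""
            Score the ANSWER from 1 (wrong) to 5 (fully correct) against the EXPECTED answer.
            EXPECTED: {expected?.ExpectedAnswer}
            ANSWER: {modelResponse.Text}
            """;

        // GetResponseAsync<T> requests structured (JSON) output from the judge model.
        // A real implementation should validate chatConfiguration instead of using '!'.
        var verdict = await chatConfiguration!.ChatClient.GetResponseAsync<ScoringVerdict>(
            judgePrompt, cancellationToken: cancellationToken);

        var metric = new NumericMetric(MetricName, verdict.Result.Score)
        {
            Reason = verdict.Result.Reasoning
        };

        return new EvaluationResult(metric);
    }
}
```

In a test, you would pass `new AnswerScoringContext(expectedAnswer)` through `additionalContext`, side by side with the built-in evaluators.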
Goal: Build a PromptImprovementGenerator for evaluation-driven development
- Iterate on prompt structure using AI-generated improvements
- Analyze test failures to automatically suggest improved prompts
- Track improvement trajectory across iterations
- Document prompt engineering decisions
📖 [Full Instructions](exercises/US4-meta-prompt.md)
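A rough sketch of the idea behind Exercise 4, assuming nothing about the workshop's actual `PromptImprovementGenerator` beyond its name: summarize the failing evaluation metrics, hand them to a model together with the current system prompt, and request a revised prompt as structured output.

```csharp
using Microsoft.Extensions.AI;

// Sketch of a meta-prompt improvement loop. The class shape is illustrative;
// the workshop's PromptImprovementGenerator may look quite different.
public sealed class PromptImprovementGenerator(IChatClient chatClient)
{
    private sealed record PromptSuggestion(string ImprovedPrompt, string Rationale);

    // failureSummary: a human-readable digest of failing evaluation metrics,
    // e.g. "Task adherence 2/5: ignored the one-paragraph constraint."
    public async Task<string> SuggestImprovedPromptAsync(
        string currentPrompt, string failureSummary, CancellationToken ct = default)
    {
        string metaPrompt =
            $"""
            You are a prompt engineer. The system prompt below produced agent responses
            that failed evaluation. Rewrite the prompt to address the failures while
            preserving its original intent and constraints.

            CURRENT PROMPT:
            {currentPrompt}

            EVALUATION FAILURES:
            {failureSummary}
            """;

        // Structured output keeps the rationale separate so it can be logged
        // alongside the improvement trajectory across iterations.
        var suggestion = await chatClient.GetResponseAsync<PromptSuggestion>(
            metaPrompt, cancellationToken: ct);

        return suggestion.Result.ImprovedPrompt;
    }
}
```

Each iteration re-runs the evaluation suite with the suggested prompt and records the scores, which is what makes the improvement trajectory trackable.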
This workshop uses Microsoft.Extensions.AI.Evaluation for testing agent behavior:
```csharp
// Example evaluators
var relevanceEvaluator = new RelevanceEvaluator();
var coherenceEvaluator = new CoherenceEvaluator();
var wordCountEvaluator = new WordCountEvaluator();
```

| Evaluator | Purpose |
|---|---|
| `RelevanceEvaluator` | Measures response relevance to the query |
| `CoherenceEvaluator` | Assesses logical flow and clarity |
| `ToolCallAccuracyEvaluator` | Validates correct tool invocations |
| `TaskAdherenceEvaluator` | Checks compliance with task instructions |
| `IntentResolutionEvaluator` | Measures disambiguation accuracy |
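To persist results where the report tooling (and the `TestResults/` folder shown below) can pick them up, evaluators are typically run through the reporting layer. The sketch below assumes the Microsoft.Extensions.AI.Evaluation.Reporting packages; the method names follow the published samples, but treat the exact `Create` parameters and the scenario name as assumptions.

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;
using Microsoft.Extensions.AI.Evaluation.Reporting;
using Microsoft.Extensions.AI.Evaluation.Reporting.Storage;

public static class ReportingExample
{
    // Runs a named scenario through a disk-based reporting configuration so the
    // evaluation results are written to disk for the report tooling to aggregate.
    public static async Task<EvaluationResult> RunScenarioAsync(
        IChatClient judgeClient, IList<ChatMessage> messages, ChatResponse response)
    {
        ReportingConfiguration reporting = DiskBasedReportingConfiguration.Create(
            "TestResults",                                    // storage root (illustrative path)
            [new RelevanceEvaluator(), new CoherenceEvaluator(), new TaskAdherenceEvaluator()],
            new ChatConfiguration(judgeClient));              // judge model for LLM-based evaluators

        await using ScenarioRun run = await reporting.CreateScenarioRunAsync("PasswordResetScenario");
        return await run.EvaluateAsync(messages, response);
    }
}
```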
```
agent-unit-testing/
├── exercises/                              # Workshop exercise instructions
│   ├── US0-intro.md                        # Introduction & Environment Setup
│   ├── US1-taskadheranceeval.md            # TaskAdherenceEvaluator
│   ├── US2-retrievalevaluator.md           # Retrieval Evaluation with Built-in Evaluators
│   ├── US3-customevaluator.md              # Creating a Custom Evaluator
│   └── US4-meta-prompt.md                  # Meta-Prompt Improvement Loop
├── infra/
│   ├── scripts/                            # Infrastructure scripts
│   └── seed/                               # Seed data for PostgreSQL
├── src/
│   ├── AgentEvalsWorkshop/                 # Main agent service
│   │   ├── Agents/                         # Agent implementations
│   │   ├── Retrieval/                      # Vector retrieval logic
│   │   └── Tools/                          # Agent tools
│   ├── AgentEvalsWorkshop.AppHost/         # Aspire orchestrator
│   └── AgentEvalsWorkshop.ServiceDefaults/ # Shared configuration
├── tests/
│   └── AgentEvalsWorkshop.Tests/           # Evaluation tests
└── TestResults/                            # Test output and reports
```
The application uses standard ASP.NET Core configuration. Key settings:
```json
{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.AspNetCore": "Warning"
    }
  }
}
```

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.