Holger Imbery edited this page Feb 23, 2026 · 7 revisions

Multi-Agent Assessment & Judgement — Wiki

Disclaimer: these wiki pages are still a work in progress

An enterprise-grade, multi-agent-aware .NET 9 application for automated testing of Microsoft Copilot Studio agents. Test multiple agents simultaneously across environments, compare performance, evaluate responses with AI Foundry models, generate test cases from documents, and get comprehensive metrics.


Quick Links

  • Getting Started — Prerequisites, installation, first run
  • Quick Start — 5-minute setup guide
  • Setup Wizard — Agent-first guided setup
  • Environment & Agent Discovery — Browse Power Platform environments and import agents
  • Multi-Agent Testing — Testing multiple agents in parallel
  • Architecture — System design and project structure
  • Database Schema — All entities and relationships
  • Configuration Reference — All configuration options
  • Agent Configuration — Per-agent settings
  • Authentication — Entra ID setup and security
  • RBAC and Roles — Roles and permission matrix
  • Judge Evaluation — AI scoring dimensions and weights
  • Judge Prompts — Prompt templates and calibration
  • Document Processing — Upload and chunk documents
  • Question Generation — AI-powered test generation
  • Test Suites and Cases — Creating and managing tests
  • CLI Reference — Command-line interface
  • API Reference — REST API endpoints
  • Deployment — Local, Docker, Azure, Kubernetes
  • Docker Deployment — Containerization guide
  • Troubleshooting — Common issues and fixes

Key Capabilities

  • Multi-Agent Testing — configure agents for dev, staging, and production; run the same suite against all simultaneously
  • Direct Line Integration — WebSocket or polling transport with full conversation lifecycle management
  • Model-as-a-Judge — Azure AI Foundry LLM evaluates responses on 5 dimensions (task success, intent match, factuality, helpfulness, safety)
  • Configurable Judge & Question-Generation Prompts — edit system prompts directly in the UI; per-agent overrides supported; no code changes required
  • Document-Driven Test Generation — upload PDFs, text files, or paste a public HTTP/HTTPS URL; AI generates test cases automatically from any imported content
  • Setup Wizard — guided agent-first onboarding flow
  • CLI for CI/CD — run, list, agents, report, and generate commands; exit codes, JSON/CSV output, dry-run support
  • Microsoft Entra ID Authentication — optional enterprise SSO with Admin / Tester / Viewer roles
  • Backup & Restore — download a full database snapshot from Settings; restore with a single upload
  • OpenAPI & Interactive API Browser — OpenAPI manifest at /openapi/v1.json and Scalar UI at /scalar/v1; importable as a Power Automate custom connector; REST API key auth for CI/CD pipelines
  • Regression Detection — each run report compares results against the previous run for the same suite; regressed test cases are highlighted with a side-by-side judge rationale comparison
  • Pass Rate by Category — run reports break down results by TestCase.Category with colour-coded pass-rate bars per topic group
  • Lightweight Rubric Refinement — when human verdict overrides disagree with the AI judge, a "Refine Rubric" button sends all disagreements to the LLM and returns a proposed rubric update
  • Home Screen Insights — system status badges (Database / DirectLine / AI Judge), pass rate trend sparkline, agent summary cards, top failing test cases, Quick Run shortcut, and run history feed
  • Latency & Confidence Trends — sparkline of median latency over last 10 runs on the Dashboard; per-test-case score history dots (green/amber/red) in expanded run report rows
  • Run History Pruning — configurable retention policy in Settings → Data Management keeps the SQLite database from growing unbounded
  • Local-First — runs entirely on-premises; the only outbound calls are to Direct Line and the configured AI Foundry endpoint
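To make the Model-as-a-Judge capability above concrete, here is a minimal sketch of weighted scoring across the five dimensions the judge evaluates. The dimension names come from this wiki; the weights, the 0–5 score scale, and the pass threshold are illustrative assumptions only — the actual values are configured on the Judge Evaluation page.

```python
# Illustrative sketch: weighted judge score over the five dimensions.
# Weights, scale, and threshold below are ASSUMPTIONS for illustration,
# not the application's real defaults.

# Hypothetical per-dimension weights (sum to 1.0).
WEIGHTS = {
    "task_success": 0.30,
    "intent_match": 0.20,
    "factuality":   0.25,
    "helpfulness":  0.15,
    "safety":       0.10,
}

PASS_THRESHOLD = 3.5  # assumed overall cut-off on a 0-5 scale


def overall_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of per-dimension judge scores."""
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)


def verdict(dimension_scores: dict[str, float]) -> str:
    """Map the weighted score to a pass/fail verdict."""
    return "pass" if overall_score(dimension_scores) >= PASS_THRESHOLD else "fail"


# Example: a response that is strong overall but weak on factuality.
scores = {
    "task_success": 4.0,
    "intent_match": 4.5,
    "factuality":   2.0,
    "helpfulness":  4.0,
    "safety":       5.0,
}
print(round(overall_score(scores), 2), verdict(scores))  # → 3.7 pass
```

A weighted average like this is one common way to combine per-dimension LLM scores into a single verdict; the per-agent rubric overrides mentioned above would correspond to swapping out the weight table.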

Project Structure

CopilotStudioTestRunner.Domain   — Entities, configuration models
CopilotStudioTestRunner.Data     — EF Core DbContext, SQLite
CopilotStudioTestRunner.Core     — Services (Judge, Execution, DirectLine, Documents)
CopilotStudioTestRunner.WebUI    — Blazor Server UI + REST API
CopilotStudioTestRunner.CLI      — Command-line interface
CopilotStudioTestRunner.Tests    — Unit, Integration, End-to-End tests

Contributing

Contributions are welcome!

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/my-improvement)
  3. Make your changes and add tests where applicable
  4. Submit a pull request

Please follow existing code style and ensure all tests pass before submitting.


License

MIT © 2026 Holger Imbery
