
Agent Evaluation & Testing Platform


A comprehensive testing and evaluation framework for AI agents. This platform provides automated testing, quality evaluation, performance benchmarking, and A/B testing capabilities to ensure your agents are production-ready.

Overview

Building AI agents is one thing; making sure they actually work well in production is another. This platform gives you the tools to close that gap:

  • Automated Testing: Run comprehensive test suites against your agents
  • Multi-Provider Support: Use OpenAI, Anthropic, or Google (Gemini) as your evaluation engine
  • Quality Evaluation: Use LLM-as-a-Judge to score responses across multiple criteria
  • Performance Metrics: Track latency, cost, consistency, and more
  • Safety Checks: Detect harmful content, bias, and privacy violations
  • Benchmarking: Establish baselines and track performance over time
  • A/B Testing: Compare different models or configurations statistically
  • Regression Testing: Detect performance regressions automatically

Dashboard Preview

Dashboard Overview: the main dashboard provides a high-level overview of all test suites and their recent performance.

Key Features

Multi-Metric Evaluation

Evaluate agent responses across multiple dimensions:

  • Accuracy: Correctness of information
  • Relevance: How well it addresses the query
  • Completeness: Coverage of required information
  • Clarity: Ease of understanding
  • Helpfulness: Practical utility
  • Safety: Content safety checks
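
For a rough sense of how these criteria might be consumed in code, here is a sketch. The LLMJudgeEvaluator constructor arguments match the Quick Start below, but the evaluate() call and the dictionary it returns are assumptions for illustration, not a documented API.

import os

from backend.evaluators import LLMJudgeEvaluator

judge = LLMJudgeEvaluator(
    provider="openai",
    model_name="gpt-4",
    api_key=os.getenv("OPENAI_API_KEY")
)

# Hypothetical call: assumes evaluate() returns a dict of criterion -> score (0-10)
scores = judge.evaluate(
    query="What is AI?",
    response="AI is the field of building systems that perform tasks normally requiring human intelligence."
)

for criterion in ("accuracy", "relevance", "completeness", "clarity", "helpfulness", "safety"):
    print(f"{criterion}: {scores.get(criterion)}")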

Comprehensive Testing

  • Test case management with JSON/YAML support
  • Batch execution of test suites
  • Pass/fail determination based on expected metrics
  • Detailed test result reporting
  • Test result history and storage
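
Test suites can be written in YAML as well as JSON. Here is a minimal loading sketch, assuming the YAML file mirrors the JSON schema from the Quick Start and that PyYAML is installed (neither is confirmed by this README):

import yaml  # PyYAML, assumed available for this sketch

from backend.models import TestSuite

# Assumes the YAML file uses the same fields as the JSON example in the Quick Start
with open("tests/test_cases/my_test_suite.yaml") as f:  # illustrative path
    test_suite_data = yaml.safe_load(f)

test_suite = TestSuite(**test_suite_data)
print(f"Loaded {len(test_suite.test_cases)} test cases from YAML")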

Performance Tracking

  • Latency measurement
  • Cost estimation (token usage)
  • Consistency scoring
  • Performance trends over time
  • Degradation detection
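
The MetricCalculator handles these measurements inside the platform; as a standalone illustration of what latency and token-cost tracking involve (the per-1k-token prices below are placeholders, not real rates):

import time

def timed_call(agent_function, query):
    # Measure wall-clock latency of a single agent call
    start = time.perf_counter()
    response = agent_function(query)
    latency_s = time.perf_counter() - start
    return response, latency_s

def estimate_cost(prompt_tokens, completion_tokens, price_per_1k_in=0.01, price_per_1k_out=0.03):
    # Placeholder prices: substitute your provider's actual per-1k-token rates
    return (prompt_tokens / 1000) * price_per_1k_in + (completion_tokens / 1000) * price_per_1k_out

response, latency = timed_call(lambda q: "stub response", "What is AI?")
print(f"latency={latency:.3f}s, estimated cost=${estimate_cost(120, 80):.4f}")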

Benchmarking

  • Create performance baselines
  • Compare different models
  • Track performance over time
  • Identify best configurations

A/B Testing

  • Compare agent configurations
  • Statistical significance testing
  • Effect size calculation
  • Winner determination

A/B Testing Comparison: a detailed statistical comparison between two agent variants.
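
The statistics behind such a comparison are simple enough to sketch. Below is an illustration of significance testing, effect size, and winner selection over per-test scores, written independently of the platform's statistical_analyzer module (whose interface is not shown in this README); scipy is assumed to be available for the sketch:

from statistics import mean, stdev

from scipy import stats  # assumed available for this illustration

def compare_variants(scores_a, scores_b, alpha=0.05):
    # Welch's t-test: is the difference in mean score statistically significant?
    _, p_value = stats.ttest_ind(scores_a, scores_b, equal_var=False)

    # Cohen's d with a pooled standard deviation as a rough effect size
    pooled_sd = ((stdev(scores_a) ** 2 + stdev(scores_b) ** 2) / 2) ** 0.5
    effect_size = (mean(scores_a) - mean(scores_b)) / pooled_sd

    winner = "A" if mean(scores_a) > mean(scores_b) else "B"
    return {"p_value": p_value, "effect_size": effect_size,
            "winner": winner if p_value < alpha else None}

print(compare_variants([7.1, 7.4, 6.9, 7.8, 7.2], [6.5, 6.8, 7.0, 6.4, 6.9]))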

Regression Testing

  • Compare against baselines
  • Detect performance drops
  • Generate regression reports
  • Flag tests needing attention
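
The regression_tester module automates this; the core idea can be illustrated in plain Python: compare current per-metric scores against a stored baseline and flag anything that dropped by more than a tolerance. The data shapes and threshold here are assumptions for illustration only:

def find_regressions(baseline, current, tolerance=0.5):
    # Flag metrics that dropped more than `tolerance` points below the baseline
    regressions = {}
    for metric, baseline_score in baseline.items():
        current_score = current.get(metric)
        if current_score is not None and baseline_score - current_score > tolerance:
            regressions[metric] = {"baseline": baseline_score, "current": current_score}
    return regressions

baseline_scores = {"accuracy": 8.2, "relevance": 8.5, "clarity": 7.9}
current_scores = {"accuracy": 7.4, "relevance": 8.6, "clarity": 7.8}
print(find_regressions(baseline_scores, current_scores))
# -> {'accuracy': {'baseline': 8.2, 'current': 7.4}}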

Installation

# Clone the repository
git clone <repository-url>
cd agent_evaluation_platform

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
export GEMINI_API_KEY="your-gemini-key"

Quick Start

1. Create a Test Suite

Create a JSON file with your test cases:

{
  "id": "my_test_suite",
  "name": "My Test Suite",
  "description": "Tests for my agent",
  "test_cases": [
    {
      "id": "test_001",
      "name": "Basic Question",
      "query": "What is AI?",
      "expected_metrics": {
        "accuracy": 7.0,
        "relevance": 7.0
      }
    }
  ]
}

2. Define Your Agent Function

def my_agent(query: str) -> str:
    # Replace this stub with your agent's implementation
    agent_response = f"Echo: {query}"  # placeholder response
    return agent_response

3. Run Tests

import os

from backend.test_runner import BatchRunner
from backend.evaluators import LLMJudgeEvaluator, MetricCalculator, SafetyChecker
from backend.models import TestSuite

# Load test suite (test_suite_data is the dict parsed from your JSON file in step 1)
test_suite = TestSuite(**test_suite_data)

# Create evaluators with your preferred provider (openai, anthropic, or google)
llm_judge = LLMJudgeEvaluator(
    provider="openai", 
    model_name="gpt-4", 
    api_key=os.getenv("OPENAI_API_KEY")
)
metric_calculator = MetricCalculator()
safety_checker = SafetyChecker(
    provider="openai", 
    model_name="gpt-4", 
    api_key=os.getenv("OPENAI_API_KEY")
)

# Create runner
runner = BatchRunner(
    agent_function=my_agent,
    llm_judge=llm_judge,
    metric_calculator=metric_calculator,
    safety_checker=safety_checker
)

# Run tests
results = runner.run_suite(test_suite)
print(f"Passed: {results.passed}/{results.total_tests}")

4. Use the Dashboard

streamlit run frontend/streamlit_app.py

Open your browser to http://localhost:8501 to access the interactive dashboard.

Docker Support

You can also run the entire platform using Docker:

# Build and start the container
docker-compose up --build

The dashboard will be available at http://localhost:8501. Make sure to populate your .env file with the necessary API keys before running.

Project Structure

agent_evaluation_platform/
β”œβ”€β”€ backend/
β”‚   β”œβ”€β”€ evaluators/          # Evaluation modules
β”‚   β”‚   β”œβ”€β”€ llm_judge.py     # LLM-as-a-Judge evaluator
β”‚   β”‚   β”œβ”€β”€ metric_calculator.py  # Performance metrics
β”‚   β”‚   └── safety_checker.py     # Safety checks
β”‚   β”œβ”€β”€ test_runner/         # Test execution
β”‚   β”‚   β”œβ”€β”€ test_executor.py
β”‚   β”‚   β”œβ”€β”€ batch_runner.py
β”‚   β”‚   └── regression_tester.py
β”‚   β”œβ”€β”€ benchmarking/        # Benchmarking tools
β”‚   β”‚   β”œβ”€β”€ baseline_manager.py
β”‚   β”‚   β”œβ”€β”€ model_comparator.py
β”‚   β”‚   └── performance_tracker.py
β”‚   β”œβ”€β”€ ab_testing/          # A/B testing
β”‚   β”‚   β”œβ”€β”€ experiment_manager.py
β”‚   β”‚   └── statistical_analyzer.py
β”‚   β”œβ”€β”€ storage/             # Data storage
β”‚   β”‚   β”œβ”€β”€ test_results_db.py
β”‚   β”‚   └── metrics_storage.py
β”‚   └── models.py            # Data models
β”œβ”€β”€ frontend/
β”‚   └── streamlit_app.py     # Dashboard UI
β”œβ”€β”€ examples/                # Example scripts
β”œβ”€β”€ tests/
β”‚   └── test_cases/          # Test case files
└── docs/                    # Documentation

Usage Examples

Basic Test Execution

See examples/simple_test.py for a basic example.

Loading Test Suites

See examples/load_test_suite.py for loading test suites from JSON files.
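
For reference, a minimal version of what that example likely does (the file path is illustrative):

import json

from backend.models import TestSuite

with open("tests/test_cases/my_test_suite.json") as f:  # point this at your own suite
    test_suite_data = json.load(f)

test_suite = TestSuite(**test_suite_data)
print(f"{test_suite.name}: {len(test_suite.test_cases)} test cases")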

Creating Baselines

from backend.benchmarking import BaselineManager
from backend.storage import TestResultsDB

results_db = TestResultsDB()
baseline_manager = BaselineManager(results_db)

# Create baseline from test results (supports any provider/model)
# test_results is the result object returned by BatchRunner.run_suite() in the Quick Start
baseline = baseline_manager.create_baseline(
    batch_result=test_results,
    name="Production Baseline v1.0",
    model_name="gpt-4"
)

A/B Testing

from backend.ab_testing import ExperimentManager
from backend.models import TestSuite, Experiment

experiment_manager = ExperimentManager()

# Create experiment (compare different models or configurations)
# agent_a and agent_b are agent callables with the same signature as my_agent above
experiment = experiment_manager.create_experiment(
    name="Model Comparison",
    test_suite=test_suite,
    variant_a={"agent_function": agent_a, "provider": "openai", "model_name": "gpt-4"},
    variant_b={"agent_function": agent_b, "provider": "anthropic", "model_name": "claude-3-opus"}
)

# Run experiment
experiment = experiment_manager.run_experiment(experiment, test_suite)
print(f"Winner: {experiment.winner}")

Configuration

Environment Variables

The platform requires API keys for the LLM providers you intend to use for evaluation:

  • OPENAI_API_KEY: Required for OpenAI models (GPT-4, etc.)
  • ANTHROPIC_API_KEY: Required for Anthropic models (Claude, etc.)
  • GEMINI_API_KEY: Required for Google models (Gemini, etc.)
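
A quick sanity check before running evaluations can save a confusing failure later. This is a convenience sketch, not something the platform is documented to do for you:

import os

# Check only the providers you actually plan to use for evaluation
required_keys = ["OPENAI_API_KEY"]  # add ANTHROPIC_API_KEY / GEMINI_API_KEY as needed

missing = [key for key in required_keys if not os.getenv(key)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")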

Model Selection

The platform supports a wide range of models across providers:

  • OpenAI: gpt-4, gpt-4-turbo, gpt-3.5-turbo
  • Anthropic: claude-3-opus, claude-3-sonnet, claude-3-haiku
  • Google: gemini-1.5-pro, gemini-1.5-flash
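
Switching providers only changes the constructor arguments shown in the Quick Start. For example, to evaluate with Claude or Gemini instead of GPT-4 (model names as listed above):

import os

from backend.evaluators import LLMJudgeEvaluator

# Same constructor, different provider and model
anthropic_judge = LLMJudgeEvaluator(
    provider="anthropic",
    model_name="claude-3-opus",
    api_key=os.getenv("ANTHROPIC_API_KEY")
)

gemini_judge = LLMJudgeEvaluator(
    provider="google",
    model_name="gemini-1.5-flash",
    api_key=os.getenv("GEMINI_API_KEY")
)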

Best Practices

Here are some recommendations to help you get the most out of the platform:

  1. Start with Small Test Suites: Don't try to test everything at once. Begin with a few representative test cases that cover your most important scenarios, then expand gradually.

  2. Establish Baselines Early: Before you start making changes to your agents, create a baseline. This gives you something to compare against and helps you catch regressions.

  3. Monitor Trends: Performance can degrade gradually. Track metrics over time to catch issues before they become problems.

  4. Use A/B Testing: When you're not sure which configuration is better, use A/B testing. The statistical analysis will tell you which one actually performs better, not just which one seems better.

  5. Set Realistic Expectations: Not every response will be perfect. Adjust your expected metrics based on what's actually achievable for your use case.

  6. Regular Testing: Make testing part of your workflow. Run tests regularly, especially before releases, to catch issues early.

Tech Stack

  • Python 3.8+: Core language
  • LLM-as-a-Judge: Multi-provider support (OpenAI, Anthropic, Google) via LangChain
  • Streamlit: Interactive dashboard
  • Pydantic: Data validation
  • SQLite/PostgreSQL: Test results storage
  • JSON/YAML: Test case formats

Use Cases

  • Agent Quality Assurance: Ensure agents meet quality standards before deployment
  • Performance Monitoring: Track agent performance over time
  • A/B Testing: Compare different agent configurations statistically
  • Regression Detection: Catch performance regressions automatically
  • Benchmarking: Establish baselines and track improvements

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • LLM-as-a-Judge pattern for quality evaluation
  • Streamlit for the dashboard framework
  • The open-source community for inspiration and tools
