
Agent Evaluation & Testing Platform


A comprehensive testing and evaluation framework for AI agents. This platform provides automated testing, quality evaluation, performance benchmarking, and A/B testing capabilities to ensure your agents are production-ready.

Overview

Building AI agents is one thing; making sure they actually work well in production is another. This platform gives you the tools to close that gap:

  • Automated Testing: Run comprehensive test suites against your agents
  • Multi-Provider Support: Use OpenAI, Anthropic, or Google (Gemini) as your evaluation engine
  • Quality Evaluation: Use LLM-as-a-Judge to score responses across multiple criteria
  • Performance Metrics: Track latency, cost, consistency, and more
  • Safety Checks: Detect harmful content, bias, and privacy violations
  • Benchmarking: Establish baselines and track performance over time
  • A/B Testing: Compare different models or configurations statistically
  • Regression Testing: Detect performance regressions automatically

Dashboard Preview

Dashboard Overview: the main dashboard provides a high-level overview of all test suites and their recent performance.

Key Features

Multi-Metric Evaluation

Evaluate agent responses across multiple dimensions:

  • Accuracy: Correctness of information
  • Relevance: How well it addresses the query
  • Completeness: Coverage of required information
  • Clarity: Ease of understanding
  • Helpfulness: Practical utility
  • Safety: Content safety checks
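
For a rough sense of how these criteria might be consumed in code, here is a sketch. The LLMJudgeEvaluator constructor arguments match the Quick Start below, but the evaluate() call and the dictionary it returns are assumptions for illustration, not a documented API.

import os

from backend.evaluators import LLMJudgeEvaluator

judge = LLMJudgeEvaluator(
    provider="openai",
    model_name="gpt-4",
    api_key=os.getenv("OPENAI_API_KEY")
)

# Hypothetical call: assumes evaluate() returns a dict of criterion -> score (0-10)
scores = judge.evaluate(
    query="What is AI?",
    response="AI is the field of building systems that perform tasks normally requiring human intelligence."
)

for criterion in ("accuracy", "relevance", "completeness", "clarity", "helpfulness", "safety"):
    print(f"{criterion}: {scores.get(criterion)}")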

Comprehensive Testing

  • Test case management with JSON/YAML support
  • Batch execution of test suites
  • Pass/fail determination based on expected metrics
  • Detailed test result reporting
  • Test result history and storage
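
Test suites can be written in YAML as well as JSON. Here is a minimal loading sketch, assuming the YAML file mirrors the JSON schema from the Quick Start and that PyYAML is installed (neither is confirmed by this README):

import yaml  # PyYAML, assumed available for this sketch

from backend.models import TestSuite

# Assumes the YAML file uses the same fields as the JSON example in the Quick Start
with open("tests/test_cases/my_test_suite.yaml") as f:  # illustrative path
    test_suite_data = yaml.safe_load(f)

test_suite = TestSuite(**test_suite_data)
print(f"Loaded {len(test_suite.test_cases)} test cases from YAML")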

Performance Tracking

  • Latency measurement
  • Cost estimation (token usage)
  • Consistency scoring
  • Performance trends over time
  • Degradation detection
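
The MetricCalculator handles these measurements inside the platform; as a standalone illustration of what latency and token-cost tracking involve (the per-1k-token prices below are placeholders, not real rates):

import time

def timed_call(agent_function, query):
    # Measure wall-clock latency of a single agent call
    start = time.perf_counter()
    response = agent_function(query)
    latency_s = time.perf_counter() - start
    return response, latency_s

def estimate_cost(prompt_tokens, completion_tokens, price_per_1k_in=0.01, price_per_1k_out=0.03):
    # Placeholder prices: substitute your provider's actual per-1k-token rates
    return (prompt_tokens / 1000) * price_per_1k_in + (completion_tokens / 1000) * price_per_1k_out

response, latency = timed_call(lambda q: "stub response", "What is AI?")
print(f"latency={latency:.3f}s, estimated cost=${estimate_cost(120, 80):.4f}")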

Benchmarking

  • Create performance baselines
  • Compare different models
  • Track performance over time
  • Identify best configurations

A/B Testing

  • Compare agent configurations
  • Statistical significance testing
  • Effect size calculation
  • Winner determination

A/B Testing Comparison: a detailed statistical comparison between two agent variants.
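
The statistics behind such a comparison are simple enough to sketch. Below is an illustration of significance testing, effect size, and winner selection over per-test scores, written independently of the platform's statistical_analyzer module (whose interface is not shown in this README); scipy is assumed to be available for the sketch:

from statistics import mean, stdev

from scipy import stats  # assumed available for this illustration

def compare_variants(scores_a, scores_b, alpha=0.05):
    # Welch's t-test: is the difference in mean score statistically significant?
    _, p_value = stats.ttest_ind(scores_a, scores_b, equal_var=False)

    # Cohen's d with a pooled standard deviation as a rough effect size
    pooled_sd = ((stdev(scores_a) ** 2 + stdev(scores_b) ** 2) / 2) ** 0.5
    effect_size = (mean(scores_a) - mean(scores_b)) / pooled_sd

    winner = "A" if mean(scores_a) > mean(scores_b) else "B"
    return {"p_value": p_value, "effect_size": effect_size,
            "winner": winner if p_value < alpha else None}

print(compare_variants([7.1, 7.4, 6.9, 7.8, 7.2], [6.5, 6.8, 7.0, 6.4, 6.9]))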

Regression Testing

  • Compare against baselines
  • Detect performance drops
  • Generate regression reports
  • Flag tests needing attention
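
The regression_tester module automates this; the core idea can be illustrated in plain Python: compare current per-metric scores against a stored baseline and flag anything that dropped by more than a tolerance. The data shapes and threshold here are assumptions for illustration only:

def find_regressions(baseline, current, tolerance=0.5):
    # Flag metrics that dropped more than `tolerance` points below the baseline
    regressions = {}
    for metric, baseline_score in baseline.items():
        current_score = current.get(metric)
        if current_score is not None and baseline_score - current_score > tolerance:
            regressions[metric] = {"baseline": baseline_score, "current": current_score}
    return regressions

baseline_scores = {"accuracy": 8.2, "relevance": 8.5, "clarity": 7.9}
current_scores = {"accuracy": 7.4, "relevance": 8.6, "clarity": 7.8}
print(find_regressions(baseline_scores, current_scores))
# -> {'accuracy': {'baseline': 8.2, 'current': 7.4}}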

Installation

# Clone the repository
git clone <repository-url>
cd agent_evaluation_platform

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
export GEMINI_API_KEY="your-gemini-key"

Quick Start

1. Create a Test Suite

Create a JSON file with your test cases:

{
  "id": "my_test_suite",
  "name": "My Test Suite",
  "description": "Tests for my agent",
  "test_cases": [
    {
      "id": "test_001",
      "name": "Basic Question",
      "query": "What is AI?",
      "expected_metrics": {
        "accuracy": 7.0,
        "relevance": 7.0
      }
    }
  ]
}

2. Define Your Agent Function

def my_agent(query: str) -> str:
    # Replace this stub with your agent's implementation
    agent_response = f"Echo: {query}"  # placeholder response
    return agent_response

3. Run Tests

import os

from backend.test_runner import BatchRunner
from backend.evaluators import LLMJudgeEvaluator, MetricCalculator, SafetyChecker
from backend.models import TestSuite

# Load test suite (test_suite_data is the dict parsed from your JSON file in step 1)
test_suite = TestSuite(**test_suite_data)

# Create evaluators with your preferred provider (openai, anthropic, or google)
llm_judge = LLMJudgeEvaluator(
    provider="openai", 
    model_name="gpt-4", 
    api_key=os.getenv("OPENAI_API_KEY")
)
metric_calculator = MetricCalculator()
safety_checker = SafetyChecker(
    provider="openai", 
    model_name="gpt-4", 
    api_key=os.getenv("OPENAI_API_KEY")
)

# Create runner
runner = BatchRunner(
    agent_function=my_agent,
    llm_judge=llm_judge,
    metric_calculator=metric_calculator,
    safety_checker=safety_checker
)

# Run tests
results = runner.run_suite(test_suite)
print(f"Passed: {results.passed}/{results.total_tests}")

4. Use the Dashboard

streamlit run frontend/streamlit_app.py

Open your browser to http://localhost:8501 to access the interactive dashboard.

Docker Support

You can also run the entire platform using Docker:

# Build and start the container
docker-compose up --build

The dashboard will be available at http://localhost:8501. Make sure to populate your .env file with the necessary API keys before running.

Project Structure

agent_evaluation_platform/
β”œβ”€β”€ backend/
β”‚   β”œβ”€β”€ evaluators/          # Evaluation modules
β”‚   β”‚   β”œβ”€β”€ llm_judge.py     # LLM-as-a-Judge evaluator
β”‚   β”‚   β”œβ”€β”€ metric_calculator.py  # Performance metrics
β”‚   β”‚   └── safety_checker.py     # Safety checks
β”‚   β”œβ”€β”€ test_runner/         # Test execution
β”‚   β”‚   β”œβ”€β”€ test_executor.py
β”‚   β”‚   β”œβ”€β”€ batch_runner.py
β”‚   β”‚   └── regression_tester.py
β”‚   β”œβ”€β”€ benchmarking/        # Benchmarking tools
β”‚   β”‚   β”œβ”€β”€ baseline_manager.py
β”‚   β”‚   β”œβ”€β”€ model_comparator.py
β”‚   β”‚   └── performance_tracker.py
β”‚   β”œβ”€β”€ ab_testing/          # A/B testing
β”‚   β”‚   β”œβ”€β”€ experiment_manager.py
β”‚   β”‚   └── statistical_analyzer.py
β”‚   β”œβ”€β”€ storage/             # Data storage
β”‚   β”‚   β”œβ”€β”€ test_results_db.py
β”‚   β”‚   └── metrics_storage.py
β”‚   └── models.py            # Data models
β”œβ”€β”€ frontend/
β”‚   └── streamlit_app.py     # Dashboard UI
β”œβ”€β”€ examples/                # Example scripts
β”œβ”€β”€ tests/
β”‚   └── test_cases/          # Test case files
└── docs/                    # Documentation

Usage Examples

Basic Test Execution

See examples/simple_test.py for a basic example.

Loading Test Suites

See examples/load_test_suite.py for loading test suites from JSON files.
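
For reference, a minimal version of what that example likely does (the file path is illustrative):

import json

from backend.models import TestSuite

with open("tests/test_cases/my_test_suite.json") as f:  # point this at your own suite
    test_suite_data = json.load(f)

test_suite = TestSuite(**test_suite_data)
print(f"{test_suite.name}: {len(test_suite.test_cases)} test cases")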

Creating Baselines

from backend.benchmarking import BaselineManager
from backend.storage import TestResultsDB

results_db = TestResultsDB()
baseline_manager = BaselineManager(results_db)

# Create baseline from test results (supports any provider/model)
# test_results is the result object returned by BatchRunner.run_suite() in the Quick Start
baseline = baseline_manager.create_baseline(
    batch_result=test_results,
    name="Production Baseline v1.0",
    model_name="gpt-4"
)

A/B Testing

from backend.ab_testing import ExperimentManager
from backend.models import TestSuite, Experiment

experiment_manager = ExperimentManager()

# Create experiment (compare different models or configurations)
# agent_a and agent_b are agent callables with the same signature as my_agent above
experiment = experiment_manager.create_experiment(
    name="Model Comparison",
    test_suite=test_suite,
    variant_a={"agent_function": agent_a, "provider": "openai", "model_name": "gpt-4"},
    variant_b={"agent_function": agent_b, "provider": "anthropic", "model_name": "claude-3-opus"}
)

# Run experiment
experiment = experiment_manager.run_experiment(experiment, test_suite)
print(f"Winner: {experiment.winner}")

Configuration

Environment Variables

The platform requires API keys for the LLM providers you intend to use for evaluation:

  • OPENAI_API_KEY: Required for OpenAI models (GPT-4, etc.)
  • ANTHROPIC_API_KEY: Required for Anthropic models (Claude, etc.)
  • GEMINI_API_KEY: Required for Google models (Gemini, etc.)
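
A quick sanity check before running evaluations can save a confusing failure later. This is a convenience sketch, not something the platform is documented to do for you:

import os

# Check only the providers you actually plan to use for evaluation
required_keys = ["OPENAI_API_KEY"]  # add ANTHROPIC_API_KEY / GEMINI_API_KEY as needed

missing = [key for key in required_keys if not os.getenv(key)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")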

Model Selection

The platform supports a wide range of models across providers:

  • OpenAI: gpt-4, gpt-4-turbo, gpt-3.5-turbo
  • Anthropic: claude-3-opus, claude-3-sonnet, claude-3-haiku
  • Google: gemini-1.5-pro, gemini-1.5-flash
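
Switching providers only changes the constructor arguments shown in the Quick Start. For example, to evaluate with Claude or Gemini instead of GPT-4 (model names as listed above):

import os

from backend.evaluators import LLMJudgeEvaluator

# Same constructor, different provider and model
anthropic_judge = LLMJudgeEvaluator(
    provider="anthropic",
    model_name="claude-3-opus",
    api_key=os.getenv("ANTHROPIC_API_KEY")
)

gemini_judge = LLMJudgeEvaluator(
    provider="google",
    model_name="gemini-1.5-flash",
    api_key=os.getenv("GEMINI_API_KEY")
)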

Best Practices

Here are some recommendations to help you get the most out of the platform:

  1. Start with Small Test Suites: Don't try to test everything at once. Begin with a few representative test cases that cover your most important scenarios, then expand gradually.

  2. Establish Baselines Early: Before you start making changes to your agents, create a baseline. This gives you something to compare against and helps you catch regressions.

  3. Monitor Trends: Performance can degrade gradually. Track metrics over time to catch issues before they become problems.

  4. Use A/B Testing: When you're not sure which configuration is better, use A/B testing. The statistical analysis will tell you which one actually performs better, not just which one seems better.

  5. Set Realistic Expectations: Not every response will be perfect. Adjust your expected metrics based on what's actually achievable for your use case.

  6. Regular Testing: Make testing part of your workflow. Run tests regularly, especially before releases, to catch issues early.

Tech Stack

  • Python 3.8+: Core language
  • LLM-as-a-Judge: Multi-provider support (OpenAI, Anthropic, Google) via LangChain
  • Streamlit: Interactive dashboard
  • Pydantic: Data validation
  • SQLite/PostgreSQL: Test results storage
  • JSON/YAML: Test case formats

Use Cases

  • Agent Quality Assurance: Ensure agents meet quality standards before deployment
  • Performance Monitoring: Track agent performance over time
  • A/B Testing: Compare different agent configurations statistically
  • Regression Detection: Catch performance regressions automatically
  • Benchmarking: Establish baselines and track improvements

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • LLM-as-a-Judge pattern for quality evaluation
  • Streamlit for the dashboard framework
  • The open-source community for inspiration and tools
