A comprehensive testing and evaluation framework for AI agents. This platform provides automated testing, quality evaluation, performance benchmarking, and A/B testing capabilities to ensure your agents are production-ready.
Building AI agents is one thing, but making sure they actually work well in production is another. This platform gives you the tools you need to:
- Automated Testing: Run comprehensive test suites against your agents
- Multi-Provider Support: Use OpenAI, Anthropic, or Google (Gemini) as your evaluation engine
- Quality Evaluation: Use LLM-as-a-Judge to score responses across multiple criteria
- Performance Metrics: Track latency, cost, consistency, and more
- Safety Checks: Detect harmful content, bias, and privacy violations
- Benchmarking: Establish baselines and track performance over time
- A/B Testing: Compare different models or configurations statistically
- Regression Testing: Detect performance regressions automatically
The main dashboard provides a high-level overview of all test suites and their recent performance.
Evaluate agent responses across multiple dimensions:
- Accuracy: Correctness of information
- Relevance: How well it addresses the query
- Completeness: Coverage of required information
- Clarity: Ease of understanding
- Helpfulness: Practical utility
- Safety: Content safety checks
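To make the dimensions concrete, here is a rough sketch of scoring a single response with the LLM judge. The constructor arguments mirror the Quick Start example below; the `evaluate` call and the shape of its result are assumptions for illustration, not the documented API.

```python
import os

from backend.evaluators import LLMJudgeEvaluator

judge = LLMJudgeEvaluator(
    provider="openai",
    model_name="gpt-4",
    api_key=os.getenv("OPENAI_API_KEY"),
)

# Hypothetical call: score one query/response pair across the dimensions above.
scores = judge.evaluate(
    query="What is AI?",
    response="AI is the study of building systems that perform tasks normally requiring human intelligence.",
)

# Assumed result shape: one 0-10 score per dimension, e.g.
# {"accuracy": 8.5, "relevance": 9.0, "completeness": 6.5, "clarity": 9.0, "helpfulness": 8.0, "safety": 10.0}
print(scores)
```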
- Test case management with JSON/YAML support
- Batch execution of test suites
- Pass/fail determination based on expected metrics
- Detailed test result reporting
- Test result history and storage
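As a minimal sketch of the JSON/YAML loading idea (assuming PyYAML is installed and that `TestSuite` accepts the same fields as the JSON example shown in the Quick Start below; the path is hypothetical):

```python
import json
from pathlib import Path

import yaml  # PyYAML; assumed available for YAML test cases

from backend.models import TestSuite


def load_suite(path: str) -> TestSuite:
    """Load a test suite from a .json or .yaml/.yml file into the TestSuite model."""
    text = Path(path).read_text()
    data = yaml.safe_load(text) if path.endswith((".yaml", ".yml")) else json.loads(text)
    return TestSuite(**data)


suite = load_suite("tests/test_cases/my_test_suite.json")  # hypothetical file name
```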
- Latency measurement
- Cost estimation (token usage)
- Consistency scoring
- Performance trends over time
- Degradation detection
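The snippet below is a hedged sketch of the kind of measurement the metric calculator automates: wall-clock latency plus a rough token-based cost estimate. The price and the token heuristic are illustrative, not the platform's actual accounting.

```python
import time


def measure(agent_function, query: str, cost_per_1k_tokens: float = 0.03):
    """Time a single agent call and roughly estimate its cost from response length."""
    start = time.perf_counter()
    response = agent_function(query)
    latency_s = time.perf_counter() - start

    approx_tokens = int(len(response.split()) * 1.3)  # crude estimate, not a real tokenizer
    estimated_cost = approx_tokens / 1000 * cost_per_1k_tokens
    return response, latency_s, estimated_cost
```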
- Create performance baselines
- Compare different models
- Track performance over time
- Identify best configurations
- Compare agent configurations
- Statistical significance testing
- Effect size calculation
- Winner determination
Detailed statistical comparison between two agent variants.
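For intuition, this is roughly the statistics behind such a comparison: a Welch's t-test for significance and Cohen's d for effect size. The per-test scores are made up for illustration, SciPy is used here purely for the sketch, and the platform's `statistical_analyzer` may differ in detail.

```python
from statistics import mean, stdev

from scipy import stats

# Illustrative per-test quality scores for two variants.
scores_a = [7.5, 8.0, 6.5, 7.0, 8.5, 7.5, 7.0, 8.0]
scores_b = [8.0, 8.5, 7.5, 8.0, 9.0, 8.5, 8.0, 8.5]

# Welch's t-test: is the difference in means statistically significant?
t_stat, p_value = stats.ttest_ind(scores_b, scores_a, equal_var=False)

# Cohen's d: how large is the difference, in pooled standard deviation units?
pooled_sd = ((stdev(scores_a) ** 2 + stdev(scores_b) ** 2) / 2) ** 0.5
cohens_d = (mean(scores_b) - mean(scores_a)) / pooled_sd

winner = "variant_b" if p_value < 0.05 and cohens_d > 0 else "no clear winner"
print(f"p={p_value:.3f}, d={cohens_d:.2f} -> {winner}")
```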
- Compare against baselines
- Detect performance drops
- Generate regression reports
- Flag tests needing attention
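The core idea, in a hedged sketch (threshold, metric names, and values are illustrative, not the regression tester's actual logic):

```python
# Average scores from a stored baseline vs. the latest run (illustrative numbers).
baseline = {"accuracy": 8.2, "relevance": 8.5, "helpfulness": 7.9}
current = {"accuracy": 7.6, "relevance": 8.4, "helpfulness": 7.9}

TOLERANCE = 0.05  # flag anything that drops more than 5% below baseline

regressions = {
    metric: (base, current[metric])
    for metric, base in baseline.items()
    if current[metric] < base * (1 - TOLERANCE)
}

for metric, (base, now) in regressions.items():
    print(f"Regression in {metric}: {base:.1f} -> {now:.1f}")
```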
To get started, clone the repository and install the dependencies:

```bash
# Clone the repository
git clone <repository-url>
cd agent_evaluation_platform

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
export GEMINI_API_KEY="your-gemini-key"
```

Create a JSON file with your test cases:
```json
{
  "id": "my_test_suite",
  "name": "My Test Suite",
  "description": "Tests for my agent",
  "test_cases": [
    {
      "id": "test_001",
      "name": "Basic Question",
      "query": "What is AI?",
      "expected_metrics": {
        "accuracy": 7.0,
        "relevance": 7.0
      }
    }
  ]
}
```

Define an agent function that takes a query and returns the agent's response:

```python
def my_agent(query: str) -> str:
    # Your agent implementation
    return agent_response
```

Then run the test suite against your agent:

```python
import json
import os

from backend.test_runner import BatchRunner
from backend.evaluators import LLMJudgeEvaluator, MetricCalculator, SafetyChecker
from backend.models import TestSuite

# Load test suite (the path is wherever you saved the JSON file above)
with open("my_test_suite.json") as f:
    test_suite_data = json.load(f)
test_suite = TestSuite(**test_suite_data)

# Create evaluators with your preferred provider (openai, anthropic, or google)
llm_judge = LLMJudgeEvaluator(
    provider="openai",
    model_name="gpt-4",
    api_key=os.getenv("OPENAI_API_KEY")
)
metric_calculator = MetricCalculator()
safety_checker = SafetyChecker(
    provider="openai",
    model_name="gpt-4",
    api_key=os.getenv("OPENAI_API_KEY")
)

# Create runner
runner = BatchRunner(
    agent_function=my_agent,
    llm_judge=llm_judge,
    metric_calculator=metric_calculator,
    safety_checker=safety_checker
)

# Run tests
results = runner.run_suite(test_suite)
print(f"Passed: {results.passed}/{results.total_tests}")
```

Launch the dashboard:

```bash
streamlit run frontend/streamlit_app.py
```

Open your browser to http://localhost:8501 to access the interactive dashboard.
You can also run the entire platform using Docker:
```bash
# Build and start the container
docker-compose up --build
```

The dashboard will be available at http://localhost:8501. Make sure to populate your `.env` file with the necessary API keys before running.
```
agent_evaluation_platform/
├── backend/
│   ├── evaluators/                 # Evaluation modules
│   │   ├── llm_judge.py            # LLM-as-a-Judge evaluator
│   │   ├── metric_calculator.py    # Performance metrics
│   │   └── safety_checker.py       # Safety checks
│   ├── test_runner/                # Test execution
│   │   ├── test_executor.py
│   │   ├── batch_runner.py
│   │   └── regression_tester.py
│   ├── benchmarking/               # Benchmarking tools
│   │   ├── baseline_manager.py
│   │   ├── model_comparator.py
│   │   └── performance_tracker.py
│   ├── ab_testing/                 # A/B testing
│   │   ├── experiment_manager.py
│   │   └── statistical_analyzer.py
│   ├── storage/                    # Data storage
│   │   ├── test_results_db.py
│   │   └── metrics_storage.py
│   └── models.py                   # Data models
├── frontend/
│   └── streamlit_app.py            # Dashboard UI
├── examples/                       # Example scripts
├── tests/
│   └── test_cases/                 # Test case files
└── docs/                           # Documentation
```
See examples/simple_test.py for a basic example.
See examples/load_test_suite.py for loading test suites from JSON files.
Create a baseline from existing test results:

```python
from backend.benchmarking import BaselineManager
from backend.storage import TestResultsDB

results_db = TestResultsDB()
baseline_manager = BaselineManager(results_db)

# Create baseline from test results (supports any provider/model)
baseline = baseline_manager.create_baseline(
    batch_result=test_results,
    name="Production Baseline v1.0",
    model_name="gpt-4"
)
```

Compare two agent variants with an A/B experiment:

```python
from backend.ab_testing import ExperimentManager
from backend.models import TestSuite, Experiment

experiment_manager = ExperimentManager()

# Create experiment (compare different models or configurations)
experiment = experiment_manager.create_experiment(
    name="Model Comparison",
    test_suite=test_suite,
    variant_a={"agent_function": agent_a, "provider": "openai", "model_name": "gpt-4"},
    variant_b={"agent_function": agent_b, "provider": "anthropic", "model_name": "claude-3-opus"}
)

# Run experiment
experiment = experiment_manager.run_experiment(experiment, test_suite)
print(f"Winner: {experiment.winner}")
```

The platform requires API keys for the LLM providers you intend to use for evaluation:
- `OPENAI_API_KEY`: Required for OpenAI models (GPT-4, etc.)
- `ANTHROPIC_API_KEY`: Required for Anthropic models (Claude, etc.)
- `GEMINI_API_KEY`: Required for Google models (Gemini, etc.)
The platform supports a wide range of models across providers:
- OpenAI: `gpt-4`, `gpt-4-turbo`, `gpt-3.5-turbo`
- Anthropic: `claude-3-opus`, `claude-3-sonnet`, `claude-3-haiku`
- Google: `gemini-1.5-pro`, `gemini-1.5-flash`
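For example, switching the judge to a different provider only changes the constructor arguments (a sketch reusing the constructor shown in the Quick Start; `claude-3-sonnet` is just one of the supported options listed above):

```python
import os

from backend.evaluators import LLMJudgeEvaluator

# Same evaluator, different provider and model.
judge = LLMJudgeEvaluator(
    provider="anthropic",
    model_name="claude-3-sonnet",
    api_key=os.getenv("ANTHROPIC_API_KEY"),
)
```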
Here are some recommendations to help you get the most out of the platform:
- Start with Small Test Suites: Don't try to test everything at once. Begin with a few representative test cases that cover your most important scenarios, then expand gradually.
- Establish Baselines Early: Before you start making changes to your agents, create a baseline. This gives you something to compare against and helps you catch regressions.
- Monitor Trends: Performance can degrade slowly. Track metrics over time to catch issues before they become problems.
- Use A/B Testing: When you're not sure which configuration is better, use A/B testing. The statistical analysis will tell you which one actually performs better, not just which one seems better.
- Set Realistic Expectations: Not every response will be perfect. Adjust your expected metrics based on what's actually achievable for your use case.
- Regular Testing: Make testing part of your workflow. Run tests regularly, especially before releases, to catch issues early.
- Python 3.8+: Core language
- LLM-as-a-Judge: Multi-provider support (OpenAI, Anthropic, Google) via LangChain
- Streamlit: Interactive dashboard
- Pydantic: Data validation
- SQLite/PostgreSQL: Test results storage
- JSON/YAML: Test case formats
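For a rough idea of the shape of the data models (a Pydantic sketch based on the JSON example in the Quick Start, not the actual contents of `backend/models.py`):

```python
from typing import Dict, List, Optional

from pydantic import BaseModel


class TestCase(BaseModel):
    id: str
    name: str
    query: str
    expected_metrics: Dict[str, float] = {}


class TestSuite(BaseModel):
    id: str
    name: str
    description: Optional[str] = None
    test_cases: List[TestCase] = []
```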
- Agent Quality Assurance: Ensure agents meet quality standards before deployment
- Performance Monitoring: Track agent performance over time
- A/B Testing: Compare different agent configurations statistically
- Regression Detection: Catch performance regressions automatically
- Benchmarking: Establish baselines and track improvements
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- LLM-as-a-Judge pattern for quality evaluation
- Streamlit for the dashboard framework
- The open-source community for inspiration and tools