Production-grade platform for evaluating, observing, and optimizing Generative AI pipelines
Quick Start • Documentation • Architecture • Demo
EvalStack is a comprehensive platform designed to evaluate, observe, and optimize Generative AI pipelines including LLMs, RAG systems, and AI agents. Built with production-grade architecture, it provides real-time monitoring, automated optimization recommendations, and scalable evaluation capabilities.
- Comprehensive Evaluation: Multiple evaluation adapters (RAGAS, DeepEval, OpenAI Evals, AgentEvals)
- Real-time Monitoring: Prometheus metrics, Grafana dashboards, and OpenTelemetry tracing
- Optimization Agent: Rule-based and LLM recommendations with A/B testing capabilities
- Scalable Architecture: Kubernetes-ready with horizontal scaling and auto-scaling
- Production Ready: CI/CD pipeline, comprehensive testing, and operational procedures
- Developer Friendly: OpenAPI documentation, SDKs, and comprehensive tooling
```bash
# Clone repository
git clone https://github.com/evalstack/evalstack.git
cd evalstack
# Start all services
docker-compose -f infra/docker-compose/docker-compose.yml up -d
# Wait for services to be ready (2-3 minutes)
docker-compose logs -f
# Access services
# Frontend: http://localhost:3000
# Backend API: http://localhost:8000
# API Docs: http://localhost:8000/docs
# Grafana: http://localhost:3000 (admin/admin)
```
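Once the containers are healthy, you can exercise the API from Python. A minimal sketch using `requests`; the `/health` and `/evaluations` paths and the payload shape are assumptions for illustration, so treat the live OpenAPI docs at http://localhost:8000/docs as the source of truth.

```python
import requests

BASE_URL = "http://localhost:8000"

# Check that the backend is up (endpoint path is an assumption; see /docs)
resp = requests.get(f"{BASE_URL}/health", timeout=10)
print(resp.status_code, resp.json())

# Submit a single evaluation (payload shape is hypothetical)
payload = {
    "adapter": "ragas",
    "inputs": [
        {
            "question": "What does EvalStack evaluate?",
            "answer": "LLM, RAG, and agent pipelines.",
            "contexts": ["EvalStack evaluates GenAI pipelines."],
        }
    ],
}
resp = requests.post(f"{BASE_URL}/evaluations", json=payload, timeout=30)
print(resp.json())
```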
```bash
# Deploy to local Kubernetes (kind/minikube)
./infra/deploy_demo.sh --local
# Deploy to cloud (AWS/GCP/Azure)
./infra/deploy_demo.sh --cloud --image-tag v1.0.0
# Deploy with Helm
helm repo add evalstack https://charts.evalstack.dev
helm install evalstack evalstack/evalstack
```
```bash
# Run comprehensive demo
./demo.sh
# Run specific demo sections
./demo.sh --api-only
./demo.sh --frontend-only
./demo.sh --monitoring
```

EvalStack follows a microservices architecture designed for scalability and reliability:

```mermaid
graph TB
subgraph "Frontend Layer"
FE[Next.js Dashboard<br/>Port: 3000]
end
subgraph "API Layer"
API[FastAPI Backend<br/>Port: 8000]
end
subgraph "Processing Layer"
WORKER[Celery Workers<br/>Async Processing]
end
subgraph "Data Layer"
DB[(PostgreSQL<br/>Port: 5432)]
REDIS[(Redis Cache<br/>Port: 6379)]
MQ[RabbitMQ<br/>Port: 5672]
end
subgraph "Monitoring Layer"
PROM[Prometheus<br/>Metrics Collection]
GRAF[Grafana<br/>Dashboards]
end
FE --> API
API --> DB
API --> REDIS
API --> MQ
MQ --> WORKER
WORKER --> DB
WORKER --> REDIS
API --> PROM
PROM --> GRAF
```
- Frontend: Next.js dashboard with real-time updates and interactive visualizations
- Backend: FastAPI with comprehensive evaluation APIs and OpenAPI documentation
- Workers: Celery async processing for evaluation tasks with horizontal scaling (see the sketch after this list)
- Database: PostgreSQL with optimized schemas and connection pooling
- Cache: Redis for session management and result caching
- Message Queue: RabbitMQ for reliable task distribution
- Monitoring: Prometheus + Grafana observability stack with custom metrics
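To make the flow above concrete, here is a rough sketch of how the API could hand work to a Celery worker and how the worker caches results in Redis. The task name and payload are hypothetical, not EvalStack's actual code; only the broker and cache ports match the docker-compose defaults above.

```python
import json

import redis
from celery import Celery

# RabbitMQ as the broker (port 5672), Redis as the result cache (port 6379)
app = Celery("evalstack", broker="amqp://guest:guest@localhost:5672//")
cache = redis.Redis(host="localhost", port=6379, db=0)

@app.task(name="evaluations.run")  # hypothetical task name
def run_evaluation(evaluation_id: str, payload: dict) -> dict:
    # ...run the configured evaluation adapter against the payload...
    result = {"evaluation_id": evaluation_id, "scores": {"faithfulness": 0.92}}
    # Cache so the API can serve the result without a database round-trip
    cache.set(f"evaluation:{evaluation_id}", json.dumps(result), ex=3600)
    return result
```

The API enqueues with `run_evaluation.delay(...)` and returns immediately; the worker persists and caches the result when it finishes.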
Experience EvalStack with our interactive demo:
```bash
# Start the demo
./demo.sh
```

The demo showcases:
- API endpoints and evaluation submission
- Real-time monitoring and metrics
- Optimization recommendations
- Frontend dashboard
- Performance benchmarks
EvalStack includes comprehensive sample data:
- 5 Sample Evaluations: Real evaluation results with metrics
- 6 Optimization Recommendations: AI-generated improvement suggestions
- Performance Benchmarks: Load and stress test results
- Architecture Diagrams: Visual system overview
- API Reference - Complete API documentation with examples
- Deployment Guide - Production deployment instructions
- Operational Runbook - Operations and troubleshooting
- Architecture - System design and components
- Resume Bullets - Technical achievements and impact
- Benchmark Results - Performance test results
- Demo Data - Sample evaluations and recommendations
EvalStack delivers production-grade performance:
| Metric | EvalStack | Industry Average |
|---|---|---|
| Response Time (P95) | < 2s | 2.5s |
| Throughput | 100+ req/min | 80 req/min |
| Availability | 99.9% | 99.0% |
| Error Rate | < 1% | 2% |
- Load Test: 100 concurrent users, 99.2% success rate
- Stress Test: 300 concurrent users, 94.1% success rate
- Scalability: Horizontal scaling to 500+ users
- Reliability: 99.9% uptime with automated recovery
- Multi-Adapter Support: RAGAS, DeepEval, OpenAI Evals, AgentEvals (see the adapter sketch after this list)
- Comprehensive Metrics: Factual accuracy, hallucination detection, context precision
- Real-time Processing: Async evaluation with progress tracking
- Batch Operations: Support for bulk evaluation processing
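What "multi-adapter" means in practice: every evaluation backend is wrapped behind one small interface, so the rest of the pipeline never cares which library scored a sample. A sketch under assumed names; the real `adapters/` package will differ.

```python
from abc import ABC, abstractmethod

class EvaluationAdapter(ABC):
    """Hypothetical common interface for evaluation backends."""

    name: str

    @abstractmethod
    def evaluate(self, question: str, answer: str, contexts: list[str]) -> dict[str, float]:
        """Return a mapping of metric name to score for one sample."""

class RagasAdapter(EvaluationAdapter):
    name = "ragas"

    def evaluate(self, question, answer, contexts):
        # A real adapter would call the ragas library here;
        # the scores below are placeholders.
        return {"faithfulness": 0.9, "context_precision": 0.8}

def run_batch(adapter: EvaluationAdapter, samples: list[dict]) -> list[dict[str, float]]:
    # Batch operations reduce to mapping one adapter over many samples
    return [adapter.evaluate(s["question"], s["answer"], s["contexts"]) for s in samples]
```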
- Custom Metrics: Business KPIs and system performance indicators (see the export sketch after this list)
- Real-time Dashboards: Grafana dashboards with interactive visualizations
- Alerting: Intelligent thresholds with escalation procedures
- Distributed Tracing: OpenTelemetry for request flow analysis
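Custom metrics like these are typically exported with the standard `prometheus_client` library; the metric names below are illustrative, not the ones EvalStack registers.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names, not EvalStack's actual ones
EVALUATIONS_TOTAL = Counter(
    "evaluations_total", "Evaluations processed", ["adapter", "status"]
)
EVALUATION_SECONDS = Histogram(
    "evaluation_duration_seconds", "Time spent per evaluation", ["adapter"]
)

def process_one(adapter: str) -> None:
    with EVALUATION_SECONDS.labels(adapter=adapter).time():
        time.sleep(random.uniform(0.05, 0.2))  # stand-in for real work
    EVALUATIONS_TOTAL.labels(adapter=adapter, status="success").inc()

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes http://localhost:9100/metrics
    while True:
        process_one("ragas")
```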
- Rule-based Recommendations: Automated performance improvement suggestions (see the sketch after this list)
- A/B Testing: Built-in experimentation framework
- ML-powered Insights: Predictive analytics and trend analysis
- Continuous Optimization: Automated pipeline improvement
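At its simplest, the rule-based layer is a set of thresholds over recent metrics. A minimal sketch; the metric names, thresholds, and wording are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Recommendation:
    rule: str
    message: str

# Metric names and thresholds are illustrative assumptions
RULES = [
    ("high_latency", lambda m: m.get("p95_latency_s", 0.0) > 2.0,
     "P95 latency above 2s: consider caching retrieval results or scaling workers."),
    ("low_faithfulness", lambda m: m.get("faithfulness", 1.0) < 0.7,
     "Faithfulness below 0.7: tighten retrieval filters or add grounding checks."),
    ("high_error_rate", lambda m: m.get("error_rate", 0.0) > 0.01,
     "Error rate above 1%: inspect recent traces for failing adapter calls."),
]

def recommend(metrics: dict[str, float]) -> list[Recommendation]:
    return [Recommendation(rule, msg) for rule, check, msg in RULES if check(metrics)]

print(recommend({"p95_latency_s": 2.4, "faithfulness": 0.65, "error_rate": 0.002}))
```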
- Security: Network policies, secrets management, audit logging
- Scalability: Auto-scaling, load balancing, resource optimization
- Reliability: Circuit breakers, health checks, automated recovery (see the circuit-breaker sketch after this list)
- Compliance: Data protection, audit trails, governance frameworks
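As one example of the reliability patterns above, a minimal circuit breaker stops hammering a failing downstream dependency and retries after a cooldown. This is a sketch of the pattern, not EvalStack's implementation:

```python
import time

class CircuitBreaker:
    """Open after repeated failures; allow a trial call after a cooldown."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: downstream unavailable")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

In production this usually lives in a shared client library or a service mesh rather than being hand-rolled per call site.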
```bash
# Install dependencies
pip install -r backend/requirements.txt
cd frontend && npm install
# Run tests
pytest tests/
npm test
# Run benchmarks
./benchmarks/scripts/run_benchmarks.sh
# Code quality
black backend/ worker/ adapters/
ruff check backend/ worker/ adapters/
mypy backend/ worker/ adapters/
```

- Unit Tests: 85%+ coverage with pytest
- Integration Tests: API endpoint testing (see the sketch after this list)
- Performance Tests: k6 load and stress testing
- End-to-End Tests: Complete workflow validation
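An API-level test usually follows the standard FastAPI pattern with `TestClient`. The app and route below are stand-ins; in the repo you would import the real application instead.

```python
from fastapi import FastAPI
from fastapi.testclient import TestClient

# Stand-in app; replace with the real FastAPI instance from backend/
app = FastAPI()

@app.get("/health")
def health() -> dict:
    return {"status": "ok"}

client = TestClient(app)

def test_health_endpoint():
    resp = client.get("/health")
    assert resp.status_code == 200
    assert resp.json() == {"status": "ok"}
```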
- Automated Testing: Unit, integration, and performance tests
- Security Scanning: Vulnerability detection with Trivy
- Multi-environment Deployment: Staging and production pipelines
- Release Automation: Automated versioning and artifact publishing
We welcome contributions! Please see our Contributing Guide for details.
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
Found a bug? Please open an issue with:
- Clear description of the problem
- Steps to reproduce
- Expected vs actual behavior
- Environment details
This project is licensed under the MIT License - see the LICENSE file for details.
- Documentation: docs.evalstack.dev
- Issues: GitHub Issues
- Community: Discord
- Email: support@evalstack.dev
Built with ❤️ by the EvalStack Team

Star us on GitHub • Follow us on Twitter • Subscribe on YouTube