Production-grade platform for evaluating, observing, and optimizing Generative AI pipelines
Quick Start • Documentation • Architecture • Demo
EvalStack is a comprehensive platform designed to evaluate, observe, and optimize Generative AI pipelines including LLMs, RAG systems, and AI agents. Built with production-grade architecture, it provides real-time monitoring, automated optimization recommendations, and scalable evaluation capabilities.
- Comprehensive Evaluation: Multiple evaluation adapters (RAGAS, DeepEval, OpenAI Evals, AgentEvals)
- Real-time Monitoring: Prometheus metrics, Grafana dashboards, and OpenTelemetry tracing
- Optimization Agent: Rule-based and LLM recommendations with A/B testing capabilities
- Scalable Architecture: Kubernetes-ready with horizontal scaling and auto-scaling
- Production Ready: CI/CD pipeline, comprehensive testing, and operational procedures
- Developer Friendly: OpenAPI documentation, SDKs, and comprehensive tooling
```bash
# Clone repository
git clone https://github.com/evalstack/evalstack.git
cd evalstack
# Start all services
docker-compose -f infra/docker-compose/docker-compose.yml up -d
# Wait for services to be ready (2-3 minutes)
docker-compose logs -f
# Access services
# Frontend: http://localhost:3000
# Backend API: http://localhost:8000
# API Docs: http://localhost:8000/docs
# Grafana: http://localhost:3000 (admin/admin)
```
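Once the containers are healthy, you can exercise the API from Python. A minimal sketch using `requests`; the `/health` and `/evaluations` paths and the payload shape are assumptions for illustration, so treat the live OpenAPI docs at http://localhost:8000/docs as the source of truth.

```python
import requests

BASE_URL = "http://localhost:8000"

# Check that the backend is up (endpoint path is an assumption; see /docs)
resp = requests.get(f"{BASE_URL}/health", timeout=10)
print(resp.status_code, resp.json())

# Submit a single evaluation (payload shape is hypothetical)
payload = {
    "adapter": "ragas",
    "inputs": [
        {
            "question": "What does EvalStack evaluate?",
            "answer": "LLM, RAG, and agent pipelines.",
            "contexts": ["EvalStack evaluates GenAI pipelines."],
        }
    ],
}
resp = requests.post(f"{BASE_URL}/evaluations", json=payload, timeout=30)
print(resp.json())
```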
```bash
# Deploy to local Kubernetes (kind/minikube)
./infra/deploy_demo.sh --local
# Deploy to cloud (AWS/GCP/Azure)
./infra/deploy_demo.sh --cloud --image-tag v1.0.0
# Deploy with Helm
helm repo add evalstack https://charts.evalstack.dev
helm install evalstack evalstack/evalstack
```
```bash
# Run comprehensive demo
./demo.sh
# Run specific demo sections
./demo.sh --api-only
./demo.sh --frontend-only
./demo.sh --monitoring
```

EvalStack follows a microservices architecture designed for scalability and reliability:

```mermaid
graph TB
subgraph "Frontend Layer"
FE[Next.js Dashboard<br/>Port: 3000]
end
subgraph "API Layer"
API[FastAPI Backend<br/>Port: 8000]
end
subgraph "Processing Layer"
WORKER[Celery Workers<br/>Async Processing]
end
subgraph "Data Layer"
DB[(PostgreSQL<br/>Port: 5432)]
REDIS[(Redis Cache<br/>Port: 6379)]
MQ[RabbitMQ<br/>Port: 5672]
end
subgraph "Monitoring Layer"
PROM[Prometheus<br/>Metrics Collection]
GRAF[Grafana<br/>Dashboards]
end
FE --> API
API --> DB
API --> REDIS
API --> MQ
MQ --> WORKER
WORKER --> DB
WORKER --> REDIS
API --> PROM
PROM --> GRAF
```
- Frontend: Next.js dashboard with real-time updates and interactive visualizations
- Backend: FastAPI with comprehensive evaluation APIs and OpenAPI documentation
- Workers: Celery async processing for evaluation tasks with horizontal scaling (see the sketch after this list)
- Database: PostgreSQL with optimized schemas and connection pooling
- Cache: Redis for session management and result caching
- Message Queue: RabbitMQ for reliable task distribution
- Monitoring: Prometheus + Grafana observability stack with custom metrics
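To make the flow above concrete, here is a rough sketch of how the API could hand work to a Celery worker and how the worker caches results in Redis. The task name and payload are hypothetical, not EvalStack's actual code; only the broker and cache ports match the docker-compose defaults above.

```python
import json

import redis
from celery import Celery

# RabbitMQ as the broker (port 5672), Redis as the result cache (port 6379)
app = Celery("evalstack", broker="amqp://guest:guest@localhost:5672//")
cache = redis.Redis(host="localhost", port=6379, db=0)

@app.task(name="evaluations.run")  # hypothetical task name
def run_evaluation(evaluation_id: str, payload: dict) -> dict:
    # ...run the configured evaluation adapter against the payload...
    result = {"evaluation_id": evaluation_id, "scores": {"faithfulness": 0.92}}
    # Cache so the API can serve the result without a database round-trip
    cache.set(f"evaluation:{evaluation_id}", json.dumps(result), ex=3600)
    return result
```

The API enqueues with `run_evaluation.delay(...)` and returns immediately; the worker persists and caches the result when it finishes.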
Experience EvalStack with our interactive demo:
```bash
# Start the demo
./demo.sh
```

The demo showcases:
- API endpoints and evaluation submission
- Real-time monitoring and metrics
- Optimization recommendations
- Frontend dashboard
- Performance benchmarks
EvalStack includes comprehensive sample data:
- 5 Sample Evaluations: Real evaluation results with metrics
- 6 Optimization Recommendations: AI-generated improvement suggestions
- Performance Benchmarks: Load and stress test results
- Architecture Diagrams: Visual system overview
- API Reference - Complete API documentation with examples
- Deployment Guide - Production deployment instructions
- Operational Runbook - Operations and troubleshooting
- Architecture - System design and components
- Resume Bullets - Technical achievements and impact
- Benchmark Results - Performance test results
- Demo Data - Sample evaluations and recommendations
EvalStack delivers production-grade performance:
| Metric | EvalStack | Industry Average |
|---|---|---|
| Response Time (P95) | < 2s | 2.5s |
| Throughput | 100+ req/min | 80 req/min |
| Availability | 99.9% | 99.0% |
| Error Rate | < 1% | 2% |
- Load Test: 100 concurrent users, 99.2% success rate
- Stress Test: 300 concurrent users, 94.1% success rate
- Scalability: Horizontal scaling to 500+ users
- Reliability: 99.9% uptime with automated recovery
- Multi-Adapter Support: RAGAS, DeepEval, OpenAI Evals, AgentEvals (see the adapter sketch after this list)
- Comprehensive Metrics: Factual accuracy, hallucination detection, context precision
- Real-time Processing: Async evaluation with progress tracking
- Batch Operations: Support for bulk evaluation processing
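What "multi-adapter" means in practice: every evaluation backend is wrapped behind one small interface, so the rest of the pipeline never cares which library scored a sample. A sketch under assumed names; the real `adapters/` package will differ.

```python
from abc import ABC, abstractmethod

class EvaluationAdapter(ABC):
    """Hypothetical common interface for evaluation backends."""

    name: str

    @abstractmethod
    def evaluate(self, question: str, answer: str, contexts: list[str]) -> dict[str, float]:
        """Return a mapping of metric name to score for one sample."""

class RagasAdapter(EvaluationAdapter):
    name = "ragas"

    def evaluate(self, question, answer, contexts):
        # A real adapter would call the ragas library here;
        # the scores below are placeholders.
        return {"faithfulness": 0.9, "context_precision": 0.8}

def run_batch(adapter: EvaluationAdapter, samples: list[dict]) -> list[dict[str, float]]:
    # Batch operations reduce to mapping one adapter over many samples
    return [adapter.evaluate(s["question"], s["answer"], s["contexts"]) for s in samples]
```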
- Custom Metrics: Business KPIs and system performance indicators (see the export sketch after this list)
- Real-time Dashboards: Grafana dashboards with interactive visualizations
- Alerting: Intelligent thresholds with escalation procedures
- Distributed Tracing: OpenTelemetry for request flow analysis
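Custom metrics like these are typically exported with the standard `prometheus_client` library; the metric names below are illustrative, not the ones EvalStack registers.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names, not EvalStack's actual ones
EVALUATIONS_TOTAL = Counter(
    "evaluations_total", "Evaluations processed", ["adapter", "status"]
)
EVALUATION_SECONDS = Histogram(
    "evaluation_duration_seconds", "Time spent per evaluation", ["adapter"]
)

def process_one(adapter: str) -> None:
    with EVALUATION_SECONDS.labels(adapter=adapter).time():
        time.sleep(random.uniform(0.05, 0.2))  # stand-in for real work
    EVALUATIONS_TOTAL.labels(adapter=adapter, status="success").inc()

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes http://localhost:9100/metrics
    while True:
        process_one("ragas")
```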
- Rule-based Recommendations: Automated performance improvement suggestions (see the sketch after this list)
- A/B Testing: Built-in experimentation framework
- ML-powered Insights: Predictive analytics and trend analysis
- Continuous Optimization: Automated pipeline improvement
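At its simplest, the rule-based layer is a set of thresholds over recent metrics. A minimal sketch; the metric names, thresholds, and wording are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Recommendation:
    rule: str
    message: str

# Metric names and thresholds are illustrative assumptions
RULES = [
    ("high_latency", lambda m: m.get("p95_latency_s", 0.0) > 2.0,
     "P95 latency above 2s: consider caching retrieval results or scaling workers."),
    ("low_faithfulness", lambda m: m.get("faithfulness", 1.0) < 0.7,
     "Faithfulness below 0.7: tighten retrieval filters or add grounding checks."),
    ("high_error_rate", lambda m: m.get("error_rate", 0.0) > 0.01,
     "Error rate above 1%: inspect recent traces for failing adapter calls."),
]

def recommend(metrics: dict[str, float]) -> list[Recommendation]:
    return [Recommendation(rule, msg) for rule, check, msg in RULES if check(metrics)]

print(recommend({"p95_latency_s": 2.4, "faithfulness": 0.65, "error_rate": 0.002}))
```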
- Security: Network policies, secrets management, audit logging
- Scalability: Auto-scaling, load balancing, resource optimization
- Reliability: Circuit breakers, health checks, automated recovery (see the circuit-breaker sketch after this list)
- Compliance: Data protection, audit trails, governance frameworks
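As one example of the reliability patterns above, a minimal circuit breaker stops hammering a failing downstream dependency and retries after a cooldown. This is a sketch of the pattern, not EvalStack's implementation:

```python
import time

class CircuitBreaker:
    """Open after repeated failures; allow a trial call after a cooldown."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: downstream unavailable")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

In production this usually lives in a shared client library or a service mesh rather than being hand-rolled per call site.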
```bash
# Install dependencies
pip install -r backend/requirements.txt
cd frontend && npm install
# Run tests
pytest tests/
npm test
# Run benchmarks
./benchmarks/scripts/run_benchmarks.sh
# Code quality
black backend/ worker/ adapters/
ruff check backend/ worker/ adapters/
mypy backend/ worker/ adapters/
```

- Unit Tests: 85%+ coverage with pytest
- Integration Tests: API endpoint testing (see the sketch after this list)
- Performance Tests: k6 load and stress testing
- End-to-End Tests: Complete workflow validation
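An API-level test usually follows the standard FastAPI pattern with `TestClient`. The app and route below are stand-ins; in the repo you would import the real application instead.

```python
from fastapi import FastAPI
from fastapi.testclient import TestClient

# Stand-in app; replace with the real FastAPI instance from backend/
app = FastAPI()

@app.get("/health")
def health() -> dict:
    return {"status": "ok"}

client = TestClient(app)

def test_health_endpoint():
    resp = client.get("/health")
    assert resp.status_code == 200
    assert resp.json() == {"status": "ok"}
```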
- Automated Testing: Unit, integration, and performance tests
- Security Scanning: Vulnerability detection with Trivy
- Multi-environment Deployment: Staging and production pipelines
- Release Automation: Automated versioning and artifact publishing
We welcome contributions! Please see our Contributing Guide for details.
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
Found a bug? Please open an issue with:
- Clear description of the problem
- Steps to reproduce
- Expected vs actual behavior
- Environment details
This project is licensed under the MIT License - see the LICENSE file for details.
- Documentation: docs.evalstack.dev
- Issues: GitHub Issues
- Community: Discord
- Email: support@evalstack.dev
Built with ❤️ by the EvalStack Team

Star us on GitHub • Follow us on Twitter • Subscribe on YouTube