🐢 Open-Source Evaluation & Testing library for LLM Agents
Evaluation and Tracking for LLM Experiments and AI Agents
A single interface to use and evaluate different agent frameworks
Mathematical benchmark exposing the performance gap between real agents and thin LLM wrappers. Rigorous multi-dimensional evaluation with statistical validation (95% confidence intervals, Cohen's h) and a reproducible methodology. Separates architectural theater from real systems through stress testing, network-resilience checks, and failure analysis.
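The statistics named here are standard; as a point of reference, the sketch below (not this repo's code, and with made-up pass counts) shows how Cohen's h and a 95% Wilson score interval can be computed for two benchmark pass rates in Python.

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size between two proportions (e.g. benchmark pass rates)."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def wilson_ci(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate."""
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half

# Hypothetical pass counts for an "agent" vs. a plain LLM wrapper over 250 tasks.
agent_pass, wrapper_pass, n = 212, 137, 250
print("agent 95% CI:", wilson_ci(agent_pass, n))
print("wrapper 95% CI:", wilson_ci(wrapper_pass, n))
print("Cohen's h:", cohens_h(agent_pass / n, wrapper_pass / n))
```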
Catch AI agent regressions before you ship: YAML test cases, golden baselines, execution tracing, cost tracking, and CI integration. Works with LangGraph, CrewAI, Anthropic, and OpenAI.
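Golden-baseline testing, as this entry describes it, typically means pinning a known-good output per test case and failing CI when new output drifts too far from it. A minimal generic sketch (the file layout, case ID, and similarity threshold are illustrative assumptions, not this tool's API):

```python
import difflib
import pathlib

BASELINE_DIR = pathlib.Path("baselines")  # assumed layout: baselines/<case_id>.txt

def matches_golden(case_id: str, output: str, threshold: float = 0.9) -> bool:
    """Return True if the new agent output stays close to the stored golden baseline."""
    golden = (BASELINE_DIR / f"{case_id}.txt").read_text()  # golden file assumed to exist
    similarity = difflib.SequenceMatcher(None, golden, output).ratio()
    return similarity >= threshold

# Typical CI usage: fail the build when an output drifts from its baseline.
if __name__ == "__main__":
    new_output = "Refunds are issued within 30 days of purchase with a valid receipt."
    if not matches_golden("refund_policy", new_output):
        raise SystemExit("regression: agent output drifted from the golden baseline")
```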
Tune your AI agent to best meet its KPIs through an iterative cycle of analysis, improvement, and simulation
Intelligent Context Engineering Assistant for Multi-Agent Systems. Analyze, optimize, and enhance your AI agent configurations with AI-powered insights
A minimal sandbox to run, score, and compare AI agent outputs locally.
A safety-first multi-agent mental health companion with real-time distress tracking, triple-layer guardrails, and evidence-based grounding techniques. Built for Kaggle × Google Agents Intensive 2025 Capstone (Agents for Good Track)
Multi-agent customer support system built with Google ADK & Gemini 2.5 Flash Lite. Kaggle capstone demonstrating 11+ concepts. Automates 80%+ of queries with a <10 s response time.
Benchmark framework for evaluating LLM agent continual learning in stateful environments. Features production-realistic CRM workflows with multi-turn conversations, state mutations, and cross-entity relationships. Extensible to additional domains
Neurosim is a Python framework for building, running, and evaluating AI agent systems. It provides core primitives for agent evaluation, cloud storage integration, and an LLM-as-a-judge system for automated scoring.
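LLM-as-a-judge scoring, which several entries here rely on, means prompting a second model to grade an agent's output against a rubric. A minimal generic sketch using the OpenAI Python client (the prompt, rubric, and judge helper are illustrative assumptions, not Neurosim's actual API):

```python
import json
from openai import OpenAI  # assumes the official openai>=1.0 client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading an AI agent's answer.
Task: {task}
Agent answer: {answer}
Score the answer from 1 (useless) to 5 (fully correct and complete),
then reply with JSON like {{"score": <int>, "reason": "<one sentence>"}}."""

def judge(task: str, answer: str, model: str = "gpt-4o-mini") -> dict:
    """Ask a judge model to score one agent output against the task description."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(task=task, answer=answer)}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)

# Example: score a single agent transcript.
verdict = judge("Summarise the refund policy in one sentence.",
                "Refunds are available within 30 days with a receipt.")
print(verdict["score"], verdict["reason"])
```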
Qualitative benchmark suite for evaluating AI coding agents and orchestration paradigms on realistic, complex development tasks
Experiments and analysis on reflection timing in reinforcement learning agents — exploring self-evaluation, meta-learning, and adaptive reflection intervals.
Pondera is a lightweight, YAML-first framework to evaluate AI models and agents with pluggable runners and an LLM-as-a-judge.
FluxCodeBench — coding benchmark for evaluating LLM agents on multi-phase programming tasks with hidden requirements.
🤖 Benchmark AI agent capabilities, bridging the gap between hype and reality with clear metrics and insights for informed development decisions.
Agent-agnostic evaluation orchestrator to run scenarios, capture structured behaviour logs, and enable consistent post-hoc comparison across different AI agents.
Visual dashboard to evaluate multi-agent & RAG-based AI apps. Compare models on accuracy, latency, token usage, and trust metrics - powered by NVIDIA AgentIQ