🐢 Open-Source Evaluation & Testing library for LLM Agents
Evaluation and Tracking for LLM Experiments and AI Agents
A single interface to use and evaluate different agent frameworks
Mathematical benchmark exposing the performance gap between real agents and thin LLM wrappers. Rigorous multi-dimensional evaluation with statistical validation (95% confidence intervals, Cohen's h) and a reproducible methodology. Separates architectural theater from real systems through stress testing, network-resilience checks, and failure analysis.
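The statistics named here are standard; as a point of reference, the sketch below (not this repo's code, and with made-up pass counts) shows how Cohen's h and a 95% Wilson score interval can be computed for two benchmark pass rates in Python.

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size between two proportions (e.g. benchmark pass rates)."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def wilson_ci(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate."""
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half

# Hypothetical pass counts for an "agent" vs. a plain LLM wrapper over 250 tasks.
agent_pass, wrapper_pass, n = 212, 137, 250
print("agent 95% CI:", wilson_ci(agent_pass, n))
print("wrapper 95% CI:", wilson_ci(wrapper_pass, n))
print("Cohen's h:", cohens_h(agent_pass / n, wrapper_pass / n))
```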
Catch AI agent regressions before you ship: YAML test cases, golden baselines, execution tracing, cost tracking, and CI integration. Works with LangGraph, CrewAI, Anthropic, and OpenAI.
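Golden-baseline testing, as this entry describes it, typically means pinning a known-good output per test case and failing CI when new output drifts too far from it. A minimal generic sketch (the file layout, case ID, and similarity threshold are illustrative assumptions, not this tool's API):

```python
import difflib
import pathlib

BASELINE_DIR = pathlib.Path("baselines")  # assumed layout: baselines/<case_id>.txt

def matches_golden(case_id: str, output: str, threshold: float = 0.9) -> bool:
    """Return True if the new agent output stays close to the stored golden baseline."""
    golden = (BASELINE_DIR / f"{case_id}.txt").read_text()  # golden file assumed to exist
    similarity = difflib.SequenceMatcher(None, golden, output).ratio()
    return similarity >= threshold

# Typical CI usage: fail the build when an output drifts from its baseline.
if __name__ == "__main__":
    new_output = "Refunds are issued within 30 days of purchase with a valid receipt."
    if not matches_golden("refund_policy", new_output):
        raise SystemExit("regression: agent output drifted from the golden baseline")
```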
Tune your AI agent to best meet its KPIs through an iterative cycle of analysis, improvement, and simulation
Intelligent Context Engineering Assistant for Multi-Agent Systems. Analyze, optimize, and enhance your AI agent configurations with AI-powered insights
A minimal sandbox to run, score, and compare AI agent outputs locally.
A safety-first multi-agent mental health companion with real-time distress tracking, triple-layer guardrails, and evidence-based grounding techniques. Built for Kaggle × Google Agents Intensive 2025 Capstone (Agents for Good Track)
Multi-agent customer support system built with Google ADK & Gemini 2.5 Flash Lite. Kaggle capstone demonstrating 11+ concepts. Automates 80%+ of queries with a <10 s response time.
Benchmark framework for evaluating LLM agent continual learning in stateful environments. Features production-realistic CRM workflows with multi-turn conversations, state mutations, and cross-entity relationships. Extensible to additional domains
Neurosim is a Python framework for building, running, and evaluating AI agent systems. It provides core primitives for agent evaluation, cloud storage integration, and an LLM-as-a-judge system for automated scoring.
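LLM-as-a-judge scoring, which several entries here rely on, means prompting a second model to grade an agent's output against a rubric. A minimal generic sketch using the OpenAI Python client (the prompt, rubric, and judge helper are illustrative assumptions, not Neurosim's actual API):

```python
import json
from openai import OpenAI  # assumes the official openai>=1.0 client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading an AI agent's answer.
Task: {task}
Agent answer: {answer}
Score the answer from 1 (useless) to 5 (fully correct and complete),
then reply with JSON like {{"score": <int>, "reason": "<one sentence>"}}."""

def judge(task: str, answer: str, model: str = "gpt-4o-mini") -> dict:
    """Ask a judge model to score one agent output against the task description."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(task=task, answer=answer)}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)

# Example: score a single agent transcript.
verdict = judge("Summarise the refund policy in one sentence.",
                "Refunds are available within 30 days with a receipt.")
print(verdict["score"], verdict["reason"])
```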
Qualitative benchmark suite for evaluating AI coding agents and orchestration paradigms on realistic, complex development tasks
Experiments and analysis on reflection timing in reinforcement learning agents — exploring self-evaluation, meta-learning, and adaptive reflection intervals.
Pondera is a lightweight, YAML-first framework to evaluate AI models and agents with pluggable runners and an LLM-as-a-judge.
FluxCodeBench — coding benchmark for evaluating LLM agents on multi-phase programming tasks with hidden requirements.
🤖 Benchmark AI agent capabilities, bridging the gap between hype and reality with clear metrics and insights for informed development decisions.
Agent-agnostic evaluation orchestrator to run scenarios, capture structured behaviour logs, and enable consistent post-hoc comparison across different AI agents.
Visual dashboard to evaluate multi-agent & RAG-based AI apps. Compare models on accuracy, latency, token usage, and trust metrics - powered by NVIDIA AgentIQ