evaluation-framework

Here are 14 public repositories matching this topic...

promptfoo / promptfoo

Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

testing ci evaluation ci-cd pentesting cicd vulnerability-scanners prompts evaluation-framework red-teaming rag llm prompt-engineering llmops prompt-testing llm-eval llm-evaluation llm-evaluation-framework

Updated Feb 20, 2026
TypeScript

empirical-run / empirical

Star

Test and evaluate LLMs and model configurations, across all the scenarios that matter for your application

testing test-automation evaluation-framework testing-framework llm llmops llm-inference

Updated Aug 20, 2024
TypeScript

tensorstax / agenttrace

Star

AgentTrace is a lightweight observability library to trace and evaluate agentic systems.

tracing eval agents observability evaluation-framework agentic-ai

Updated Apr 1, 2025
TypeScript

loveholidays / eval-kit

Star

TypeScript SDK for evaluating content quality using both traditional metrics and LLM-powered evaluation

evaluation eval evaluation-metrics evaluation-framework llm-evaluation

Updated Feb 3, 2026
TypeScript

sjnims / cc-plugin-eval

Star

4-stage evaluation framework for testing Claude Code plugin component triggering. Validates skills, agents, and commands activate correctly via programmatic detection and LLM judgment.

cli typescript test-automation developer-tools evaluation-framework testing-framework claude llm ai-testing anthropic claude-code claude-agent-sdk plugin-testing

Updated Feb 18, 2026
TypeScript

EntityProcess / agentv

Star

Light-weight AI agent evaluation and optimization framework

vscode evaluation-framework llms agentic-ai

Updated Feb 20, 2026
TypeScript

yukinagae / genkitx-promptfoo

Star

Community Plugin for Genkit to use Promptfoo

plugin testing firebase ai evaluation prompt prompts evaluation-framework llm llmops prompt-testing llm-eval llm-evaluation llm-evaluation-framework promptfoo genkit genkitx genkit-plugin

Updated Jan 3, 2025
TypeScript

Mercity-AI / LLM-Feedback-Collector

Star

Quick web app to collect feedback from LLM chat interactions

data-collection evaluation-framework llm-benchmarking

Updated Sep 14, 2025
TypeScript

Tuesdaythe13th / H4RB1NG3R

Star

This repository represents the transition from behavioral safety to Neural Forensics. It provides the infrastructure to detect, audit, and mitigate high-order AI risks—such as Latent Deception, Sycophancy-Masking, and Synthetic Intimacy—directly at the mechanistic activation layer.