A test runner for agentskills.io-style AI agent skills
Updated May 7, 2026 - TypeScript
Benchmark, evaluate, and optimize skills to ensure reliable performance across all LLMs
Visualize LLM outputs against datasets, manually annotate results, and run automated evaluations to algorithmically optimize prompts.
An implementation of Anthropic's paper and essay "A statistical approach to model evaluations"
Create an evaluation framework for your LLM-based app. Incorporate it into your test suite. Lay the foundation for monitoring.
Squeeze your model with pressure prompts to see if its behavior leaks.
Codex-native autoresearch harness with structured worker/judge turns for optimizing anything you can measure.
A framework for evaluating large language models (LLMs) across a variety of tasks.
Detecting Relational Boundary Erosion in AI systems. A framework for testing whether models maintain honest, calibrated, and appropriate boundaries.
Disposable Daytona sandboxes for LLM evals and isolated command execution
Evaluation patterns, release gates, and anti-hallucination techniques for developer-focused AI workflows.
7 Claude Code skills for software architecture review (Python, web, cloud, microservices). Includes A/B benchmarks against an unskilled baseline, an assertion-graded eval suite, and interactive dashboards.
Local-first LLM evaluation runner with baselines, caching, markdown reports, and CI-friendly quality, latency, and cost gates.
Evaluates LLM responses and measures their accuracy.
This project demonstrates a production-grade Evaluation (Evals) Framework used to benchmark multiple Large Language Models (LLMs) against a "Source of Truth" NBA dataset.
Evaluation and reliability harness for agentic LLM systems, with task success, latency, cost, retries, fallback routing, and failure taxonomy.
Offline-first eval harness for comparing OpenAI-compatible gateway route fixtures
Sovereign Adversarial Simulation & Interdiction Engine for the 0.05V Standard.
Synthetic marketplace benchmark harness with deterministic demo and Codex subagent pilot