The open-source MultiAgentOps evaluation harness for any industry scenario.
A minimal, code-first retrieval observability harness that measures why RAG systems fail to surface relevant evidence, without changing retrieval or generation.
frontier-evals-harness is a lightweight framework for benchmarking frontier language models. It provides deterministic suite versioning, modular adapters, standardized scoring, and paired statistical comparisons with confidence intervals. Built for regression tracking and analysis, it enables reproducible evaluation without infrastructure.
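To illustrate what a "paired statistical comparison with confidence intervals" can look like, here is a minimal, self-contained sketch of a paired bootstrap over per-example score differences. This is a hypothetical illustration only, not the frontier-evals-harness API; all names here are invented for the example.

```python
# Hypothetical sketch of a paired model comparison with a bootstrap
# confidence interval; NOT the frontier-evals-harness API.
import random

def paired_bootstrap_ci(scores_a, scores_b, iters=10_000, alpha=0.05, seed=0):
    """CI for mean(score_a - score_b), pairing scores on the same examples."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)  # fixed seed for reproducible resampling
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample the paired differences with replacement and collect means.
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(iters)
    )
    lo = means[int((alpha / 2) * iters)]
    hi = means[int((1 - alpha / 2) * iters) - 1]
    return sum(diffs) / n, (lo, hi)

# Toy per-example pass/fail scores for two models on the same suite.
a = [1, 1, 0, 1, 1, 0, 1, 1]
b = [1, 0, 0, 1, 0, 0, 1, 1]
mean_diff, (lo, hi) = paired_bootstrap_ci(a, b)
```

Pairing on the same examples removes between-example variance, so the interval reflects only the model-to-model difference; this is why paired comparisons are standard for regression tracking across suite versions.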
Controlled experiment isolating reranking as a first-class RAG system boundary, measuring how evidence priority—not recall—changes retrieval outcomes.
Detecting Relational Boundary Erosion in AI systems. A framework for testing whether models maintain honest, calibrated, and appropriate boundaries.