The LLM Evaluation Framework
Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪
Open-source platform & SDK for testing LLM and agentic apps. Define expected behavior, generate and run test scenarios, and review failures collaboratively.
The official evaluation suite and dynamic data release for MixEval.
LangFair is a Python library for conducting use-case-level LLM bias and fairness assessments.
MIT-licensed framework for testing LLMs, RAG pipelines, and chatbots. Configurable via YAML and integrable into CI pipelines for automated testing; a minimal sketch of this pattern follows.
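The sketch below illustrates the general YAML-in-CI pattern only: the config keys (`tests`, `prompt`, `must_contain`) and the `run_model` helper are hypothetical placeholders, not the schema or API of any particular framework listed here.

```python
# Minimal sketch: YAML-driven LLM test cases run as a CI step.
# The config schema and run_model() are hypothetical placeholders.
import sys
import yaml  # PyYAML

CONFIG = """
tests:
  - name: refund-policy
    prompt: "What is the refund window?"
    must_contain: "30 days"
  - name: greeting
    prompt: "Say hello to the user."
    must_contain: "hello"
"""

def run_model(prompt: str) -> str:
    # Placeholder: swap in a real LLM call (hosted API or local model).
    return "Refunds are accepted within 30 days of purchase. Hello!"

def main() -> int:
    failures = 0
    for case in yaml.safe_load(CONFIG)["tests"]:
        output = run_model(case["prompt"])
        if case["must_contain"].lower() in output.lower():
            print(f"PASS {case['name']}")
        else:
            print(f"FAIL {case['name']}: expected {case['must_contain']!r}")
            failures += 1
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main())  # non-zero exit code fails the CI job
```

Returning a non-zero exit code on any failed case is what lets a plain CI runner (GitHub Actions, GitLab CI, etc.) mark the pipeline red without any framework-specific plugin.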
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
An easy Python package for running quick, basic QA evaluations. It includes standardized QA and semantic evaluation metrics: exact match, F1 score, PEDANT semantic match, and transformer match, with prompting and evaluation for both black-box and open-source large language models. The package also supports prompting the OpenAI and Anthropic APIs. The lexical metrics it names are sketched below.
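For reference, here is what the two lexical metrics mentioned above (exact match and token-level F1, as popularized by SQuAD-style QA evaluation) typically compute; this is an illustrative sketch, not this package's API.

```python
# SQuAD-style exact match and token-level F1 for QA answers.
from collections import Counter

def normalize(text: str) -> list[str]:
    # Lowercase and strip basic punctuation before comparing tokens.
    return text.lower().replace(".", " ").replace(",", " ").split()

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = normalize(prediction), normalize(reference)
    common = Counter(pred) & Counter(ref)   # per-token overlap counts
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("30 days", "30 days."))               # 1.0
print(round(token_f1("within 30 days", "30 days"), 2))  # 0.8
```

Semantic metrics such as PEDANT or transformer match go beyond this by scoring meaning rather than token overlap.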
Develop reliable AI apps
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
Open source framework for evaluating AI Agents
FM-Leaderboard-er lets you create a leaderboard to find the best LLM/prompt for your own business use case, based on your data, tasks, and prompts.
Realign is a testing and simulation framework for AI applications.
Estimates a confidence measure that outputs generated by Transformer-based language models are non-hallucinated.
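One common proxy for such confidence is the mean log-probability a model assigns to its own output tokens; the sketch below shows that estimator with Hugging Face transformers. This is an assumption about a typical approach, not the specific method used by the project above.

```python
# Sketch: mean token log-probability of a completion as a confidence proxy.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM checkpoint works for this sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def mean_logprob(prompt: str, completion: str) -> float:
    """Average log-probability of the completion tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits            # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]                      # next-token targets
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    completion_lp = token_lp[:, prompt_ids.shape[1] - 1 :]  # completion span
    return completion_lp.mean().item()

score = mean_logprob("The capital of France is", " Paris.")
print(f"mean token log-prob: {score:.3f}")  # closer to 0 = higher confidence
```

Values near zero indicate the model found its own output likely; strongly negative values often correlate with uncertain or hallucinated content, though the correlation is imperfect.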
Multilingual Evaluation Toolkits
Shin Rakuda is a comprehensive framework for evaluating and benchmarking Japanese large language models, offering researchers and developers a flexible toolkit for assessing LLM performance across diverse datasets.
A prompt evaluation framework that provides comprehensive testing for AI agents across multiple providers.
Hackable, simple LLM evals on preference datasets.
VerifyAI is a simple UI application to test GenAI outputs
A Cross-Lingual Lexical Meanings Benchmark for evaluating synonym identification in large language models.