The LLM Evaluation Framework
Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪
Open-source platform & SDK for testing LLM and agentic apps. Define expected behavior, generate and run test scenarios, and review failures collaboratively.
The official evaluation suite and dynamic data release for MixEval.
LangFair is a Python library for conducting use-case-level LLM bias and fairness assessments.
MIT-licensed framework for testing LLMs, RAG pipelines, and chatbots. Configurable via YAML and integrable into CI pipelines for automated testing; a minimal sketch of this pattern follows.
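The sketch below illustrates the general YAML-in-CI pattern only: the config keys (`tests`, `prompt`, `must_contain`) and the `run_model` helper are hypothetical placeholders, not the schema or API of any particular framework listed here.

```python
# Minimal sketch: YAML-driven LLM test cases run as a CI step.
# The config schema and run_model() are hypothetical placeholders.
import sys
import yaml  # PyYAML

CONFIG = """
tests:
  - name: refund-policy
    prompt: "What is the refund window?"
    must_contain: "30 days"
  - name: greeting
    prompt: "Say hello to the user."
    must_contain: "hello"
"""

def run_model(prompt: str) -> str:
    # Placeholder: swap in a real LLM call (hosted API or local model).
    return "Refunds are accepted within 30 days of purchase. Hello!"

def main() -> int:
    failures = 0
    for case in yaml.safe_load(CONFIG)["tests"]:
        output = run_model(case["prompt"])
        if case["must_contain"].lower() in output.lower():
            print(f"PASS {case['name']}")
        else:
            print(f"FAIL {case['name']}: expected {case['must_contain']!r}")
            failures += 1
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main())  # non-zero exit code fails the CI job
```

Returning a non-zero exit code on any failed case is what lets a plain CI runner (GitHub Actions, GitLab CI, etc.) mark the pipeline red without any framework-specific plugin.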
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
An easy Python package for running quick, basic QA evaluations. It includes standardized QA and semantic evaluation metrics: exact match, F1 score, PEDANT semantic match, and transformer match, with prompting and evaluation for both black-box and open-source large language models. The package also supports prompting the OpenAI and Anthropic APIs. The lexical metrics it names are sketched below.
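For reference, here is what the two lexical metrics mentioned above (exact match and token-level F1, as popularized by SQuAD-style QA evaluation) typically compute; this is an illustrative sketch, not this package's API.

```python
# SQuAD-style exact match and token-level F1 for QA answers.
from collections import Counter

def normalize(text: str) -> list[str]:
    # Lowercase and strip basic punctuation before comparing tokens.
    return text.lower().replace(".", " ").replace(",", " ").split()

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = normalize(prediction), normalize(reference)
    common = Counter(pred) & Counter(ref)   # per-token overlap counts
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("30 days", "30 days."))               # 1.0
print(round(token_f1("within 30 days", "30 days"), 2))  # 0.8
```

Semantic metrics such as PEDANT or transformer match go beyond this by scoring meaning rather than token overlap.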
Develop reliable AI apps
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
Open source framework for evaluating AI Agents
FM-Leaderboard-er lets you create a leaderboard to find the best LLM/prompt for your own business use case, based on your data, tasks, and prompts.
Realign is a testing and simulation framework for AI applications.
Estimates a confidence measure that outputs generated by Transformer-based language models are non-hallucinated.
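One common proxy for such confidence is the mean log-probability a model assigns to its own output tokens; the sketch below shows that estimator with Hugging Face transformers. This is an assumption about a typical approach, not the specific method used by the project above.

```python
# Sketch: mean token log-probability of a completion as a confidence proxy.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM checkpoint works for this sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def mean_logprob(prompt: str, completion: str) -> float:
    """Average log-probability of the completion tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits            # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]                      # next-token targets
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    completion_lp = token_lp[:, prompt_ids.shape[1] - 1 :]  # completion span
    return completion_lp.mean().item()

score = mean_logprob("The capital of France is", " Paris.")
print(f"mean token log-prob: {score:.3f}")  # closer to 0 = higher confidence
```

Values near zero indicate the model found its own output likely; strongly negative values often correlate with uncertain or hallucinated content, though the correlation is imperfect.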
Multilingual Evaluation Toolkits
Shin Rakuda is a comprehensive framework for evaluating and benchmarking Japanese large language models, offering researchers and developers a flexible toolkit for assessing LLM performance across diverse datasets.
A prompt evaluation framework that provides comprehensive testing for AI agents across multiple providers.
Hackable, simple LLM evals on preference datasets.
VerifyAI is a simple UI application to test GenAI outputs
A Cross-Lingual Lexical Meanings Benchmark for evaluating synonym identification in large language models.