UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection.
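For context, one common black-box UQ signal behind hallucination detection is answer consistency: sample the same prompt several times and treat low agreement as a warning sign. The sketch below illustrates that general idea only; it is not UQLM's actual API, and the similarity measure and sample strings are made up.

```python
# Generic consistency-based UQ sketch (illustrative only, not UQLM's API).
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two answers."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def consistency_score(answers: list[str]) -> float:
    """Mean pairwise similarity across sampled answers (1.0 = full agreement)."""
    pairs = list(combinations(answers, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical samples from the same prompt at non-zero temperature.
samples = [
    "The Eiffel Tower was completed in 1889.",
    "It was completed in 1889 for the World's Fair.",
    "Construction finished in 1887.",  # disagreeing sample lowers the score
]
print(f"consistency: {consistency_score(samples):.2f}")  # low score -> possible hallucination
```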
Benchmark and evaluate generative research synthesis.
Code scanner to check for issues in prompts and LLM calls
Example projects integrated with the Future AGI tech stack for easy AI development.
A comprehensive AI evaluation framework with advanced techniques, including Temperature-Controlled Verdict Aggregation via the Generalized Power Mean. It supports multiple LLM providers and 15+ evaluation metrics for RAG systems and AI agents.
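The generalized power mean behind such aggregation is standard mathematics: M_p(x) = (1/n · Σ x_i^p)^(1/p), where the exponent p acts as the "temperature", sliding the aggregate from min-like (p → −∞) through the arithmetic mean (p = 1) toward max-like (p → +∞). The sketch below applies that formula to hypothetical judge scores; the names are illustrative, not this framework's API.

```python
# Generalized power mean over per-judge verdict scores (illustrative names).
def power_mean(scores: list[float], p: float) -> float:
    """M_p(x) = (mean(x_i ** p)) ** (1 / p); p == 0 is the geometric-mean limit."""
    if p == 0:
        prod = 1.0
        for s in scores:
            prod *= s
        return prod ** (1 / len(scores))
    return (sum(s ** p for s in scores) / len(scores)) ** (1 / p)

verdicts = [0.9, 0.8, 0.3]          # per-judge scores in [0, 1]
print(power_mean(verdicts, p=-5))   # ~0.37, dominated by the worst verdict
print(power_mean(verdicts, p=1))    # ~0.67, plain average
print(power_mean(verdicts, p=5))    # ~0.79, dominated by the best verdict
```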
Running UK AISI's Inspect in the Cloud
Cost-of-Pass: An Economic Framework for Evaluating Language Models
LLM-as-a-judge for Extractive QA datasets
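As a rough illustration of the LLM-as-a-judge pattern for extractive QA (not this project's actual prompts or API), a judge can be shown the passage, question, gold span, and predicted span and asked for a one-word verdict:

```python
# Generic LLM-as-a-judge grading sketch; the judge backend is injected as a
# callable, and `fake_judge` is a hypothetical stand-in for a real LLM call.
JUDGE_TEMPLATE = """You are grading an extractive QA system.
Passage: {passage}
Question: {question}
Gold answer span: {gold}
Predicted answer span: {pred}
Does the prediction convey the same answer as the gold span?
Reply with exactly one word: CORRECT or INCORRECT."""

def grade(passage, question, gold, pred, call_judge) -> bool:
    prompt = JUDGE_TEMPLATE.format(passage=passage, question=question, gold=gold, pred=pred)
    verdict = call_judge(prompt).strip().upper()
    return verdict.startswith("CORRECT")

# Dummy backend for illustration; swap in a real LLM call in practice.
fake_judge = lambda prompt: "CORRECT"
print(grade("Paris is the capital of France.", "What is the capital of France?",
            "Paris", "the city of Paris", fake_judge))
```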
🛡️ Safe AI agents through an action classifier.
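The action-classifier idea, sketched loosely below with made-up names and a rule-based stand-in for the real classifier, is to label each proposed agent action ALLOW or BLOCK before it is executed:

```python
# Rule-based placeholder for an action classifier guarding agent execution.
BLOCKED_PATTERNS = ("rm -rf", "DROP TABLE", "transfer_funds")

def classify_action(action: str) -> str:
    """Return 'BLOCK' if the action matches a known-dangerous pattern, else 'ALLOW'."""
    lowered = action.lower()
    if any(p.lower() in lowered for p in BLOCKED_PATTERNS):
        return "BLOCK"
    return "ALLOW"

def guarded_execute(action: str, execute) -> str:
    """Run the action only if the classifier allows it."""
    if classify_action(action) == "BLOCK":
        return f"blocked unsafe action: {action!r}"
    return execute(action)

print(guarded_execute("rm -rf /", execute=lambda a: "done"))            # blocked
print(guarded_execute("list files in ./docs", execute=lambda a: "done"))  # runs
```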
JudgeGPT: An empirical research platform for evaluating the authenticity of AI-generated news.
A Python library providing evaluation metrics to compare generated texts from LLMs, often against reference texts. Features streamlined workflows for model comparison and visualization.
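A representative reference-based metric that such libraries typically expose is token-level F1 between a generated text and a reference; the sketch below is a generic illustration, not this library's API:

```python
# Token-level F1 between a generated text and a reference (illustrative only).
from collections import Counter

def token_f1(generated: str, reference: str) -> float:
    gen, ref = generated.lower().split(), reference.lower().split()
    overlap = sum((Counter(gen) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(gen), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the cat sat on the mat", "a cat sat on a mat"))  # ~0.67
```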
A comprehensive AI model evaluation framework with advanced techniques, including Temperature-Controlled Verdict Aggregation via the Generalized Power Mean. It supports multiple LLM providers and 15+ evaluation metrics for RAG systems and AI agents.
CLI tool to evaluate LLM factuality on the MMLU benchmark.
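What such a tool ultimately computes is accuracy over multiple-choice items. The sketch below shows that computation with an inline item and a stubbed model call, both purely illustrative; a real CLI would load the MMLU dataset and call an actual model:

```python
# Accuracy over MMLU-style multiple-choice items (inline data, stubbed model).
ITEMS = [
    {"question": "What is the chemical symbol for gold?",
     "choices": {"A": "Ag", "B": "Au", "C": "Gd", "D": "Go"}, "answer": "B"},
]

def ask_model(question: str, choices: dict[str, str]) -> str:
    """Placeholder model that always answers 'B'; swap in a real LLM call."""
    return "B"

correct = sum(ask_model(it["question"], it["choices"]) == it["answer"] for it in ITEMS)
print(f"accuracy: {correct / len(ITEMS):.2%}")
```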
Python SDK
A modular system for automated, multi-metric AI prompt evaluation, featuring expert models, an orchestrator, and a modern web UI.
VerifyAI is a simple UI application to test GenAI outputs
Emergent Computational Epistemology: studying AI’s emergent behaviors as non-human epistemic systems.
An A2A version of Agent Action Guard: safe AI agents through an action classifier.
Pondera is a lightweight, YAML-first framework to evaluate AI models and agents with pluggable runners and an LLM-as-a-judge.
Assert-style validation library for AI outputs - ensure your LLMs behave exactly as expected.
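Assert-style validation generally means each check raises on failure so evaluation scripts fail fast; the helper names below are hypothetical, not this library's actual API:

```python
# Assert-style checks on LLM output (hypothetical helper names).
import json

def assert_json(output: str) -> dict:
    """Fail if the output is not valid JSON; return the parsed object."""
    try:
        return json.loads(output)
    except json.JSONDecodeError as exc:
        raise AssertionError(f"output is not valid JSON: {exc}") from exc

def assert_contains(output: str, needle: str) -> None:
    assert needle in output, f"expected {needle!r} in output"

reply = '{"sentiment": "positive", "confidence": 0.93}'
parsed = assert_json(reply)
assert_contains(reply, "sentiment")
assert 0.0 <= parsed["confidence"] <= 1.0, "confidence out of range"
```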