llm-eval

Here are 30 public repositories matching this topic...

Giskard-AI / giskard-oss

🐢 Open-Source Evaluation & Testing library for LLM Agents

ai-security mlops fairness-ai responsible-ai ml-validation red-team-tools trustworthy-ai ml-testing llm ai-red-team ai-testing llmops llm-security llm-eval llm-evaluation rag-evaluation agent-evaluation

Updated Nov 18, 2025
Python

truera / trulens

Evaluation and Tracking for LLM Experiments and AI Agents

machine-learning neural-networks ai-agents explainable-ml agentops ai-monitoring ai-observability llms llmops llm-eval evals llm-evaluation agent-evaluation

Updated Jan 7, 2026
Python

datachain-ai / datachain

Analytics, Versioning and ETL for multimodal data: video, audio, PDFs, images

machine-learning ai cv embeddings data-analytics data-wrangling multimodal mlops llm llm-eval

Updated Jan 7, 2026
Python

uptrain-ai / uptrain

UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them.

machine-learning monitoring evaluation experimentation jailbreak-detection autoevaluation root-cause-analysis prompt-engineering llmops openai-evals llm-prompting llm-eval llm-test hallucination-detection

Updated Aug 18, 2024
Python

athina-ai / athina-evals

Python SDK for running evaluations on LLM generated responses

evaluation evaluation-metrics evaluation-framework llmops llm-eval llm-ops llm-evaluation llm-evaluation-toolkit

Updated Jun 6, 2025
Python

Re-Align / just-eval

A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.

evaluation gpt4 llm llm-eval llm-evaluation llm-evaluation-toolkit

Updated Jan 29, 2024
Python

parea-ai / parea-sdk-py

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

metrics good-first-issue llm prompt-engineering generative-ai llmops llm-eval llm-tools llm-evaluation llm-evaluation-toolkit llms-benchmarking llm-evaluation-framework

Updated Feb 13, 2025
Python

grigio / llm-eval-simple

llm-eval-simple is a simple LLM evaluation framework with intermediate actions and prompt pattern selection

llm llm-eval llm-evaluation-benchmark

Updated Dec 24, 2025
Python

whitecircle-ai / circle-guard-bench

First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)

benchmarking benchmark ai jailbreak safeguard guardrail guardrails large-language-models llm large-language-model llm-security llm-eval llm-evaluation llm-as-a-judge llm-jailbreaks

Updated Dec 3, 2025
Python

multinear / multinear

Develop reliable AI apps

reliability evaluation llm llms llm-eval llm-evaluation llms-benchmarking llm-evaluation-framework

Updated Sep 2, 2025
Python

izam-mohammed / ragrank

🎯 A free LLM evaluation toolkit for assessing factual accuracy, context understanding, tone, and more, so you can see how well your LLM applications perform.

machine-learning evaluation language-model rag llm prompt-engineering llmops llm-eval

Updated Jun 11, 2025
Python

alan-turing-institute / prompto

An open source library for asynchronous querying of LLM endpoints

python nlp machine-learning natural-language-processing deep-learning transformers transformer hut23 large-language-models llms llm-eval llm-evaluation

Updated Jul 18, 2025
Python

genia-dev / vibraniumdome

LLM Security Platform.

security openai prompts adversarial-attacks llm prompt-engineering chatgpt llmops large-language-model prompt-injection llm-serving llm-agent llm-security llm-inference llm-eval llm-framework prompt-injection-tool llm-evaluation llm-firewall

Updated Oct 28, 2024
Python

thedataquarry / structured-outputs

Structured output benchmarks comparing DSPy and BAML with different LLMs

information-extraction structured-evaluation structured-output baml dspy llm llm-eval llm-evaluation

Updated Dec 23, 2025
Python

Supahands / llm-comparison-backend

An open-source project for comparing two LLMs head to head on a given prompt. This repository contains the backend, which allows LLM APIs to be incorporated and used in the front end.

ai llm chatgpt llm-eval llm-api llm-comparison

Updated Jan 5, 2026
Python

honeyhiveai / realign

Realign is a testing and simulation framework for AI applications.

ai simulation evaluation alignment red-teaming rag prompt-engineering llms llmops llm-eval llm-evaluation aiengineering llm-evaluation-framework

Updated Dec 4, 2024
Python

IAAR-Shanghai / GuessArena

[ACL 2025] GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning

benchmark openai evaluation-framework large-language-models chatgpt llm-eval qwen deepseek knowledge-evaluation reliable-evaluation gamearena guessarena domain-specific-eval reasoning-evaluation

Updated Nov 15, 2025
Python

prompt-foundry / python-sdk

The prompt engineering, prompt management, and prompt evaluation tool for Python

python python3 open-ai llm prompt-engineering prompt-management llm-eval llm-evaluation prompt-evaluation

Updated Sep 17, 2024
Python

harshagrawal523 / GenerativeAgents

Generative agents: computational software agents that simulate believable human behavior with OpenAI LLMs. The main focus was developing a game, “Werewolves of Miller’s Hollow”, that aims to replicate human-like behavior.

docker transformers openai mongodb-atlas pygame-gui llm generative-ai llm-eval

Updated Jul 27, 2023
Python

harlev / eva-l

LLM Evaluation Framework

llm llms llm-eval llm-evaluation

Updated Nov 27, 2024
Python
