A framework for few-shot evaluation of language models (see the usage sketch after this list).
The LLM Evaluation Framework
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
This is the repository of our article published in RecSys 2019 "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches" and of several follow-up studies.
Data-Driven Evaluation for LLM-Powered Applications
AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.
Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.
Python SDK for running evaluations on LLM generated responses
The official evaluation suite and dynamic data release for MixEval.
Moonshot - A simple and modular tool to evaluate and red-team any LLM application.
A research library for automating experiments on Deep Graph Networks
PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection
A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.
Evaluation suite for large-scale language models.
Multilingual Large Language Models Evaluation Benchmark
🔥[VLDB'24] Official repository for the paper “The Dawn of Natural Language to SQL: Are We Fully Ready?”
Optical Flow Dataset and Benchmark for Visual Crowd Analysis
Evaluation framework for oncology foundation models (FMs)
The robust European language model benchmark.
Industrial-grade evaluation benchmarks for coding LLMs across the full life-cycle of AI-native software development. Enterprise-level evaluation suite for code LLMs, continuously being released.
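Most of the toolkits above expose either a CLI or a Python API for scoring a model on benchmark tasks: define the tasks or test cases, pick metrics, and call an evaluate entry point. As a minimal illustration, the sketch below assumes the lm-evaluation-harness package and its simple_evaluate entry point; the model checkpoint, task names, few-shot count, and batch size are placeholder choices, not recommendations.

```python
# Minimal sketch (assumes the lm-evaluation-harness package: pip install lm-eval).
# Model, tasks, and few-shot count below are illustrative placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m",  # any causal LM checkpoint
    tasks=["hellaswag", "arc_easy"],                 # benchmark tasks to score
    num_fewshot=5,                                   # in-context examples per prompt
    batch_size=8,
)

# Aggregated per-task metrics (accuracy, etc.) are reported under results["results"].
print(results["results"])
```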