wandgibaut

Stars

LLM Evaluation

5 repositories

A unified evaluation framework for large language models

Python · 2,577 stars · 190 forks · Updated Feb 11, 2025

ScienceWorld is a text-based virtual environment centered around accomplishing tasks from the standardized elementary science curriculum.

Scala · 247 stars · 26 forks · Updated Oct 16, 2024

An extensible benchmark for evaluating large language models on planning

PDDL · 334 stars · 36 forks · Updated Mar 27, 2025

An Analytical Evaluation Board of Multi-turn LLM Agents [NeurIPS 2024 Oral]

SAS · 297 stars · 31 forks · Updated May 20, 2024

Benchmark LLMs by fighting in Street Fighter 3! The new way to evaluate the quality of an LLM

Jupyter Notebook · 1,415 stars · 169 forks · Updated Mar 21, 2025