clembench: A Framework for the Systematic Evaluation of Chat-Optimized Language Models as Conversational Agents

UPDATE (16.02.24): We released v0.3 of the benchmark code. The main branch will continue as v1.0-beta, which has changes that affect the game code. Follow this guide to update your game.

The cLLM (chat-optimized Large Language Model, "clem") framework tests the ability of such models to engage in games – rule-constituted activities played using language. The framework offers a systematic way of probing the situated language understanding of language-using agents.

This repository contains the code for setting up the framework and implements a number of games, which are discussed further in:

Chalamalasetti, K., Götze, J., Hakimov, S., Madureira, B., Sadler, P., & Schlangen, D. (2023). clembench: Using Game Play to Evaluate Chat-Optimized Language Models as Conversational Agents (arXiv:2305.13455). arXiv. https://doi.org/10.48550/arXiv.2305.13455

Evaluation Results

Results are available on the main project website, under leaderboard.

Game details

A Simple Word Game: taboo
A Word-Guessing Game Based on Clues: wordle
Drawing Instruction Giving and Following: image
An ASCII Picture Reference Game: reference
Scorekeeping: private and shared

Using the benchmark

This repository is tested on Python 3.8+.
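To confirm that a local interpreter meets this requirement before installing dependencies, a small check like the following may help. This is only an illustrative sketch: the helper name `is_supported` and the constant `MIN_PYTHON` are hypothetical and not part of the clembench code base.

```python
import sys

# Minimum version taken from the statement above ("tested on Python 3.8+").
MIN_PYTHON = (3, 8)


def is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) part of version_info is at least `minimum`.

    Hypothetical helper for illustration; not provided by clembench.
    """
    return tuple(version_info[:2]) >= tuple(minimum)


# Warn (rather than abort) if the running interpreter is older than the tested range.
if not is_supported():
    found = ".".join(str(part) for part in sys.version_info[:3])
    print(f"Warning: Python {found} is older than the tested 3.8+")
```

Comparing tuples of integers avoids the pitfalls of comparing version strings, where "3.10" would sort before "3.8".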

We welcome you to contribute to or extend the benchmark with your own games and models. Please simply open a pull request. You can find more information on how to use the benchmark in the links below.

