A benchmark for evaluating Large Language Models through the game of Bananagrams. LLMs must build valid crossword-style boards using structured output formats, demonstrating spatial reasoning, constraint satisfaction, and multi-turn strategic decision-making.
- Setup: Each LLM receives a starting hand of tiles (21 for 1-4 players)
- Turn Loop: On each turn (see the sketch after this list):
  - The LLM receives its current hand, the game state, and feedback from the previous turn
  - The LLM generates a board specification
  - The board is validated against structure rules, grid conflicts, and the dictionary
  - Auto-PEEL: If the board is valid and uses all tiles, everyone draws one tile
  - Auto-BANANAS: If the bunch is empty and the board is valid, the player wins!
- Actions: Players can use `DUMP X` to exchange a difficult tile for three new ones
- End: The first LLM to build a valid board using all of its tiles when the bunch is empty wins the game!
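The loop can be sketched roughly as follows. This is a simplified illustration of the rules above, not the project's actual code; `generate_board`, `validate`, and the placeholder tile pool are hypothetical stand-ins for the LLM call and the board verifier.

```python
# Rough sketch of the turn loop described above (hypothetical structure,
# not the actual implementation).
import random
from typing import Callable

def play_game(players: list[str],
              generate_board: Callable,  # LLM call: (hand, feedback) -> board spec
              validate: Callable,        # verifier: (board, hand) -> (valid, uses_all, errors)
              max_turns: int = 500,
              seed: int = 42) -> str | None:
    rng = random.Random(seed)
    bunch = list("ABANANAGRAMS" * 12)    # placeholder tile pool, not the real distribution
    rng.shuffle(bunch)

    hands = {p: [bunch.pop() for _ in range(21)] for p in players}  # starting hands
    feedback = {p: "" for p in players}

    for _ in range(max_turns):
        for p in players:
            board = generate_board(hands[p], feedback[p])
            valid, uses_all_tiles, errors = validate(board, hands[p])

            if valid and uses_all_tiles:
                if not bunch:
                    return p                     # Auto-BANANAS: bunch empty -> win
                for q in players:                # Auto-PEEL: everyone draws one tile
                    if bunch:
                        hands[q].append(bunch.pop())
            feedback[p] = "\n".join(errors)      # fed back to the LLM next turn
    return None                                  # no winner within max_turns
```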
- Python 3.13+
- uv package manager
```bash
# Clone the repository
git clone https://github.com/yourusername/banana-bench.git
cd banana-bench

# Install dependencies with uv
uv sync

# Set up your LLM API key (for OpenAI, Anthropic, etc.)
export OPENAI_API_KEY="your-api-key"
```

For a complete example configuration, run output, and visualization, see the /example directory.
Create a YAML config file to customize your benchmark:
Example:

```yaml
max_turns: 500
seed: 42
players:
  - model: gpt-5.2-2025-12-11
    name: "GPT-5.2"
    temperature: 1
  - model: gpt-5.2-2025-12-11
    name: "GPT-5.2 - Medium Reasoning"
    temperature: 1
    reasoning_effort: medium
```

Key Features:
- Each player can have a different model, temperature, and max_tokens
- Pass provider-specific kwargs (like Claude's `thinking` parameter)
- Mix and match any LiteLLM-supported parameters (see the sketch after this list)
- Optional custom names
- The number of players is automatically determined by the players list
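As a rough illustration of how per-player entries can map onto LiteLLM calls, the sketch below loads a config and forwards every extra key as a keyword argument. This is not the project's actual code; `litellm.completion` and `yaml.safe_load` are real APIs, but the prompt text is made up, and whether a given extra key (e.g. `reasoning_effort` or `thinking`) is accepted depends on LiteLLM's support for that provider.

```python
# Sketch: forwarding per-player config entries to LiteLLM (illustrative only,
# not the project's implementation).
import yaml
import litellm

with open("example_config.yaml") as f:
    config = yaml.safe_load(f)

for player in config["players"]:
    # Anything beyond "model" and "name" is treated as a LiteLLM kwarg,
    # e.g. temperature, reasoning_effort, or Claude's `thinking` parameter.
    extra = {k: v for k, v in player.items() if k not in ("model", "name")}
    response = litellm.completion(
        model=player["model"],
        messages=[{"role": "user", "content": "Spell valid words using these tiles: B A N A N A S"}],
        **extra,
    )
    print(player.get("name", player["model"]), response.choices[0].message.content)
```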
```bash
# Run a benchmark with the example config
uv run python -m src.main example_config.yaml

# Run with verbose output to see the game play out
uv run python -m src.main example_config.yaml --verbose

# Save results to a specific location
uv run python -m src.main example_config.yaml --output results/my_run.json
```

If a benchmark is interrupted (API quota limits, crashes, etc.), you can resume from where it left off:
```bash
# Resume from a saved result file
uv run python -m src.main --resume interrupted_run.json --verbose

# Resume and save to a specific output file
uv run python -m src.main --resume interrupted_run.json --output final_run.json
```

Banana-Bench includes an interactive HTML visualizer that lets you watch games play out turn-by-turn with animations and player insights.
Generate during benchmark run:

```bash
# Run benchmark and automatically create visualizer
uv run python -m src.main configs/example.yaml --visualize
```

Generate from existing results:
```bash
# Convert any results JSON to an interactive visualizer
uv run python -m src.visualize game.json

# Specify custom output location
uv run python -m src.visualize game.json --output my_viz.html
```

The verifier checks boards through multiple stages:
- Parsing: Board format must be correct
- Structure: Letter matches, perpendicularity, index bounds
- Grid: No overlapping letter conflicts
- Words: All words must be in the TWL dictionary
- Tiles: Must use exactly the tiles in hand

Cascading Errors: The system filters downstream errors to show only root causes (see the sketch after this list):
- Parsing errors hide all downstream validation errors
- Structural errors hide grid conflicts and accidental words
- Errors are capped at 5, each with an actionable tip
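A minimal, self-contained sketch of the root-cause filtering idea is shown below, assuming each error is tagged with the stage that produced it. The stage names follow the list above, but the function and data shapes are hypothetical, not the real verifier.

```python
# Sketch of root-cause error filtering (illustrative, not the actual verifier).
# Stages are ordered; once a stage produces errors, later stages' errors are
# hidden because they are usually downstream consequences.
STAGES = ["parsing", "structure", "grid", "words", "tiles"]
MAX_ERRORS = 5

def filter_root_causes(errors: list[tuple[str, str]]) -> list[str]:
    """errors is a list of (stage, message) pairs collected across all stages."""
    for stage in STAGES:
        stage_errors = [msg for s, msg in errors if s == stage]
        if stage_errors:
            return stage_errors[:MAX_ERRORS]  # report only the earliest failing stage
    return []                                 # board passed every stage

# Example: a structural error suppresses the downstream dictionary complaint.
errors = [
    ("structure", "Word 2 does not intersect word 1 perpendicularly"),
    ("words", "'QZX' is not in the TWL dictionary"),
]
print(filter_root_causes(errors))
# -> ["Word 2 does not intersect word 1 perpendicularly"]
```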
Benchmark results are saved as JSON files containing the following (a loading sketch appears after the list):
- Full configuration
- Turn-by-turn history with validations
- Complete conversation history for each player
- Final game state and outcome
- Timing information
- Per-turn and total token counts (prompt, completion, and total tokens)
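A small sketch of inspecting one of these result files is shown below. The field names (`winner`, `turns`, `players`, `token_usage`, ...) are illustrative assumptions, not the documented schema; check an actual output file for the real keys.

```python
# Sketch: summarizing a saved result file (field names are illustrative
# assumptions; inspect a real results JSON for the actual schema).
import json

with open("results/my_run.json") as f:
    results = json.load(f)

print("Winner:", results.get("winner"))
print("Turns played:", len(results.get("turns", [])))

for player in results.get("players", []):
    usage = player.get("token_usage", {})
    print(
        player.get("name", "?"),
        "prompt:", usage.get("prompt_tokens", 0),
        "completion:", usage.get("completion_tokens", 0),
        "total:", usage.get("total_tokens", 0),
    )
```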
Thanks to Michael Fogleman for providing the Scrabble Tournament Word List verification logic and data.
Contributions are welcome! Feel free to submit an issue or a documented PR.
Todo List
- Better validation of various models and providers
- Ranking and List
- Organizing codebase

