A benchmark for evaluating Large Language Models through the game of Bananagrams. LLMs must build valid crossword-style boards using structured output formats, demonstrating spatial reasoning, constraint satisfaction, and multi-turn strategic decision-making.
- Setup: Each LLM receives a starting hand of tiles (21 for 1-4 players)
- Turn Loop: On each turn (see the sketch after this list):
  - The LLM receives its current hand, the game state, and feedback from the previous turn
  - The LLM generates a board specification
  - The board is validated against structure rules, grid conflicts, and the dictionary
  - Auto-PEEL: If the board is valid and uses all tiles, everyone draws one tile
  - Auto-BANANAS: If the bunch is empty and the board is valid, the player wins!
- Actions: Players can use `DUMP X` to exchange a difficult tile for three new ones
- End: The first LLM to build a valid board using all of its tiles when the bunch is empty wins the game!
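The loop can be sketched roughly as follows. This is a simplified illustration of the rules above, not the project's actual code; `generate_board`, `validate`, and the placeholder tile pool are hypothetical stand-ins for the LLM call and the board verifier.

```python
# Rough sketch of the turn loop described above (hypothetical structure,
# not the actual implementation).
import random
from typing import Callable

def play_game(players: list[str],
              generate_board: Callable,  # LLM call: (hand, feedback) -> board spec
              validate: Callable,        # verifier: (board, hand) -> (valid, uses_all, errors)
              max_turns: int = 500,
              seed: int = 42) -> str | None:
    rng = random.Random(seed)
    bunch = list("ABANANAGRAMS" * 12)    # placeholder tile pool, not the real distribution
    rng.shuffle(bunch)

    hands = {p: [bunch.pop() for _ in range(21)] for p in players}  # starting hands
    feedback = {p: "" for p in players}

    for _ in range(max_turns):
        for p in players:
            board = generate_board(hands[p], feedback[p])
            valid, uses_all_tiles, errors = validate(board, hands[p])

            if valid and uses_all_tiles:
                if not bunch:
                    return p                     # Auto-BANANAS: bunch empty -> win
                for q in players:                # Auto-PEEL: everyone draws one tile
                    if bunch:
                        hands[q].append(bunch.pop())
            feedback[p] = "\n".join(errors)      # fed back to the LLM next turn
    return None                                  # no winner within max_turns
```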
- Python 3.13+
- uv package manager
```bash
# Clone the repository
git clone https://github.com/yourusername/banana-bench.git
cd banana-bench

# Install dependencies with uv
uv sync

# Set up your LLM API key (for OpenAI, Anthropic, etc.)
export OPENAI_API_KEY="your-api-key"
```

For a complete example configuration, run output, and visualization, see the /example directory.
Create a YAML config file to customize your benchmark:
Example:

```yaml
max_turns: 500
seed: 42
players:
  - model: gpt-5.2-2025-12-11
    name: "GPT-5.2"
    temperature: 1
  - model: gpt-5.2-2025-12-11
    name: "GPT-5.2 - Medium Reasoning"
    temperature: 1
    reasoning_effort: medium
```

Key Features:
- Each player can have a different model, temperature, and max_tokens
- Pass provider-specific kwargs (like Claude's `thinking` parameter)
- Mix and match any LiteLLM-supported parameters (see the sketch after this list)
- Optional custom names
- The number of players is automatically determined by the players list
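As a rough illustration of how per-player entries can map onto LiteLLM calls, the sketch below loads a config and forwards every extra key as a keyword argument. This is not the project's actual code; `litellm.completion` and `yaml.safe_load` are real APIs, but the prompt text is made up, and whether a given extra key (e.g. `reasoning_effort` or `thinking`) is accepted depends on LiteLLM's support for that provider.

```python
# Sketch: forwarding per-player config entries to LiteLLM (illustrative only,
# not the project's implementation).
import yaml
import litellm

with open("example_config.yaml") as f:
    config = yaml.safe_load(f)

for player in config["players"]:
    # Anything beyond "model" and "name" is treated as a LiteLLM kwarg,
    # e.g. temperature, reasoning_effort, or Claude's `thinking` parameter.
    extra = {k: v for k, v in player.items() if k not in ("model", "name")}
    response = litellm.completion(
        model=player["model"],
        messages=[{"role": "user", "content": "Spell valid words using these tiles: B A N A N A S"}],
        **extra,
    )
    print(player.get("name", player["model"]), response.choices[0].message.content)
```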
```bash
# Run a benchmark with the example config
uv run python -m src.main example_config.yaml

# Run with verbose output to see the game play out
uv run python -m src.main example_config.yaml --verbose

# Save results to a specific location
uv run python -m src.main example_config.yaml --output results/my_run.json
```

If a benchmark is interrupted (API quota limits, crashes, etc.), you can resume from where it left off:
```bash
# Resume from a saved result file
uv run python -m src.main --resume interrupted_run.json --verbose

# Resume and save to a specific output file
uv run python -m src.main --resume interrupted_run.json --output final_run.json
```

Banana-Bench includes an interactive HTML visualizer that lets you watch games play out turn-by-turn with animations and player insights.
Generate during benchmark run:

```bash
# Run benchmark and automatically create visualizer
uv run python -m src.main configs/example.yaml --visualize
```

Generate from existing results:
```bash
# Convert any results JSON to an interactive visualizer
uv run python -m src.visualize game.json

# Specify custom output location
uv run python -m src.visualize game.json --output my_viz.html
```

The verifier checks boards through multiple stages:
- Parsing: Board format must be correct
- Structure: Letter matches, perpendicularity, index bounds
- Grid: No overlapping letter conflicts
- Words: All words must be in the TWL dictionary
- Tiles: Must use exactly the tiles in hand

Cascading Errors: The system filters downstream errors to show only root causes (see the sketch after this list):
- Parsing errors hide all downstream validation errors
- Structural errors hide grid conflicts and accidental words
- Errors are capped at 5, each with an actionable tip
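A minimal, self-contained sketch of the root-cause filtering idea is shown below, assuming each error is tagged with the stage that produced it. The stage names follow the list above, but the function and data shapes are hypothetical, not the real verifier.

```python
# Sketch of root-cause error filtering (illustrative, not the actual verifier).
# Stages are ordered; once a stage produces errors, later stages' errors are
# hidden because they are usually downstream consequences.
STAGES = ["parsing", "structure", "grid", "words", "tiles"]
MAX_ERRORS = 5

def filter_root_causes(errors: list[tuple[str, str]]) -> list[str]:
    """errors is a list of (stage, message) pairs collected across all stages."""
    for stage in STAGES:
        stage_errors = [msg for s, msg in errors if s == stage]
        if stage_errors:
            return stage_errors[:MAX_ERRORS]  # report only the earliest failing stage
    return []                                 # board passed every stage

# Example: a structural error suppresses the downstream dictionary complaint.
errors = [
    ("structure", "Word 2 does not intersect word 1 perpendicularly"),
    ("words", "'QZX' is not in the TWL dictionary"),
]
print(filter_root_causes(errors))
# -> ["Word 2 does not intersect word 1 perpendicularly"]
```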
Benchmark results are saved as JSON files containing the following (a loading sketch appears after the list):
- Full configuration
- Turn-by-turn history with validations
- Complete conversation history for each player
- Final game state and outcome
- Timing information
- Per-turn and total token counts (prompt, completion, and total tokens)
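A small sketch of inspecting one of these result files is shown below. The field names (`winner`, `turns`, `players`, `token_usage`, ...) are illustrative assumptions, not the documented schema; check an actual output file for the real keys.

```python
# Sketch: summarizing a saved result file (field names are illustrative
# assumptions; inspect a real results JSON for the actual schema).
import json

with open("results/my_run.json") as f:
    results = json.load(f)

print("Winner:", results.get("winner"))
print("Turns played:", len(results.get("turns", [])))

for player in results.get("players", []):
    usage = player.get("token_usage", {})
    print(
        player.get("name", "?"),
        "prompt:", usage.get("prompt_tokens", 0),
        "completion:", usage.get("completion_tokens", 0),
        "total:", usage.get("total_tokens", 0),
    )
```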
Thanks to Michael Fogleman for providing the Scrabble Tournament Word List verification logic and data.
Contributions are welcome! Feel free to submit an issue or a documented PR.
Todo List
- Better validation of various models and providers
- Ranking and List
- Organizing codebase

