GBA Eval

We give frontier AI coding agents 24 hours to write a complete software GBA emulator with WebAssembly support. We grade the emulators they produce against Mesen2, one of the most accurate software GBA emulators available.

GBA Eval is analogous to the work Mechanize does with top AI labs when we make environments for evaluating and training frontier LLMs, and we hope it gives a concrete sense of what we mean when we talk about environments and grading. If designing and building high-quality evaluations sounds like something you'd want to do, we're hiring software engineers.

See the attempts

The nine emulators produced by the May 2026 leaderboard lineup — their full commit-by-commit autosave history and a rendered chat-log transcript for each — are published at mechanize-work/gba-eval-attempts. Each model has its own subrepo so you can clone just the one you're interested in.

Quickstart

Grade a wasm, or run your own agent inside the benchmark's container. Docker is the only required dep.

git submodule update --init --recursive
git lfs install && git lfs pull

./quickstart/smoke-test.sh                                          # 30s preflight
./quickstart/grade.sh candidates/gba-core/gba_core_shim.wasm baseline

See quickstart/README.md for the full walkthrough — running your own agent in the container, grading a wasm your agent produced with --from-container, and the --native path that skips docker.

How grading works

# One-time: precompute reference frames + per-replay thresholds.
# Writes corpus/reference-cache/*.refcache (git-LFS tracked).
cargo run -p grader --release -- --precompute corpus/

# Per candidate: grade wasm → per-testcase JSON + summary.json
cargo run -p grader --release -- candidate.wasm corpus/ results/<candidate>/

# Custom reference: any wasm implementing the ABI works. Default is the
# bundled Mesen2 build at reference/mesen.wasm.
cargo run -p grader --release -- --reference my-ref.wasm candidate.wasm corpus/ results/<candidate>/

Scoring: three sections

The overall score is the weighted sum of three independent sections. Default weights, configurable in corpus/grader.yaml:

Section	Weight	What it measures
Gameplay Replays	60%	Real gameplay with button presses — end-to-end behavior under input
Procedural Tests	20%	CPU, memory, timers, DMA — deterministic self-checking ROMs
Audio	20%	Per-frame log-mel spectral distance vs Mesen2

overall = 0.60 × replay + 0.20 × procedural + 0.20 × audio

Building from source

The bundled reference/mesen.wasm is enough to grade. To rebuild it from source (Mesen2 fork at third_party/mesen):

git submodule update --init --recursive
reference/build-mesen.sh wasm      # → reference/mesen.wasm
cargo build --workspace

License

This repository is multi-licensed.

Path	License	Why
`spec/`, `corpus/` (non-ROM), `candidates/`, `harness/`, `quickstart/`, repo root	MIT	Our original work
`reference/build-mesen.sh`, `reference/mesen_step.cpp`	GPL-3.0	Build glue + shim for Mesen2
`third_party/mesen/`	GPL-3.0	Upstream Mesen2 (submodule)
`corpus/roms/`	Per ROM	Homebrew & test ROMs under upstream licenses
`reference/mesen.wasm`	GPL-3.0	Compiled from Mesen2
`spec/gba_bios_stub.bin`	MIT	Clean-room stub, not a Nintendo dump

Refer to LEGAL.md for more details.

For copyright, trademark, content-removal, or DMCA matters, contact stephen@mechanize.work — see CONTACT.md.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
candidates		candidates
corpus		corpus
harness		harness
quickstart		quickstart
reference		reference
spec		spec
.dockerignore		.dockerignore
.gitattributes		.gitattributes
.gitignore		.gitignore
.gitmodules		.gitmodules
CONTACT.md		CONTACT.md
Cargo.toml		Cargo.toml
LEGAL.md		LEGAL.md
LICENSE		LICENSE
LICENSE-GPL3		LICENSE-GPL3
LICENSE-MIT		LICENSE-MIT
NOTICE		NOTICE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

GBA Eval

See the attempts

Quickstart

How grading works

Scoring: three sections

Building from source

License

About

Licenses found

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

GBA Eval

See the attempts

Quickstart

How grading works

Scoring: three sections

Building from source

License

About

Resources

License

Licenses found

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages