Emoji-Bench is a fixed 100-example benchmark for testing whether language models can recover from an incorrect prefilled derivation step when prompted only with Please continue.
The public repo includes the benchmark dataset and the scripts needed to run models, score predictions deterministically, and generate local final-answer plots. It does not include our model outputs, leaderboard result artifacts, LLM-as-judge artifacts, or dataset-generation code.
The fixed benchmark input lives at:
artifacts/emoji-bench-dataset-100/
├── test.jsonl
├── manifest.json
└── README.md
Each test.jsonl row contains a three-turn continuation task:
turn_1_user: the formal emoji system, expression, and step format.turn_1_assistant_prefill: a partial assistant derivation ending on an injected error.- Turn 2 is supplied by the evaluator, usually
Please continue.
Scoring compares the model's extracted Final Output: against ground_truth_final_output.
Requires Python >=3.11.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txtSet API keys for the providers you plan to run:
export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
export GEMINI_API_KEY=...
export MISTRAL_API_KEY=...Gemini and Mistral use plain HTTP from the standard library. OpenAI and Anthropic use their official SDKs.
Run the default benchmark:
./run.sh artifacts/emoji-bench-dataset-100 -- --max-concurrent 8run.sh will:
- Run
scripts/evaluate_continuation.pyfor each configured model. - Write predictions under
artifacts/evals/. - Score each completed cell with
scripts/score_continuation.py. - Generate final-answer plots with
scripts/plot_final_answer.py.
Generated outputs are local artifacts and are intentionally not checked in.
python scripts/evaluate_continuation.py \
artifacts/emoji-bench-dataset-100 \
--model gpt-5.4-mini-reasoning-xhigh \
--output-dir artifacts/evals/gpt-5.4-mini-reasoning-xhigh \
--max-concurrent 8
python scripts/score_continuation.py \
artifacts/evals/gpt-5.4-mini-reasoning-xhighScore an existing prediction directory:
python scripts/score_continuation.py artifacts/evals/<run-dir>This writes:
scores.jsonl
score_summary.json
The headline metric is:
| Metric | Meaning |
|---|---|
final_answer_correct_rate |
Extracted Final Output: equals ground_truth_final_output |
The summary also includes regex diagnostic buckets such as detect_recover, silent_recovery, blind_wrong_branch, and extraction_failed.
Generate plots from locally scored eval directories:
python scripts/plot_final_answer.pyPlots are written to:
artifacts/plots/
artifacts/emoji-bench-dataset-100/ fixed benchmark dataset
emoji_bench/eval/ run paths and shared runner
emoji_bench/providers/ provider request plumbing
emoji_bench/scoring/ deterministic final-answer scoring
scripts/evaluate_continuation.py run one model
scripts/score_continuation.py score predictions
scripts/plot_final_answer.py plot local score summaries
run.sh batch runner
MIT. See LICENSE.