This repository contains the Lost in Translation (LiT) Benchmark, introduced in the paper “Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss.”
lit-benchmark/
├── data/ # Benchmark splits and convenience subset files
├── annotations/ # Raw category annotation views
├── website/ # GitHub Pages website source
└── src/ # Runtime, judge pipeline, and table builders
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
To rerun translation or judging, set:
export OPENROUTER_API_KEY=...
Example multi-hop run:
python3 src/main.py \
--model deepseek/deepseek-v3.2-exp \
--dataset_name extended \
--language_sequence japanese korean chinese russian \
--judge_model x-ai/grok-4.1-fast \
--exp_name seq
By default, generated traces are written to artifacts/traces/ and judge outputs are written to artifacts/scores/.
Those runtime artifacts are not tracked in Git; use the Hugging Face dataset for the released trace and score files.
The runtime supports these dataset names: extended, lit, robustness, abstracts, pragmatics, informal.
For batch runs:
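One way to approximate a batch run, assuming nothing beyond the flags shown in the multi-hop example, is to invoke src/main.py once per dataset name. The helpers below (`build_commands`, `run_batch`) and the `batch_<dataset>` experiment names are hypothetical and not part of the repository; treat this as a sketch rather than the project's own batch tooling.

```python
import shlex
import subprocess
import sys

# Dataset names listed in this README as supported by the runtime.
DATASETS = ["extended", "lit", "robustness", "abstracts", "pragmatics", "informal"]


def build_commands(model, judge_model, languages, datasets=DATASETS):
    """Return one src/main.py argument list per dataset (hypothetical helper)."""
    commands = []
    for name in datasets:
        commands.append([
            sys.executable, "src/main.py",
            "--model", model,
            "--dataset_name", name,
            "--language_sequence", *languages,
            "--judge_model", judge_model,
            # Assumption: a distinct experiment name per dataset keeps outputs apart.
            "--exp_name", f"batch_{name}",
        ])
    return commands


def run_batch(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the batch at the first failing run.
            subprocess.run(cmd, check=True)


commands = build_commands(
    model="deepseek/deepseek-v3.2-exp",
    judge_model="x-ai/grok-4.1-fast",
    languages=["japanese", "korean", "chinese", "russian"],
)
run_batch(commands)  # dry run: prints the commands without calling any API
```

Calling `run_batch(commands, dry_run=False)` would actually launch the runs, which requires OPENROUTER_API_KEY to be set as described above.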
Primary benchmark files:
data/lit.jsonl
data/extended.jsonl
data/robustness.jsonl
Convenience subset files:
data/abstracts.jsonl
data/pragmatics.jsonl
data/informal.jsonl
Raw annotation views:
annotations/abstracts.jsonl
annotations/pragmatics.jsonl
annotations/robustness.jsonl
extended.jsonl is the full 260-example release:
40 abstracts items
120 pragmatics items
40 informal items
60 robustness items
lit.jsonl contains the 200 non-robustness evaluation examples: abstracts + pragmatics + informal.
lit.jsonl and extended.jsonl intentionally keep only id, sentence, and group.
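Since both files carry only id, sentence, and group, a quick sanity check is to load one and count examples per group. The sketch below assumes standard JSON Lines (one JSON object per line) and that the group values correspond to the subset names; `load_jsonl` and `group_counts` are illustrative helpers, not functions from src/.

```python
import json
from collections import Counter
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def group_counts(records):
    """Count examples per value of the `group` field."""
    return Counter(record["group"] for record in records)


if __name__ == "__main__":
    path = Path("data/extended.jsonl")
    if path.exists():
        records = load_jsonl(path)
        # Prints the total number of examples and the per-group breakdown.
        print(len(records), dict(group_counts(records)))
```

Run from the repository root, this should report totals that can be compared against the per-subset counts listed above.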
Subset-specific metadata stays in the convenience files and raw annotation views:
data/abstracts.jsonl includes the abstracts category label
data/pragmatics.jsonl includes the pragmatics partition metadata
data/robustness.jsonl includes the robustness category metadata
annotations/*.jsonl preserve the released annotation views keyed by the same sample IDs
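Because the metadata files share sample IDs with lit.jsonl and extended.jsonl, subset metadata can be attached to the core records with a simple join. The sketch below assumes the metadata records expose the ID under the same `id` key; the `category` field and the sample values are made up for illustration, and `attach_metadata` is a hypothetical helper.

```python
def attach_metadata(core_records, metadata_records):
    """Merge extra fields from metadata_records into core records sharing an id."""
    by_id = {record["id"]: record for record in metadata_records}
    merged = []
    for record in core_records:
        # Records without matching metadata are kept unchanged.
        extra = by_id.get(record["id"], {})
        # Core fields win on key collisions, so id/sentence/group stay as released.
        merged.append({**extra, **record})
    return merged


# Illustrative in-memory records; real ones would come from the JSONL files.
core = [
    {"id": "abs-001", "sentence": "Example sentence.", "group": "abstracts"},
    {"id": "inf-001", "sentence": "Another example.", "group": "informal"},
]
metadata = [{"id": "abs-001", "category": "example-label"}]

for row in attach_metadata(core, metadata):
    print(row)
```

Letting the core fields take precedence is a design choice that keeps the benchmark text authoritative if a metadata file happens to repeat a field.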
See LICENSE.