Code for evals measuring frontier model capabilities.
- PaperBench: End-to-end replication of state-of-the-art AI papers. Paper | Blog
- SWE-Lancer: Real freelance software engineering tasks with end-to-end tests. Paper | Blog
We manage environments with `uv`. Install `uv` once, then run `uv sync` (or `uv pip install -r ...`) inside the project of interest to create its virtual environment from the checked-in `uv.lock`.
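For example, a first-time setup might look like the sketch below; it assumes `pip` is available to install `uv` (pipx or the standalone installer work too) and uses PaperBench as the example project:

```bash
# One-time install of uv (pipx or the standalone installer also work).
pip install uv

# Create the project's virtual environment from its checked-in uv.lock.
cd project/paperbench
uv sync
```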
Each eval directory documents how to reproduce runs, configure models, and interpret results. Start with the suite's `README.md`, then consult any scripts under `scripts/` or `runtime_*/` directories for orchestration details. When in doubt:
- `cd` into the eval directory.
- `uv sync` to install dependencies.
- Follow the local instructions in the eval's `README.md` (see the example below).
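Concretely, that flow might look like the following for SWE-Lancer; the `scripts/` and `runtime_*/` paths are the conventional locations mentioned above, and exact contents vary per eval:

```bash
cd project/swelancer
uv sync

# Read the eval's own instructions, then look for orchestration entry points.
cat README.md
ls scripts/ runtime_*/ 2>/dev/null || true
```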
```
.
├── pyproject.toml   # Shared tooling configuration (Ruff, Black, etc.)
└── project/
    ├── common/      # Shared libraries
    ├── paperbench/  # PaperBench eval
    └── swelancer/   # SWE-Lancer eval
```
Each eval directory is its own isolated project with a `README.md`, `pyproject.toml`, and `uv.lock`.
- Create or activate the environment for the project you are working on with `uv`. Example for PaperBench: `cd project/paperbench`, then `uv sync`, then `uv run pytest`.
- Code style and linting use Ruff (with autofix profiles in `pyproject.toml` and `project/common/tooling/ruff_autofix_minimal.toml`) and Black. Run `uv run ruff check --fix` or use the provided Poe/make tasks where available (see the sketch below).
- Shared utilities live under `project/common`; changes there may affect multiple evals. Bump the relevant editable dependencies if you create new shared subpackages.
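A typical lint-and-format pass might look like the sketch below. Paths are relative to the repository root; whether the minimal autofix profile must be passed explicitly via `--config` (second command) depends on the Poe/make task definitions, and the Black step assumes Black is available in the project's environment:

```bash
# Lint with autofix using the project's Ruff configuration.
uv run ruff check --fix .

# Or run Ruff against the minimal autofix profile explicitly (assumed invocation).
uv run ruff check --fix --config project/common/tooling/ruff_autofix_minimal.toml .

# Format with Black.
uv run black .
```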