An evaluation pipeline for autoregressive language models using direct probability measurement for minimal pairs.
This pipeline evaluates language models by reading out the conditional log probabilities for minimal pairs of sentences. In each pair, one sentence is considered correct, while the other contains a minimal violation. The model is expected to assign a lower probability to the incorrect sentence.
Given a sufficient number of test items targeting specific linguistic phenomena, the accuracy of the model's probability assignments indicates how well it has acquired these phenomena. Assessing models at different training checkpoints additionally allows for analyzing the learning dynamics of selected phenomena.
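In essence, scoring a minimal pair amounts to summing the conditional token log probabilities of each sentence and comparing the totals. A minimal sketch of this idea, assuming `transformers` and `torch` are installed (the model name is just the small example model used elsewhere in this README; `run_eval.py` may differ in details such as batching or length normalization):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-14m"  # example model from this README
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Sum of conditional token log probabilities log p(w_i | w_<i)."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Position i predicts token i+1, so shift logits and targets by one.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    return log_probs[torch.arange(targets.size(0)), targets].sum().item()

good = "The keys to the cabinet are on the table."
bad = "The keys to the cabinet is on the table."
print(sentence_logprob(good) > sentence_logprob(bad))  # expected: True
```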
| AI2-OLMo | EleutherAI-Pythia |
|---|---|
| Huggingface Suite | Huggingface Suite |
| GitHub | GitHub |
| Technical Report | Technical Report |
| Website | Website |
Both models were released in several parameter sizes and at different intermediate training checkpoints (revisions). This makes it possible to test for emerging capabilities across both parameter scale and training time.
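A sweep over both dimensions can be scripted on top of the CLI described below; a hypothetical sketch (the model sizes and step revisions are examples, Pythia publishes its checkpoints as branches named `step<N>`):

```python
import subprocess

# Hypothetical sweep over parameter sizes and training checkpoints.
models = ["EleutherAI/pythia-14m", "EleutherAI/pythia-160m", "EleutherAI/pythia-410m"]
revisions = ["step1000", "step10000", "step143000"]

for model in models:
    for revision in revisions:
        # Invokes the pipeline's CLI (see usage below) once per (size, checkpoint) cell.
        subprocess.run(
            ["python", "run_eval.py", "dtfit", model, "--revision", revision],
            check=True,
        )
```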
- tested on Python 3.12.x, 3.11.x, 3.10.x
- requires GPU with CUDA >= 12.1 support (smaller models can run on CPU, but this is not recommended)
- recommended: use the uv package manager for a fast setup

```bash
uv venv

# macOS / Linux
source .venv/bin/activate
# Windows
.venv\Scripts\activate

uv pip install -r requirements.txt
```

Alternatively, set up the environment with conda:

```bash
conda env create -f environment.yml
conda activate pipe
```

An example dataset for testing can be found in the data folder.
Additional datasets can easily be integrated and tested. Please refer to the corresponding README.md in the data folder for more details.
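For illustration only: minimal-pair datasets commonly store each item as a grammatical sentence next to its minimally violated counterpart, e.g. in the JSONL layout used by BLiMP (the actual schema expected here is documented in the data folder's README):

```json
{"sentence_good": "The keys to the cabinet are on the table.", "sentence_bad": "The keys to the cabinet is on the table."}
```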
Run the Python script and specify one or more (space-separated) datasets, the model, and optionally a --revision (defaults to main, i.e. the final checkpoint, for all models).
To access different intermediate training checkpoints (revisions), check the model's Huggingface suite, select Files and versions, and choose the corresponding branch, for instance for Pythia or OLMo.
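The available revisions can also be listed programmatically; a small sketch using the huggingface_hub library (the repo id is an example):

```python
from huggingface_hub import list_repo_refs

# Each training checkpoint is published as a branch of the model repo.
refs = list_repo_refs("EleutherAI/pythia-14m")  # example repo id
print([branch.name for branch in refs.branches])  # e.g. ["main", "step2000", ...]
```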
```bash
# Template
python run_eval.py {dataset} [dataset2] ... {model} {optional: --revision}
```

- Run a single dataset, here dtfit, on the main checkpoint (no revision specified):

  ```bash
  python run_eval.py dtfit EleutherAI/pythia-14m
  ```

- Run the same dataset with a specified revision / checkpoint:

  ```bash
  python run_eval.py dtfit EleutherAI/pythia-14m --revision step2000
  ```
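Under the hood, a checkpoint name like step2000 presumably maps onto the revision argument of from_pretrained in transformers; a sketch of loading such a checkpoint directly:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pass the branch name as `revision` to load an intermediate checkpoint.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-14m", revision="step2000")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-14m", revision="step2000")
```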
- Performance
  - fix batch support
- Optional
  - add support for commercial APIs as an upper bound
  - extract & analyze contextual word embeddings
  - test other open models with checkpoints?
    - togethercomputer/RedPajama-INCITE-7B-Base
    - TinyLlama/TinyLlama-1.1B
    - Zyphra/Zamba-7b
  - ablation models?
    - checkpoints available for different common pretraining datasets
- Maximilian Krupop