An evaluation pipeline for autoregressive language models using direct probability measurement for minimal pairs.
This pipeline evaluates language models by reading out the conditional log probabilities for minimal pairs of sentences. In each pair, one sentence is considered correct, while the other contains a minimal violation. The model is expected to assign a lower probability to the incorrect sentence.
Given a sufficient number of test items targeting specific linguistic phenomena, the accuracy of the model's probability assignments indicates how well it has acquired these phenomena. Assessing models at different training checkpoints additionally allows for analyzing the learning dynamics of selected phenomena.
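In essence, scoring a minimal pair amounts to summing the conditional token log probabilities of each sentence and comparing the totals. A minimal sketch of this idea, assuming `transformers` and `torch` are installed (the model name is just the small example model used elsewhere in this README; `run_eval.py` may differ in details such as batching or length normalization):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-14m"  # example model from this README
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Sum of conditional token log probabilities log p(w_i | w_<i)."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Position i predicts token i+1, so shift logits and targets by one.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    return log_probs[torch.arange(targets.size(0)), targets].sum().item()

good = "The keys to the cabinet are on the table."
bad = "The keys to the cabinet is on the table."
print(sentence_logprob(good) > sentence_logprob(bad))  # expected: True
```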
| AI2-OLMo | EleutherAI-Pythia |
|---|---|
| Huggingface Suite | Huggingface Suite |
| GitHub | GitHub |
| Technical Report | Technical Report |
| Website | Website |
Both models were released in several parameter sizes and at different intermediate training checkpoints (revisions). This makes it possible to test for emerging capabilities across both parameter scale and training time.
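A sweep over both dimensions can be scripted on top of the CLI described below; a hypothetical sketch (the model sizes and step revisions are examples, Pythia publishes its checkpoints as branches named `step<N>`):

```python
import subprocess

# Hypothetical sweep over parameter sizes and training checkpoints.
models = ["EleutherAI/pythia-14m", "EleutherAI/pythia-160m", "EleutherAI/pythia-410m"]
revisions = ["step1000", "step10000", "step143000"]

for model in models:
    for revision in revisions:
        # Invokes the pipeline's CLI (see usage below) once per (size, checkpoint) cell.
        subprocess.run(
            ["python", "run_eval.py", "dtfit", model, "--revision", revision],
            check=True,
        )
```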
- tested on Python 3.12.x, 3.11.x, 3.10.x
- requires GPU with CUDA >= 12.1 support (smaller models can run on CPU, but this is not recommended)
- recommended: use the uv package manager for a fast setup

```bash
uv venv

# macOS / Linux
source .venv/bin/activate
# Windows
.venv\Scripts\activate

uv pip install -r requirements.txt
```

Alternatively, set up the environment with conda:

```bash
conda env create -f environment.yml
conda activate pipe
```

An example dataset for testing can be found in the data folder.
Additional datasets can easily be integrated and tested. Please refer to the corresponding README.md in the data folder for more details.
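For illustration only: minimal-pair datasets commonly store each item as a grammatical sentence next to its minimally violated counterpart, e.g. in the JSONL layout used by BLiMP (the actual schema expected here is documented in the data folder's README):

```json
{"sentence_good": "The keys to the cabinet are on the table.", "sentence_bad": "The keys to the cabinet is on the table."}
```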
Run the Python script and specify one or more (space-separated) datasets, the model, and optionally a --revision (defaults to main, i.e. the final checkpoint, for all models).
To access different intermediate training checkpoints (revisions), check the model's Huggingface suite, select Files and versions, and choose the corresponding branch, for instance for Pythia or OLMo.
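The available revisions can also be listed programmatically; a small sketch using the huggingface_hub library (the repo id is an example):

```python
from huggingface_hub import list_repo_refs

# Each training checkpoint is published as a branch of the model repo.
refs = list_repo_refs("EleutherAI/pythia-14m")  # example repo id
print([branch.name for branch in refs.branches])  # e.g. ["main", "step2000", ...]
```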
```bash
# Template
python run_eval.py {dataset} [dataset2] ... {model} {optional: --revision}
```

- Run a single dataset, here dtfit, on the main checkpoint (no revision specified):

  ```bash
  python run_eval.py dtfit EleutherAI/pythia-14m
  ```

- Run the same dataset with a specified revision / checkpoint:

  ```bash
  python run_eval.py dtfit EleutherAI/pythia-14m --revision step2000
  ```
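Under the hood, a checkpoint name like step2000 presumably maps onto the revision argument of from_pretrained in transformers; a sketch of loading such a checkpoint directly:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pass the branch name as `revision` to load an intermediate checkpoint.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-14m", revision="step2000")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-14m", revision="step2000")
```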
- Performance
  - fix batch support
- Optional
  - add support for commercial APIs as an upper bound
  - extract & analyze contextual word embeddings
  - test other open models with checkpoints?
    - togethercomputer/RedPajama-INCITE-7B-Base
    - TinyLlama/TinyLlama-1.1B
    - Zyphra/Zamba-7b
  - ablation models?
    - checkpoints available for different common pretraining datasets
- Maximilian Krupop