Official PyTorch implementation and models for the paper "Diffusion Beats Autoregressive in Data-Constrained Settings". We find that diffusion models are significantly more data-efficient than standard left-to-right autoregressive models, due to their ability to learn from different token orderings.


Diffusion Beats Autoregressive in Data-Constrained Settings

Diffusion-scaling

arXiv Website

This is the official implementation of our paper Diffusion Beats Autoregressive in Data-Constrained Settings by Mihir Prabhudesai*, Mengning Wu*, Amir Zadeh, Katerina Fragkiadaki, and Deepak Pathak.

Abstract

Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored. In this paper, we systematically study masked diffusion models in data-constrained settings, where training involves repeated passes over limited data, and find that they significantly outperform AR models when compute is abundant but data is scarce. Diffusion models make better use of repeated data, achieving lower validation loss and superior downstream performance. We find new scaling laws for diffusion models and derive a closed-form expression for the critical compute threshold at which diffusion begins to outperform AR. Finally, we explain why diffusion models excel in this regime: their randomized masking objective implicitly trains over a rich distribution of token orderings, acting as an implicit data augmentation that AR's fixed left-to-right factorization lacks. Our results suggest that when data, not compute, is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm.
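
To make the intuition behind the randomized masking objective concrete, here is a minimal sketch (not the repository's actual Megatron-DeepSpeed training code) contrasting a fixed left-to-right AR loss with a masked-diffusion loss. The `model`, `mask_id`, and the 1/ratio weighting are illustrative assumptions following common masked-diffusion formulations.

import torch
import torch.nn.functional as F

def ar_loss(model, tokens):
    # Fixed left-to-right factorization: predict token t+1 from tokens <= t.
    logits = model(tokens[:, :-1])                        # (B, T-1, V)
    return F.cross_entropy(logits.transpose(1, 2), tokens[:, 1:])

def masked_diffusion_loss(model, tokens, mask_id):
    # Sample a masking level per sequence and mask positions at random;
    # every draw exposes the model to a different conditioning pattern,
    # which is the implicit "ordering" augmentation described above.
    b, t = tokens.shape
    ratio = torch.rand(b, 1, device=tokens.device).clamp(min=1e-3)
    masked = torch.rand(b, t, device=tokens.device) < ratio
    noisy = torch.where(masked, torch.full_like(tokens, mask_id), tokens)
    logits = model(noisy)                                  # (B, T, V)
    loss = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")
    # Only masked positions contribute; the 1/ratio weight gives an
    # ELBO-style bound, as in common masked-diffusion objectives.
    return (loss * masked.float() / ratio).sum() / masked.sum().clamp(min=1)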

TODOS

Quick Start Instructions

Want to quickly reproduce our diffusion-beats-AR result?

This quickstart trains our best-found configurations for both diffusion and AR models on 100M unique tokens. The code has been tested on 8x H100 GPUs.

Follow these 4 simple steps:

Step 1: Install Dependencies

bash install.sh

NOTE: This script installs the paper's default environment via micromamba. If you want to use a different environment, prefer to step through the install script manually, or run into issues during installation, please refer to FULL_README.md.

Step 2: Download 100M Token Dataset

Download the 100M-token training dataset and the validation dataset from here and place them in the data directory. The script below downloads both and places them there automatically.

bash download.sh

Step 3: Run Training Scripts

# Train AR model
bash examples_scaling/quickstart/train_ar.sh

# Train diffusion model
bash examples_scaling/quickstart/train_mdm.sh

Step 4: Run Evaluation Scripts

# Evaluate AR model
bash examples_scaling/quickstart/eval_ar.sh

# Evaluate diffusion model
bash examples_scaling/quickstart/eval_mdm.sh

That's it! The scripts automatically use our best configurations for the 100M-token scenario to train both AR and diffusion models with optimal hyperparameters, demonstrating how diffusion outperforms AR in data-constrained settings.

Expected Result: Diffusion should achieve lower validation loss and better downstream performance than AR when trained on limited data.

Pretrained Checkpoints: We provide pretrained checkpoints for the best AR and diffusion configurations trained on 100M tokens on Hugging Face.
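
If you prefer to fetch the checkpoints programmatically, a minimal sketch using huggingface_hub follows; the repo_id below is a placeholder, not the actual model id (use the one linked above).

from huggingface_hub import snapshot_download

# Placeholder repo id for illustration only -- replace with the actual
# Hugging Face model id linked in this README.
local_dir = snapshot_download(
    repo_id="<org>/<checkpoint-repo>",
    local_dir="checkpoints/quickstart",
)
print(f"Checkpoints downloaded to {local_dir}")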

Full README

For comprehensive instructions on training models with different configurations, data preprocessing, and advanced experimental setups, please refer to FULL_README.md.

The full documentation covers:

  • Detailed installation for different architectures (x86_64, AArch64, B200s)
  • Complete data preprocessing workflows
  • All experimental configurations from the paper
  • Advanced training and evaluation options
  • Downstream task evaluation

Acknowledgments

This codebase is built upon several foundational repositories and implementations:

  • Base Framework: Built on Megatron-DeepSpeed as a detached fork
  • Implementation References: Incorporates modifications for data preprocessing and improvements from TurkuNLP's Megatron-DeepSpeed
  • Core Methodology: Implements Masked Diffusion Model architecture and evaluation based on SMDM research and implementation

Citation

If you find this work useful in your research, please cite:

@article{prabhudesai2025diffusion,
  title={Diffusion Beats Autoregressive in Data-Constrained Settings},
  author={Prabhudesai, Mihir and Wu, Mengning and Zadeh, Amir and Fragkiadaki, Katerina and Pathak, Deepak},
  journal={arXiv preprint arXiv:2507.15857},
  year={2025}
}
