Code and data of our paper:
Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction
Vaishnavh Nagarajan*1, Chen Henry Wu*2, Charles Ding2, Aditi Raghunathan2
1Google Research, 2Carnegie Mellon University
ICML 2025 (spotlight)
Paper | Data examples
Note: the paper includes experiments with both Gemma 2B and GPT-2/SEDD, but this repo contains only the GPT-2/SEDD code.
```
sibling-discovery       # code for Sibling Discovery
├── ntp                 # with next-token training
├── teacherless         # with teacherless training
└── diffusion           # with diffusion training
triangle-discovery      # code for Triangle Discovery
├── ...
circle-construction     # code for Circle Construction
├── ...
line-construction       # code for Line Construction
├── ...
simpletransformers      # helper code for Transformer models
```
We use simpletransformers to train and test the Transformer models (for NTP and teacherless training). To set up, please follow the installation instructions in simpletransformers/README.md.
For diffusion model training and inference, we use Score-Entropy-Discrete-Diffusion. We provide self-contained copies under {task}/diffusion/, so there is no need to clone that repo. Please follow the dependency installation instructions in their README.
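A typical environment setup might look like the sketch below. The environment name and Python version are assumptions, and the two READMEs above remain the authoritative instructions:

```shell
# Hypothetical setup sketch -- defer to simpletransformers/README.md and the
# SEDD README for the exact dependency lists.
conda create -n next-token python=3.10 -y
conda activate next-token
pip install -e ./simpletransformers  # install the bundled helper package in editable mode
```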
All experiments can be run on a single A6000 GPU. Batch sizes are tuned on this device.
We provide Jupyter notebooks to replicate the data generation process. Paths in the notebooks need to be adjusted to your local environment.
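If you prefer to run a notebook headlessly instead of through the Jupyter UI, nbconvert can execute all of its cells from the command line (assuming jupyter/nbconvert is installed in your environment); for example:

```shell
# Execute every cell of a notebook and write the outputs back into the same file.
jupyter nbconvert --to notebook --execute --inplace sibling-discovery/ntp/sibling.ipynb
```

The same invocation works for any of the notebooks listed below.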
To get data with hash-conditioning, run all blocks in:
sibling-discovery/ntp/sibling.ipynb
Here is an example of what the data would look like: sibling example
To get data without hash-conditioning, run all blocks in:
sibling-discovery/ntp/sibling_no_hash.ipynb
To get data with hash-conditioning, run all blocks in:
triangle-discovery/ntp/triangle.ipynb
Here is an example of what the data would look like: triangle example
To get data without hash-conditioning, run all blocks in:
triangle-discovery/ntp/triangle_no_hash.ipynb
To get both data with and without hash-conditioning, run all blocks in:
circle-construction/ntp/circle.ipynb
Here is an example of what the data would look like: circle example
To get both data with and without hash-conditioning, run all blocks in:
line-construction/ntp/line.ipynb
Here is an example of what the data would look like: line example
To run the experiments, replace {task} with sibling-discovery, triangle-discovery, circle-construction, or line-construction.
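As a sketch of the substitution, the four NTP working directories it produces are:

```shell
# Enumerate the NTP working directories generated by the {task} substitution.
for task in sibling-discovery triangle-discovery circle-construction line-construction; do
  echo "${task}/ntp"
done
```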
Working directory for NTP is {task}/ntp:
cd {task}/ntp
Run the training script:
bash run_train.sh
Run the evaluation script:
bash run_eval.sh
The evaluation script will print the scores for each saved checkpoint.
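Putting the NTP steps together for one concrete task, e.g. Sibling Discovery:

```shell
cd sibling-discovery/ntp   # substitute any of the four tasks here
bash run_train.sh          # train with next-token prediction
bash run_eval.sh           # prints scores for each saved checkpoint
```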
Working directory for teacherless training is {task}/teacherless:
cd {task}/teacherless
We provide Jupyter notebooks to preprocess the dataset for teacherless training. Adjust the paths in the notebook to your local environment, then run all blocks in:
{task}_hybrid.ipynb
Run the training script:
bash run_train.sh
Run the evaluation script:
bash run_eval.sh
The evaluation script will print the scores for each saved checkpoint.
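Concretely, for Circle Construction the teacherless pipeline runs the preprocessing notebook first, then training and evaluation. The notebook can be run in the Jupyter UI; the nbconvert invocation below is just one way to execute it headlessly, and the filename assumes the literal {task}_hybrid.ipynb substitution described above:

```shell
cd circle-construction/teacherless
# 1. Preprocess: run all blocks of the hybrid notebook
#    (filename follows the {task}_hybrid.ipynb pattern from this README).
jupyter nbconvert --to notebook --execute --inplace circle-construction_hybrid.ipynb
# 2. Train, then evaluate (scores printed per saved checkpoint).
bash run_train.sh
bash run_eval.sh
```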
Working directory for diffusion models is {task}/diffusion:
cd {task}/diffusion
Run the training script:
bash run_train.sh
Run the evaluation script:
bash run_eval.sh
The evaluation script will print the scores for each saved checkpoint.
If you find this code useful, please consider citing our paper:
```
@misc{nagarajan2025roll,
  title={Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction},
  author={Vaishnavh Nagarajan and Chen Henry Wu and Charles Ding and Aditi Raghunathan},
  year={2025},
  eprint={2504.15266},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```