ParallelBench: Understanding the Tradeoffs of Parallel Decoding in Diffusion LLMs

Wonjun Kang*1,5, Kevin Galim*1, Seunghyuk Oh*1, Minjae Lee1, Yuchen Zeng2,3, Shuibai Zhang2, Coleman Hooper4, Yuezhou Hu4, Hyung Il Koo1, Nam Ik Cho5, Kangwook Lee2,6

1FuriosaAI, 2UW-Madison, 3Microsoft Research, 4UC Berkeley, 5Seoul National University, 6KRAFTON AI

Project Page | arXiv

🚀 Overview

Diffusion LLMs (dLLMs) promise faster generation via parallel decoding. However, this speed often comes at the cost of quality, because parallel decoding ignores token dependencies, an issue that existing benchmarks do not sufficiently capture. To address this, we introduce ParallelBench, the first benchmark designed to rigorously test this trade-off through realistic tasks that humans and autoregressive (AR) LLMs solve easily, but that cause dLLMs to collapse as parallelism grows. We release ParallelBench to drive research towards truly efficient dLLMs that can overcome this challenge.

πŸ“ Abstract While most autoregressive LLMs are constrained to one-by-one decoding, diffusion LLMs (dLLMs) have attracted growing interest for their potential to dramatically accelerate inference through parallel decoding. Despite this promise, the conditional independence assumption in dLLMs causes parallel decoding to ignore token dependencies, inevitably degrading generation quality when these dependencies are strong. However, existing works largely overlook these inherent challenges, and evaluations on standard benchmarks (e.g., math and coding) are not sufficient to capture the quality degradation caused by parallel decoding. To address this gap, we first provide an information-theoretic analysis of parallel decoding. We then conduct case studies on analytically tractable synthetic list operations from both data distribution and decoding strategy perspectives, offering quantitative insights that highlight the fundamental limitations of parallel decoding. Building on these insights, we propose **ParallelBench**, the first benchmark specifically designed for dLLMs, featuring realistic tasks that are trivial for humans and autoregressive LLMs yet exceptionally challenging for dLLMs under parallel decoding. Using ParallelBench, we systematically analyze both dLLMs and autoregressive LLMs, revealing that: (i) dLLMs under parallel decoding can suffer dramatic quality degradation in real-world scenarios, and (ii) current parallel decoding strategies struggle to adapt their degree of parallelism based on task difficulty, thus failing to achieve meaningful speedup without compromising quality. Our findings underscore the pressing need for innovative decoding methods that can overcome the current speed-quality trade-off. We are releasing our benchmark to help accelerate the development of truly efficient dLLMs.

🌟 Features

  • Information-Theoretic Analysis: Proves that parallel decoding incurs fundamental error bounds when tokens depend on one another, showing that even a perfect model degrades as parallelism increases on tasks requiring strong token coordination.

  • Quantitative Case Studies: Analytically tractable synthetic list operations (Copy, Replace, Shuffle) with closed-form accuracy formulas demonstrate the fundamental limitations: for specific tasks, quality degradation under parallel decoding is inevitable (see the toy sketch after this list).

  • Realistic Benchmark Tasks: 17 tasks across Waiting Line, Text Writing, and Puzzles, all trivial for humans and AR LLMs, reveal severe quality degradation in dLLMs under parallel decoding in real-world scenarios.
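
The toy sketch below is our own illustration of the first two points, not code from the paper or this repository: even a model with perfect per-token marginals loses roughly half of its accuracy when two strongly dependent tokens are decoded in parallel, while sequential (conditional) decoding is always correct.

# Toy illustration (ours, not from the paper): parallel decoding fails on
# outputs with strong token dependencies even with a perfect model.
# Target: the two-token output is "A B" or "B A", each with probability 1/2.
# The per-token marginals are uniform, so sampling both tokens in parallel
# from the exact marginals produces the invalid outputs "A A" / "B B"
# about half of the time; sequential decoding never does.
import random

random.seed(0)
VALID = {("A", "B"), ("B", "A")}

def parallel_decode():
    # Each token is sampled independently from its (correct) marginal.
    return random.choice("AB"), random.choice("AB")

def sequential_decode():
    # The second token is sampled conditioned on the first,
    # respecting the dependency between the two tokens.
    first = random.choice("AB")
    second = "B" if first == "A" else "A"
    return first, second

n = 100_000
parallel_acc = sum(parallel_decode() in VALID for _ in range(n)) / n
sequential_acc = sum(sequential_decode() in VALID for _ in range(n)) / n
print(f"parallel decoding accuracy:   {parallel_acc:.3f}")  # ~0.5
print(f"sequential decoding accuracy: {sequential_acc:.3f}")  # 1.0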


βš™οΈ Setup

These steps will guide you through setting up the necessary environment and dependencies.

1. Prerequisites

  • Conda: For managing the environment.
  • NVIDIA GPU: CUDA >= 11.8.
  • Java Development Kit (JDK): Required only for grammar-based evaluation metrics.

2. Create Conda Environment

First, create and activate the conda environment. We use Python 3.10.

conda create -n parallelbench python=3.10 -y
conda activate parallelbench

3. Install Python Dependencies

We use uv for faster package installation. The following commands install PyTorch, all other required packages from requirements.txt, and (optionally) vLLM for the autoregressive LLM baselines.

# Install uv, a fast package installer
pip install uv

# Install core dependencies
uv pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu118
uv pip install -r requirements.txt
uv pip install vllm  # optional for LLM evaluation
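
As a quick sanity check (our suggestion, not part of the repository), you can confirm that the CUDA build of PyTorch was installed before proceeding:

import torch

print(torch.__version__)          # expected: 2.6.0 (cu118 build)
print(torch.version.cuda)         # expected: 11.8
print(torch.cuda.is_available())  # should be True on a machine with a working NVIDIA driver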

4. Install Java (Optional)

If you need to run the grammar-based evaluations, install the JDK via conda:

conda install -c conda-forge openjdk=17

⚡ Quickstart

Here's a simple example of how to load a model and run it on a ParallelBench task. For a more in-depth example, see the demo.py script.

📋 View Available Tasks
  • 🔄 Waiting Line
    • waiting_line/copy
    • waiting_line/insert_index
    • waiting_line/insert_random
    • waiting_line/remove_index
    • waiting_line/remove_random
    • waiting_line/replace_index
    • waiting_line/replace_random
    • waiting_line/reverse
    • waiting_line/shuffle
    • waiting_line/sort
  • ✍️ Text Writing
    • paraphrase_summarize/chatgpt-paraphrases
    • paraphrase_summarize/samsum
    • words_to_sentence/easy
    • words_to_sentence/medium
    • words_to_sentence/hard
  • 🧠 Puzzle
    • puzzle/latin_square_n4
    • puzzle/sudoku_n4_12

import torch
from transformers import AutoModel, AutoTokenizer
from dataset.parallel_bench import ParallelBench

# 1. Load the model and tokenizer
model = AutoModel.from_pretrained(
    "Dream-org/Dream-v0-Instruct-7B",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
).cuda()

tokenizer = AutoTokenizer.from_pretrained(
    "Dream-org/Dream-v0-Instruct-7B",
    trust_remote_code=True
)

# 2. Load a benchmark task and get a sample
task_name = "waiting_line/copy"
dataset = ParallelBench(task_name)
sample = dataset[0] # Get the first sample from the task

# 3. Prepare input from the benchmark sample
messages = sample["input"]["messages"]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# 4. Generate the model's output
generated_ids = model.diffusion_generate(input_ids, max_tokens=32)
response = tokenizer.decode(generated_ids[0][len(input_ids[0]):], skip_special_tokens=True)

# 5. Compare the model's output with the reference label
print(f"Task: {task_name}")
print(f"Prompt: {messages[-1]['content']}")
print(f"Reference Label: {sample['label']}")
print(f"Model Output:    {response}")

# To get the final score, run compute_metrics
metrics = dataset.compute_metrics([response], [sample["label"]])
print(f"Metrics: {metrics}")

πŸ› οΈ Create Your Own Tasks

You can easily generate custom tasks from YAML configuration files. For example, to create new copy and reverse tasks:

PYTHONPATH=. python dataset/parallel_bench/data/task.py --task test/copy_reverse/all

This command uses the configurations specified in dataset/parallel_bench/data/task_configs/.
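
Once generated, a custom task should be loadable in the same way as the built-in tasks. The task name below is a guess derived from the --task argument above and may not match the name the generator actually registers; check the generated files under dataset/parallel_bench/data/ for the exact identifier.

from dataset.parallel_bench import ParallelBench

# "test/copy_reverse/copy" is a hypothetical task name; the actual name
# depends on how the generator registers tasks from the YAML configs.
custom_task = ParallelBench("test/copy_reverse/copy")
sample = custom_task[0]
print(sample["input"]["messages"][-1]["content"])
print(sample["label"])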


🚀 Running Evaluations

🔑 Configuration

Before running the evaluations, you must export the necessary API keys as environment variables.

# For logging results
export WANDB_API_KEY="your_weights_and_biases_key"

# For commercial model APIs
export ANTHROPIC_API_KEY="your_anthropic_key"      # For Haiku
export INCEPTION_API_KEY="your_mercury_model_key"  # For Mercury

All experiments are launched using the run_all.py script. The general command structure is:

python run_all.py eval.py --device <gpu_ids> --cfg <path_to_config_file>

Main Benchmark Reproduction

This section covers the commands to reproduce the main benchmark results from our paper. The following commands run evaluation on two GPUs.

  • LLaDA 1.5:
    python run_all.py eval.py --device 0 1 --cfg cfg/paper/benchmark/llada_1_5_all_tasks_list.yaml
  • Dream:
    python run_all.py eval.py --device 0 1 --cfg cfg/paper/benchmark/dream_all_tasks_list.yaml
  • Diffucoder:
    python run_all.py eval.py --device 0 1 --cfg cfg/paper/benchmark/diffucoder_all_tasks_list.yaml
  • LLaDA 1.0:
    python run_all.py eval.py --device 0 1 --cfg cfg/paper/benchmark/llada_1_0_all_tasks_list.yaml

dLLM vs. Autoregressive LLM Comparison

This section includes the commands for the comparative analysis between dLLMs and strong autoregressive LLM baselines.

  • LLaDA 1.5:
    python run_all.py eval.py --device 0 1 --cfg cfg/paper/dllm_vs_llm/llada_1_5_all_tasks_list.yaml
  • Dream:
    python run_all.py eval.py --device 0 1 --cfg cfg/paper/dllm_vs_llm/dream_all_tasks_list.yaml
  • Diffucoder:
    python run_all.py eval.py --device 0 1 --cfg cfg/paper/dllm_vs_llm/diffucoder_all_tasks_list.yaml
  • LLaDA 1.0:
    python run_all.py eval.py --device 0 1 --cfg cfg/paper/dllm_vs_llm/llada_1_0_all_tasks_list.yaml
  • Mercury (requires a single GPU):
    python run_all.py eval.py --device 0 --cfg cfg/paper/dllm_vs_llm/mercury_all_tasks_list.yaml
  • Haiku (requires a single GPU):
    python run_all.py eval.py --device 0 --cfg cfg/paper/dllm_vs_llm/haiku_all_tasks_list.yaml
  • LLM Baselines (via vLLM):
    python run_all.py eval.py --device 0 1 --cfg cfg/paper/dllm_vs_llm/llm_all_tasks_list.yaml

📊 Results

All evaluation metrics and generated outputs are logged to Weights & Biases (wandb). Please ensure you have configured your API key and project settings.
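
If you want to post-process the logged results outside the wandb web UI, the public wandb API can pull them back into Python. This is a minimal sketch; the entity and project names are placeholders that depend on your own wandb configuration.

import wandb

api = wandb.Api()

# "your-entity/parallelbench" is a placeholder; point it at the wandb
# project your evaluation runs were logged to.
for run in api.runs("your-entity/parallelbench"):
    print(run.name)
    print("  config :", run.config)   # evaluation settings logged with the run
    print("  summary:", run.summary)  # final metrics logged by the run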


πŸ™ Acknowledgements

This project builds upon the work of several fantastic open-source repositories. We extend our sincere thanks to the original authors for their contributions to the community.

📖 Citation

@article{kang2025parallelbench,
  title={ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs},
  author={Kang, Wonjun and Galim, Kevin and Oh, Seunghyuk and Lee, Minjae and Zeng, Yuchen and Zhang, Shuibai and Hooper, Coleman and Hu, Yuezhou and Koo, Hyung Il and Cho, Nam Ik and others},
  journal={arXiv preprint arXiv:2510.04767},
  year={2025}
}
