WeChat AI, Tencent
⬇️ Real-time comparison: Qwen3-8B-Instruct with vLLM (left) vs WeDLM-8B-Instruct (right) on the same prompt
Most diffusion language models use bidirectional attention, which breaks KV cache compatibility and fails to translate parallel prediction into actual speedups over optimized AR engines like vLLM.
WeDLM solves this by performing parallel mask recovery under standard causal attention, enabling:
- ✅ Native KV cache compatibility (FlashAttention, PagedAttention, CUDA Graphs)
- ✅ Direct initialization from pre-trained AR models (Qwen2.5, Qwen3)
- ✅ Real speedups measured against production-grade vLLM baselines
```bash
git clone https://github.com/tencent/WeDLM.git
cd WeDLM && bash install.sh
```

Manual Installation

```bash
# Step 1: PyTorch
pip install torch==2.8.0+cu129 --index-url https://download.pytorch.org/whl/cu129
# Step 2: flash-attn build dependencies
pip install psutil ninja packaging
# Step 3: flash-attn (requires torch first)
pip install flash-attn==2.7.4.post1 --no-build-isolation
# Step 4: WeDLM
git clone https://github.com/tencent/WeDLM.git
cd WeDLM && pip install -e .
```
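As a quick sanity check after the manual install, you can confirm that PyTorch and flash-attn import cleanly; this one-liner assumes nothing beyond the packages installed in the steps above:

```bash
# Optional sanity check: verify that torch and flash-attn import and report their versions.
python -c "import torch, flash_attn; print(torch.__version__, torch.version.cuda, flash_attn.__version__)"
```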

Docker Installation

```bash
# Pull the Docker image
docker pull aiweiliu/wedlm:v3
# Run the container with GPU support
docker run -it --gpus all -p 8080:8080 --name wedlm aiweiliu/wedlm:v3 /bin/bash
# Inside the container, run inference directly
python example.py --model tencent/WeDLM-8B-Instruct
# Inside the container, run web demo
python web_demo.py --model tencent/WeDLM-8B-Instruct
```
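If you want model downloads inside the container to persist across restarts, it may help to mount your local Hugging Face cache into the container; the path below is the default Hugging Face cache location and is an assumption about where the image stores weights:

```bash
# Optional: mount the host Hugging Face cache so downloaded weights are reused
# across runs (assumes the container uses the default ~/.cache/huggingface path).
docker run -it --gpus all -p 8080:8080 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --name wedlm aiweiliu/wedlm:v3 /bin/bash
```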

```bash
# Run simple generation
python example.py --model tencent/WeDLM-8B-Instruct
```

Example Output (NVIDIA H20):

```text
Prompt: A store sells apples for $2 each and oranges for $3 each...
Response: To determine the total amount Tom spent...
Therefore, the total amount Tom spent is $22.
==================================================
Generated tokens: 218
Time elapsed: 0.32 seconds
Speed: 689.18 tokens/s ⚡
==================================================
```
Web Demo:
```bash
python web_demo.py --model tencent/WeDLM-8B-Instruct
```

Python API:

```python
from transformers import AutoTokenizer
from wedlm import LLM, SamplingParams
# Initialize
llm = LLM(model="tencent/WeDLM-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)
# Prepare Prompt
prompt = "Solve: 2x + 5 = 13"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Generate
outputs = llm.generate([text], SamplingParams(temperature=0.0, max_tokens=512))
print(outputs[0]["text"])
```
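To decode several prompts in one call, the list-based `generate` interface above suggests batching works the same way; the snippet below is a sketch under that assumption (only the single-prompt call shown above is taken from the source), continuing from the `llm`, `tokenizer`, and `SamplingParams` objects already created:

```python
# Sketch: batched generation, assuming the list passed to generate() may
# contain more than one chat-formatted prompt.
prompts = ["Solve: 2x + 5 = 13", "A train travels 60 km in 45 minutes. What is its average speed in km/h?"]
texts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": p}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for p in prompts
]
outputs = llm.generate(texts, SamplingParams(temperature=0.0, max_tokens=512))
for prompt, out in zip(prompts, outputs):
    print(prompt, "->", out["text"])
```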

WeDLM's speedup varies by task characteristics. Structured, low-entropy tasks (math, code) see the largest gains.

| Scenario | Speedup vs vLLM | Notes |
|---|---|---|
| Math Reasoning (GSM8K, MATH) | 3-6× | Structured output, high-confidence predictions |
| Code Generation | 2-3× | Predictable syntax patterns |
| Sequential/Counting Tasks | Up to 10× | Highly deterministic outputs |
| Open-ended QA | 1.5-2× | Higher entropy limits parallel acceptance |
Note
Acceleration comes with a quality-speed tradeoff. Conservative settings preserve accuracy; aggressive settings maximize speed. See our paper for detailed analysis.
WeDLM preserves, and often improves upon, the capabilities of its base AR models.
Base Models Benchmark
| Benchmark | Qwen2.5-7B | Qwen3-8B | LLaDA-8B | Dream-7B | WeDLM-7B | WeDLM-8B |
|---|---|---|---|---|---|---|
| ARC-C | 89.93 | 92.66 | 81.14 | 88.40 | 90.70 | 92.92 |
| GSM8K | 79.23 | 85.97 | 71.80 | 75.97 | 84.76 | 90.20 |
| MATH | 43.40 | 50.80 | 28.00 | 38.00 | 48.20 | 53.60 |
| HumanEval | 59.14 | 68.90 | 31.71 | 20.12 | 68.90 | 75.00 |
| MMLU | 71.62 | 74.03 | 64.61 | 70.64 | 71.93 | 75.46 |
| Average | 67.21 | 72.61 | 55.44 | 56.91 | 70.84 | 74.72 |
Instruct Models Benchmark
| Benchmark | Qwen2.5-7B-Inst | Qwen3-8B-Inst | SDAR-8B-Inst | WeDLM-7B-Inst | WeDLM-8B-Inst |
|---|---|---|---|---|---|
| ARC-C | 86.09 | 91.47 | 91.13 | 89.59 | 92.92 |
| GSM8K | 89.91 | 89.91 | 91.66 | 87.57 | 92.27 |
| MATH | 45.00 | 69.60 | 43.40 | 55.40 | 64.80 |
| HumanEval | 76.22 | 71.95 | 76.83 | 75.00 | 80.49 |
| MMLU | 71.98 | 71.52 | 73.61 | 70.52 | 75.14 |
| Average | 71.09 | 75.12 | 74.22 | 72.78 | 77.53 |
| Model | Base | Context | Download |
|---|---|---|---|
| WeDLM-7B | Qwen2.5-7B | 32k | |
| WeDLM-7B-Instruct | Qwen2.5-7B | 32k | |
| WeDLM-8B | Qwen3-8B | 32k | |
| WeDLM-8B-Instruct | Qwen3-8B | 32k | |
Requirements: Python 3.9+, PyTorch 2.1+, CUDA 11.8+
```bash
git clone https://github.com/tencent/WeDLM.git
cd WeDLM && bash install.sh
```

Note
flash-attn requires compilation and must be installed after PyTorch.
The install.sh script handles this automatically (default: CUDA 12.9).
For other CUDA versions: CUDA_VERSION=cu124 bash install.sh
Reproduce our results using the provided scripts:
```bash
# 1. Download datasets
python -m evaluation.download_datasets --all
# 2. Run evaluation (e.g., GSM8K)
bash evaluation/evaluation_base.sh \
--model_path "tencent/WeDLM-8B-Instruct" \
--output_dir "output/" \
--datasets "gsm8k" \
--num_gpus 8
```

See evaluation/demo.sh for all benchmark commands.
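To sweep several benchmarks with the same script, a loop like the one below may work; the lowercase dataset identifiers are assumptions based on the benchmark names above, so check evaluation/demo.sh for the exact values:

```bash
# Sketch: run several benchmarks in sequence. Dataset names are assumed
# (lowercased benchmark names); see evaluation/demo.sh for the exact identifiers.
for ds in gsm8k math humaneval mmlu; do
  bash evaluation/evaluation_base.sh \
    --model_path "tencent/WeDLM-8B-Instruct" \
    --output_dir "output/${ds}/" \
    --datasets "${ds}" \
    --num_gpus 8
done
```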
For training or simple forward passes, we provide a standard HF interface.
Warning
For fast inference, use the wedlm engine (shown in Quick Start). The HF interface below is for training/forward pass convenience only.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)
inputs = tokenizer("Hello", return_tensors="pt")
out = model(**inputs)
```

WeDLM introduces Topological Reordering to perform parallel mask recovery under standard causal attention, combined with Streaming Parallel Decoding for continuous prefix commitment.
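As a rough mental model only (this is not WeDLM's implementation; `propose_block` and the confidence threshold are illustrative placeholders), prefix-committing parallel decoding under a causal model can be pictured like this:

```python
# Conceptual sketch of streaming parallel decoding under causal attention.
# NOT WeDLM's actual algorithm: `propose_block` and `confidence_threshold`
# are illustrative stand-ins for the model's parallel mask recovery step.

def streaming_parallel_decode(propose_block, prompt_ids, max_new_tokens=256,
                              block_size=8, confidence_threshold=0.9):
    """Propose a block of future tokens in one causal forward pass, then
    commit the longest confident prefix so the KV cache only ever grows."""
    committed = list(prompt_ids)
    generated = 0
    while generated < max_new_tokens:
        # One forward pass proposes (token, confidence) pairs for the next
        # `block_size` positions after the committed prefix.
        proposals = propose_block(committed, block_size)

        # Always commit at least one token so decoding makes progress,
        # then keep extending the prefix while the model stays confident.
        accepted = [proposals[0][0]]
        for token, conf in proposals[1:]:
            if conf < confidence_threshold:
                break
            accepted.append(token)

        committed.extend(accepted)
        generated += len(accepted)
    return committed[len(prompt_ids):len(prompt_ids) + max_new_tokens]
```

Under this picture, conservative thresholds commit short prefixes (closer to AR decoding) while aggressive thresholds commit long ones, which is the quality-speed tradeoff noted earlier.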
- Project Page – Interactive explanations and visualizations
- Paper – Technical details and full experimental results
If you find WeDLM useful for your research, please cite:
```bibtex
@article{liu2025wedlm,
  title={WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference},
  author={Liu, Aiwei and He, Minghua and Zeng, Shaoxun and Zhang, Linhao and Wu, Chuhan and Jia, Wei and Liu, Yuan and Yu, Yang and Zhou, Xiao and Zhou, Jie},
  journal={arXiv preprint arXiv:2512.22737},
  year={2025}
}
```
