We present a systematic evaluation of the compatibility between speculative decoding and quantization.
We also propose a hierarchical speculative decoding framework for W4A16 models, achieving a 1.31× speedup. All experiments are implemented in C/CUDA.
*Figure: speedup achieved by integrating speculative decoding and quantization.*
```bash
conda create -n specmquant python=3.11 && conda activate specmquant
# install pytorch for your platform, see https://pytorch.org
git clone https://github.com/AI9Stars/SpecMQuant --recursive && cd SpecMQuant
vim setup.py  # change arch="80" to the compute-capability code for your platform, see https://developer.nvidia.com/cuda-gpus#compute
pip install .
```
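If you are unsure which compute-capability code to use for `arch` in `setup.py`, a quick check like the one below prints it for the current GPU (a minimal sketch; it assumes PyTorch is already installed with CUDA support):

```python
# Print the compute capability of the current GPU; this is the value to put
# in arch="..." in setup.py (e.g. 80 for A100, 89 for RTX 4090).
import torch

major, minor = torch.cuda.get_device_capability()
print(f"{major}{minor}")
```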
Downloads the quantized model weights and the corresponding EAGLE model into the `models/` folder.
Alternatively, you can use one of the following external toolkits to quantize your model and then convert the resulting checkpoints.
| Toolkit | Precision | Algorithm |
| --- | --- | --- |
| AutoGPTQ | W4A16 | GPTQ |
| QQQ | W4A8 | QQQ |
| DeepCompressor | W8A8 | SmoothQuant |
| DeepCompressor | W4A8 | QoQ |
For AutoGPTQ, our framework is only compatible when `sym=True` is set in the config; if you set `desc_act=True`, you should also set `static_groups=True`.
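For reference, a minimal AutoGPTQ quantization sketch with a compatible configuration might look like the following; the model name, calibration text, and output path are placeholders, not the settings used in our experiments:

```python
# A sketch of W4A16 GPTQ quantization with AutoGPTQ using a config compatible
# with this framework (sym=True; static_groups enabled alongside desc_act).
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    sym=True,            # required for compatibility with this framework
    desc_act=True,       # optional; if enabled, also enable static_groups
    static_groups=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# A real run needs a proper calibration set; one sample is shown only to keep
# the sketch short.
examples = [tokenizer("Speculative decoding meets quantization.")]
model.quantize(examples)
model.save_quantized("models/llama3-8b-instruct-w4a16-gptq")
```

After quantizing, convert the checkpoint as described below before running the evaluation scripts.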
For W4A16, W4A8-QQQ, W4A8-QQQ-g128, and W4A8-QoQ-g128, after quantizing with the above toolkits you need to convert the model checkpoints using the scripts in `scripts/model_convert`. For models quantized with a rotation method, you also need to convert the EAGLE checkpoint with the corresponding rotation matrix using `scripts/model_convert/convert_eagle_rotation.sh`.
All scripts for MT-Bench evaluation are located in the `scripts/eval/mt_bench` folder. Here we use Llama-3-8B-Instruct as an example:
```bash
# 1. Run evaluations
bash scripts/eval/mt_bench/llama3-8b-instruct/<precision>/run_baseline.sh
bash scripts/eval/mt_bench/llama3-8b-instruct/<precision>/run_eagle.sh
# 2. Evaluate speed
bash scripts/eval/mt_bench/llama3-8b-instruct/speed_up.sh
```
Replace `<precision>` with one of: `fp16`, `w4a16`, `w4a8-qqq`, `w4a8-qqq-g128`, `w4a8-qoq`, or `w4a8-qoq-g128`.
Scripts for Spec-Bench evaluation of the W4A16 Llama-3-70B-Instruct model are located in the `scripts/eval/spec_bench` folder.
```bash
# 1. Run evaluations
bash scripts/eval/spec_bench/llama3-70b-instruct-w4a16/run_baseline.sh
bash scripts/eval/spec_bench/llama3-70b-instruct-w4a16/run_spec.sh
bash scripts/eval/spec_bench/llama3-70b-instruct-w4a16/run_eagle.sh
bash scripts/eval/spec_bench/llama3-70b-instruct-w4a16/run_hierspec.sh
# 2. Evaluate speed
bash scripts/eval/spec_bench/llama3-70b-instruct-w4a16/speedup.sh
```
We also provide performance (accuracy) evaluation for `gsm8k` and `human_eval`.
```bash
# 1. Run evaluations
bash scripts/eval/<benchmark>/llama3-8b-instruct/<precision>/run_baseline.sh
# 2. Evaluate performance
bash scripts/eval/<benchmark>/llama3-8b-instruct/check_correctness.sh
```
Replace `<benchmark>` with `gsm8k` or `human_eval`, and `<precision>` with one of the precisions listed above.
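As a rough illustration of what the correctness check does for `gsm8k`, the sketch below (not the repo's actual evaluation code) extracts the last number from a generation and compares it with the reference answer, which in GSM8K follows a `####` marker:

```python
# A minimal sketch of a GSM8K-style answer check: take the last number in the
# generated text and compare it with the gold answer after "####".
import re

def extract_last_number(text: str) -> str | None:
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return numbers[-1].replace(",", "") if numbers else None

def is_correct(generation: str, reference: str) -> bool:
    gold = reference.split("####")[-1].strip().replace(",", "")
    pred = extract_last_number(generation)
    return pred is not None and pred == gold

# Example: GSM8K references end with "#### <answer>".
print(is_correct("... so the answer is 42.", "step-by-step reasoning #### 42"))  # True
```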
Our framework is based on https://github.com/thunlp/FR-Spec.
Our experiments are based on https://github.com/SafeAILab/EAGLE.
The CUDA quantization kernels in `src/qgemm` are borrowed from:
- W4A16 Marlin kernel: https://github.com/vllm-project/vllm and https://github.com/IST-DASLab/marlin.
- W4A8-QQQ kernel: https://github.com/HandH1998/QQQ.
- W8A8 and W4A8-QoQ: https://github.com/mit-han-lab/omniserve.
The `evaluation/` folder is modified based on https://github.com/hemingkx/Spec-Bench:
- The `evaluation/gsm8k` folder integrates part of the code from https://github.com/Guangxuan-Xiao/GSM8K-eval.
- The `evaluation/humaneval` folder integrates part of the code from https://github.com/evalplus/evalplus.

The `src/flash_attn/` folder is modified based on https://github.com/Dao-AILab/flash-attention/blob/v2.4.2/csrc/flash_attn.
```bibtex
@article{zhang2025specmqaunt,
  title={Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design},
  author={Zhang, Yudi and Zhao, Weilin and Han, Xu and Zhao, Tiejun and Xu, Wang and Cao, Hailong and Zhu, Conghui},
  journal={arXiv preprint arXiv:2505.22179},
  year={2025}
}
```