Code associated with the paper:
[ACL 2024] Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding
Self-Speculative Decoding is a novel inference scheme that accelerates Large Language Models (LLMs) without additional neural network training or extra memory footprint. It maintains consistent output quality and ensures model compatibility, making it a plug-and-play, cost-effective solution for LLM inference acceleration.
Self-Speculative Decoding involves a two-stage process:
- Drafting stage: Generates draft tokens by selectively skipping certain intermediate layers.
- Verification stage: Employs the original LLM to validate draft tokens in one forward pass.
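The two stages can be illustrated with a minimal, self-contained sketch. It is a toy under stated assumptions, not the code in decoding.py: `next_token` is a hypothetical stand-in for one greedy LLM step, and `NUM_LAYERS`, `VOCAB_SIZE`, `draft_len`, and the skip set are made up for illustration. Only greedy decoding is covered, and the verification loop stands in for what a real model would do in a single forward pass over all draft positions.

```python
from typing import FrozenSet, List, Sequence

VOCAB_SIZE = 32   # hypothetical toy vocabulary
NUM_LAYERS = 8    # hypothetical toy depth


def next_token(context: Sequence[int], skip: FrozenSet[int] = frozenset()) -> int:
    """Toy stand-in for one greedy LLM step; layers listed in `skip` are bypassed."""
    hidden = sum((pos + 1) * tok for pos, tok in enumerate(context[-4:]))
    for layer in range(NUM_LAYERS):
        if layer in skip:
            continue  # drafting: this layer is not executed
        hidden += (hidden * (layer + 1)) % 5  # small per-layer update
    return (hidden // 8) % VOCAB_SIZE


def greedy_decode(prompt: Sequence[int], max_new_tokens: int) -> List[int]:
    """Baseline: one full-model step per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


def self_speculative_decode(
    prompt: Sequence[int],
    max_new_tokens: int,
    skip: FrozenSet[int],
    draft_len: int = 4,
) -> List[int]:
    tokens = list(prompt)
    target_len = len(tokens) + max_new_tokens
    while len(tokens) < target_len:
        # Drafting stage: generate up to `draft_len` tokens with layers skipped.
        k = min(draft_len, target_len - len(tokens))
        draft: List[int] = []
        for _ in range(k):
            draft.append(next_token(tokens + draft, skip))

        # Verification stage: the full model checks every draft position.
        # A real LLM scores all positions in one forward pass; the toy loops.
        for i in range(k):
            verified = next_token(tokens + draft[:i])
            if verified != draft[i]:
                # First mismatch: keep the accepted prefix plus the full
                # model's own token, and discard the remaining drafts.
                tokens.extend(draft[:i])
                tokens.append(verified)
                break
        else:
            tokens.extend(draft)  # every draft token was accepted
    return tokens
```

Because a draft token is kept only when it equals the full model's greedy choice for the same prefix, `self_speculative_decode(prompt, n, skip)` returns the same sequence as `greedy_decode(prompt, n)` for any skip set; the skip set only affects how many draft tokens are accepted per round.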
If you find this code and paper useful in your research, please consider citing:
@article{zhang2023draft,
title={Draft \& Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding},
author={Jun Zhang and Jue Wang and Huan Li and Lidan Shou and Ke Chen and Gang Chen and Sharad Mehrotra},
year={2023},
eprint={2309.08168},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
- PyTorch
- Transformers
- NumPy
- More in ssd.yml
- searching.py: Selection of skipped layers by Bayesian optimization
- decoding.py: Core process of self-speculative decoding
- modeling_llama.py: Model structure with self-speculative decoding
- search.ipynb: Main notebook that searches for the layers to skip
- evaluate_sum.ipynb: Main notebook that evaluates self-speculative decoding on a text generation (summarization) task
- evaluate_code.ipynb: Main notebook that evaluates self-speculative decoding on a code generation task
- skip_layers.json: Layers skipped by the draft model for each base model
- ssd.yml: Environment configuration
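To give a feel for what the skipped-layer search optimizes, here is a toy sketch of the problem. It does not reproduce searching.py: plain random search is swapped in for Bayesian optimization, and the objective `time_per_token`, the `IMPORTANCE` values, `NUM_LAYERS`, and `DRAFT_LEN` are all hypothetical, chosen only to show the trade-off between a cheaper draft model and a lower acceptance rate.

```python
import random
from typing import FrozenSet, Tuple

NUM_LAYERS = 8   # hypothetical toy depth
DRAFT_LEN = 4    # hypothetical number of draft tokens per round
# Hypothetical per-layer importance: skipping an important layer lowers the
# probability that a draft token is accepted during verification.
IMPORTANCE = [0.30, 0.02, 0.03, 0.02, 0.25, 0.03, 0.02, 0.20]


def time_per_token(skip: FrozenSet[int]) -> float:
    """Toy objective: simulated cost (in layer executions) per generated token."""
    accept = 1.0
    for layer in skip:
        accept *= 1.0 - IMPORTANCE[layer]
    draft_cost = DRAFT_LEN * (NUM_LAYERS - len(skip))  # drafts run kept layers only
    verify_cost = NUM_LAYERS                           # one full forward pass
    # Expected tokens per round: the accepted draft prefix plus the one token
    # the verification pass itself yields.
    expected_tokens = sum(accept ** i for i in range(DRAFT_LEN + 1))
    return (draft_cost + verify_cost) / expected_tokens


def random_search(num_trials: int, seed: int = 0) -> Tuple[FrozenSet[int], float]:
    """Random search over skip sets (a stand-in for Bayesian optimization)."""
    rng = random.Random(seed)
    best_skip: FrozenSet[int] = frozenset()
    best_cost = time_per_token(best_skip)  # baseline: skip nothing
    for _ in range(num_trials):
        candidate = frozenset(l for l in range(NUM_LAYERS) if rng.random() < 0.5)
        cost = time_per_token(candidate)
        if cost < best_cost:
            best_skip, best_cost = candidate, cost
    return best_skip, best_cost
```

In this toy model, skipping nothing costs exactly `NUM_LAYERS` layer executions per token, which matches plain autoregressive decoding, so any skip set with a lower value represents a simulated speedup. A Bayesian optimizer would replace the random proposals with proposals guided by previously evaluated skip sets, and a real objective would measure actual decoding time rather than a simulated cost.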
- Set up the environment according to ssd.yml;
- Run search.ipynb to find the layers to skip, which define the draft model;
- Execute evaluate_sum.ipynb to evaluate self-speculative decoding on summarization;
- Execute evaluate_code.ipynb to evaluate self-speculative decoding on code generation.