This is the official implementation of "SPEED-Q: Staged Processing with Enhanced Distillation Towards Efficient Low-Bit On-Device VLM Quantization" (AAAI 2026), a novel framework for low-bit, weight-only quantization of on-device VLMs.
- [2026.01.16] 🔥 Our code is now publicly available on GitHub. Models will be released later.
- [2025.11.12] 🔥 Our paper is available on arXiv.
These demos showcase fully offline on-device inference: all computation runs locally on the edge device, with no network connectivity.
- Tested GPUs: A100 (80 GB)
```bash
# Install libraries
$ pip install -r requirements.txt
```

| Models | Download Link |
|---|---|
| InternVL3-1B | 🤗 Huggingface |
| InternVL3-2B | 🤗 Huggingface |
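The checkpoints can also be fetched programmatically with `huggingface_hub`; the repo IDs below are assumptions based on the official InternVL3 releases under the OpenGVLab organization, so double-check them against the links in the table above:

```python
from huggingface_hub import snapshot_download

# Repo IDs are assumptions; verify against the download links above.
for repo_id in ("OpenGVLab/InternVL3-1B", "OpenGVLab/InternVL3-2B"):
    snapshot_download(repo_id=repo_id, local_dir=f"checkpoints/{repo_id.split('/')[-1]}")
```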
The data format follows https://huggingface.co/datasets/Ahren09/llava_zh; details of the datasets used can be found in the paper's appendix. The final list of datasets is given in data/training_dataset.json.
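For reference, a single record in the LLaVA-style conversation format used by llava_zh looks roughly like the sketch below; the field values are placeholders for illustration, so verify the exact schema against data/training_dataset.json:

```python
# Illustrative LLaVA-style training record (schema assumed from
# https://huggingface.co/datasets/Ahren09/llava_zh; values are placeholders).
sample = {
    "id": "000000123456",
    "image": "images/000000123456.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is shown in this picture?"},
        {"from": "gpt", "value": "A tabby cat sleeping on a windowsill."},
    ],
}
```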
Below is an example of the quantization process for InternVL3-1B.
For quantization of the ViT, we use block-wise AdaRound; the code is based on https://github.com/yhhhli/BRECQ. The quantized ViT weights will be uploaded later.
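For background, here is a minimal sketch of AdaRound's learned rounding (Nagel et al., 2020), which BRECQ applies block by block; the variable names are ours, and this is not the repository's exact implementation:

```python
import torch

def adaround_soft_quant(w, scale, v, zeta=1.1, gamma=-0.1):
    """AdaRound-style soft rounding for 2-bit signed weights: each weight
    learns to round down or up via a per-weight variable v passed through
    a rectified sigmoid (illustrative sketch only)."""
    h = torch.clamp(torch.sigmoid(v) * (zeta - gamma) + gamma, 0.0, 1.0)
    w_int = torch.clamp(torch.floor(w / scale) + h, -2, 1)  # 2-bit signed range [-2, 1]
    return w_int * scale

# During block-wise calibration, v is optimized to minimize the block's
# output reconstruction error plus a regularizer pushing each h toward 0 or 1.
```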
```bash
$ bash stage2_internvl3_1b_2bit_proj.sh
```

- `SAVE_DIR`: Path to save logs and weights
- `MODEL_PATH`: Path to the VLM
- `TEACHER_MODEL_PATH`: Path to the bf16 teacher VLM
- `QUANT_VIT_PATH`: Path to the quantized ViT weights
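Stage 2 distills knowledge from the bf16 teacher into the quantized student. As a rough illustration, here is a generic temperature-scaled KL-divergence logit-distillation step; SPEED-Q's actual enhanced-distillation objective is described in the paper and may differ:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Generic temperature-scaled KL distillation loss (illustrative only;
    not SPEED-Q's exact objective)."""
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (T * T)

# The bf16 teacher runs under torch.no_grad(); only the quantized
# student's parameters are updated.
```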
```bash
$ bash stage3_internvl3_1b_2bit_qat.sh
```

The quantized SPEED-Q weights will be uploaded later.
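Stage 3 performs quantization-aware training: in each forward pass the weights are fake-quantized (quantized, then dequantized back to float), while a straight-through estimator lets gradients pass through unchanged. A minimal, generic sketch, assuming per-channel asymmetric weight quantization (not the repository's exact quantizer):

```python
import torch

def fake_quant_weight(w, n_bits=2):
    """Per-output-channel asymmetric weight-only fake quantization with a
    straight-through estimator (generic sketch, not SPEED-Q's exact code).
    Expects a 2-D weight of shape [out_features, in_features]."""
    qmin, qmax = 0, 2 ** n_bits - 1
    w_min = w.amin(dim=1, keepdim=True)
    w_max = w.amax(dim=1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / (qmax - qmin)
    zero = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale) + zero, qmin, qmax)
    w_dq = (q - zero) * scale                # dequantized floating-point weight
    return w + (w_dq - w).detach()           # STE: forward uses w_dq, gradient is identity
```

The dequantized floating-point form (`w_dq` above) is what the "fake-quant" weights in the next step refer to.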
```bash
$ bash save_fake_quant.sh
```

We evaluate the quantized VLMs using VLMEvalKit.
```bash
model_name="InternVL3-1B-SPEED-Q-2bit"
python run.py --data HallusionBench --model ${model_name} --verbose
python run.py --data AI2D_TEST --model ${model_name} --verbose
python run.py --data OCRBench --model ${model_name} --verbose
python run.py --data MMBench_DEV_EN_V11 --model ${model_name} --verbose
python run.py --data MMBench_DEV_CN_V11 --model ${model_name} --verbose
python run.py --data MMStar --model ${model_name} --verbose
python run.py --data MMMU_DEV_VAL --model ${model_name} --verbose
python run.py --data ScienceQA_VAL --model ${model_name} --verbose
python run.py --data SEEDBench_IMG --model ${model_name} --verbose
```

| Status | Milestone |
|---|---|
| ✅ | Open-source release of SPEED-Q code on GitHub |
| 🚀 | Release the InternVL3-1B-2bit/4bit-SPEED-Q models on Hugging Face, including both ViT and VLM components with quantized weights and corresponding dequantized floating-point weights |
| 🚀 | Provide comprehensive documentation and code for quantization parameters |
If you find our work useful for your research, please consider citing the paper:
```bibtex
@misc{guo2025speedq,
      title={SPEED-Q: Staged Processing with Enhanced Distillation Towards Efficient Low-Bit On-Device VLM Quantization},
      author={Tianyu Guo and Shanwei Zhao and Shiai Zhu and Chenguang Ma},
      year={2025},
      eprint={2511.08914},
      archivePrefix={arXiv}
}
```
- InternVL: https://github.com/OpenGVLab/InternVL
- VLMEvalKit: https://github.com/open-compass/VLMEvalKit
The models in this repository are licensed under the Apache 2.0 License.





