
SPEED-Q

This is the official implementation of "SPEED-Q: Staged Processing with Enhanced Distillation Towards Efficient Low-Bit On-Device VLM Quantization" (AAAI 2026), a novel framework for low-bit, weight-only quantization of on-device VLMs.

📣 Updates

  • [2026.01.16] 🔥 Our code is now publicly available on GitHub. Models will be released soon.
  • [2025.11.12] 🔥 Our paper is now available on arXiv.

🌅 Gallery

🎥 Demo

These demos showcase on-device inference in a fully offline environment: all computation runs locally on the edge device, without any network connectivity.

📊 Qualitative Results

🚀 Quick Start

🛠️ Environment Setup

  • Python >= 3.10, PyTorch >= 2.0
  • Tested GPUs: A100 (80G)
# Install libraries
$ pip install -r requirements.txt

🧱 Model and Data Preparation

Model download links:
  • InternVL3-1B: 🤗 Hugging Face
  • InternVL3-2B: 🤗 Hugging Face

The data format follows https://huggingface.co/datasets/Ahren09/llava_zh; details of the datasets used can be found in the paper's appendix. The final list of training datasets is given in data/training_dataset.json.
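
For reference, below is a minimal sketch of a LLaVA-style training sample, assuming the conventional conversation schema used by datasets such as llava_zh; the exact field names and values are illustrative and may differ from the data actually used here.

# Hypothetical LLaVA-style sample (field names follow the common LLaVA
# conversation convention; the exact schema used by SPEED-Q may differ).
sample = {
    "id": "000000001",
    "image": "images/000000001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is shown in this picture?"},
        {"from": "gpt", "value": "A cat is sitting on a windowsill."},
    ],
}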

📝 Training

Below is an example of the quantization process for InternVL3-1B.

Stage 1: the ViT is quantized using an image-only calibration set.

For ViT quantization we use block-wise AdaRound; the code is based on https://github.com/yhhhli/BRECQ. The quantized ViT weights will be uploaded later.
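
For intuition, here is a minimal sketch of AdaRound's learnable soft rounding, following the AdaRound/BRECQ formulation. The block-wise reconstruction loop, calibration, and hyperparameters live in the referenced BRECQ code; the function names below are illustrative only.

import torch

def rectified_sigmoid(v, gamma=-0.1, zeta=1.1):
    # Stretched sigmoid used by AdaRound; maps the learnable tensor v to a
    # soft rounding offset in [0, 1].
    return torch.clamp(torch.sigmoid(v) * (zeta - gamma) + gamma, 0, 1)

def adaround_fake_quant(w, scale, zero_point, v, n_bits=2):
    # Soft-quantize weights: floor(w / s) plus a learnable rounding offset,
    # clamp to the integer grid, then dequantize back to float.
    qmin, qmax = 0, 2 ** n_bits - 1
    w_int = torch.floor(w / scale) + rectified_sigmoid(v)
    w_q = torch.clamp(w_int + zero_point, qmin, qmax)
    return (w_q - zero_point) * scale

def rounding_regularizer(v, beta=2.0):
    # Pushes each soft rounding value toward exactly 0 or 1 as optimization ends.
    h = rectified_sigmoid(v)
    return (1 - (2 * h - 1).abs().pow(beta)).sum()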

Stage 2: the projector is trained to better align with the quantized ViT (qViT); a minimal sketch of the alignment objective follows the variable list below.

$ bash stage2_internvl3_1b_2bit_proj.sh
  • SAVE_DIR: Directory in which to save logs and weights
  • MODEL_PATH: Path to the VLM
  • TEACHER_MODEL_PATH: Path to the bf16 teacher VLM
  • QUANT_VIT_PATH: Path to the quantized ViT weights
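
Conceptually, this stage distills the bf16 teacher into the qViT + projector student. The sketch below shows one possible alignment objective, assuming an MSE feature-distillation loss on projected visual features; the module names are illustrative and are not the repo's API.

import torch
import torch.nn.functional as F

def projector_alignment_loss(qvit, student_proj, teacher_vit, teacher_proj, pixel_values):
    # The quantized ViT (qViT) and the bf16 teacher are frozen; only
    # student_proj receives gradients. The student's projected visual
    # features are pulled toward the teacher's projected features.
    with torch.no_grad():
        teacher_feat = teacher_proj(teacher_vit(pixel_values))
    student_feat = student_proj(qvit(pixel_values))
    return F.mse_loss(student_feat, teacher_feat)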

Stage 3: with the qViT frozen, the projector and LLM undergo quantization-aware training.

$ bash stage3_internvl3_1b_2bit_qat.sh
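
For intuition, below is a minimal sketch of weight-only fake quantization with a straight-through estimator (STE), the standard mechanism underlying quantization-aware training. The actual training loop, weight grouping, and distillation losses are handled by the stage-3 script; this class is illustrative only.

import torch

class WeightFakeQuant(torch.autograd.Function):
    # Weight-only fake quantization with a straight-through estimator:
    # the forward pass uses quantize-dequantize weights, the backward pass
    # passes gradients unchanged to the latent floating-point weights.
    @staticmethod
    def forward(ctx, w, scale, zero_point, n_bits=2):
        qmin, qmax = 0, 2 ** n_bits - 1
        q = torch.clamp(torch.round(w / scale) + zero_point, qmin, qmax)
        return (q - zero_point) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: gradient w.r.t. w is passed as-is.
        return grad_output, None, None, None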

🤗 Evaluation

The quantized SPEED-Q weights will be uploaded later.

Dequantize to the floating-point model.

$ bash save_fake_quant.sh
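
Under the hood, this step folds the stored integer weights and quantization parameters back into ordinary floating-point tensors. The sketch below assumes asymmetric weight quantization with broadcastable scales and zero points; the real checkpoint layout and grouping are handled by save_fake_quant.sh.

import torch

def dequantize_weight(q_weight, scale, zero_point):
    # w ≈ (q - z) * s, with scale/zero_point broadcast per channel or per
    # group depending on how they were stored at quantization time.
    return (q_weight.float() - zero_point) * scale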

Eval with VLMEvalKit.

We evaluate the quantized VLMs using VLMEvalKit.

model_name="InternVL3-1B-SPEED-Q-2bit"
python run.py --data HallusionBench --model ${model_name} --verbose
python run.py --data AI2D_TEST --model ${model_name} --verbose
python run.py --data OCRBench --model ${model_name} --verbose
python run.py --data MMBench_DEV_EN_V11 --model ${model_name} --verbose
python run.py --data MMBench_DEV_CN_V11 --model ${model_name} --verbose
python run.py --data MMStar --model ${model_name} --verbose
python run.py --data MMMU_DEV_VAL --model ${model_name} --verbose
python run.py --data ScienceQA_VAL --model ${model_name} --verbose
python run.py --data SEEDBench_IMG --model ${model_name} --verbose

📝 TODO List

  • ✅ Open-source release of SPEED-Q code on GitHub
  • 🚀 Release the InternVL3-1B-2bit/4bit-SPEED-Q models on Hugging Face, including both ViT and VLM components with quantized weights and corresponding dequantized floating-point weights
  • 🚀 Provide comprehensive documentation and code for quantization parameters

📒 Citation

If you find our work useful for your research, please consider citing the paper:

@misc{guo2025speedq,
  title={SPEED-Q: Staged Processing with Enhanced Distillation Towards Efficient Low-Bit On-Device VLM Quantization},
  author={Tianyu Guo and Shanwei Zhao and Shiai Zhu and Chenguang Ma},
  year={2025},
  eprint={2511.08914},
  archivePrefix={arXiv}
}

📜 Reference

🔑 License

The models in this repository are licensed under the Apache 2.0 License.
