The Key-Value (KV) cache introduces substantial memory overhead during large language model (LLM) inference. Although existing vector quantization (VQ) methods reduce KV cache usage and provide flexible representational capacity across bit-widths, they suffer severe performance degradation at ultra-low bit-widths because key cache outliers hinder effective codebook utilization. To address this challenge, we propose VecInfer, a novel VQ method that aggressively compresses the KV cache while enabling efficient inference. By applying smooth and Hadamard transformations, VecInfer suppresses outliers in the key cache, enabling the codebook to comprehensively cover the original data distribution and thereby reducing quantization difficulty. To facilitate efficient deployment, we design an optimized CUDA kernel that fuses computation with dequantization to minimize memory-access overhead. Extensive evaluations demonstrate that VecInfer consistently outperforms existing quantization baselines on both long-context understanding and mathematical reasoning tasks. With only 2-bit quantization, VecInfer achieves performance comparable to full precision while delivering substantial inference speedups.
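For intuition, here is a minimal PyTorch sketch of the outlier-suppression idea: a calibrated per-channel smoothing scale followed by an orthonormal Hadamard rotation flattens key-cache outliers before codebook lookup. The helper, the toy shapes, and the abs-max scale are illustrative assumptions, not the repository's actual implementation.

```python
# Illustrative sketch only: per-channel smoothing + Hadamard rotation of the key cache.
import torch


def hadamard_matrix(d: int) -> torch.Tensor:
    """Sylvester construction of a d x d Hadamard matrix (d must be a power of 2)."""
    H = torch.ones(1, 1)
    while H.shape[0] < d:
        H = torch.cat([torch.cat([H, H], dim=1), torch.cat([H, -H], dim=1)], dim=0)
    return H


d = 128                              # head dimension ("d" in the model configs below)
keys = torch.randn(4096, d)          # toy key cache: (num_tokens, head_dim)
keys[:, 7] *= 30.0                   # inject an outlier channel

# 1) Smooth: divide each channel by a calibrated per-channel scale (abs-max here).
scale = keys.abs().amax(dim=0).clamp(min=1e-5)
keys_smooth = keys / scale

# 2) Rotate: an orthonormal Hadamard transform spreads residual outlier energy
#    evenly across channels, so a small codebook can cover the distribution.
H = hadamard_matrix(d) / d ** 0.5
keys_rot = keys_smooth @ H

print("largest channel magnitude before:", keys.abs().amax(dim=0).max().item())
print("largest channel magnitude after :", keys_rot.abs().amax(dim=0).max().item())
```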
Note
Environment parameters: CUDA 12.4
Create a Conda Environment:
conda create -n vecinfer python=3.12
conda activate vecinfer
Install Dependencies:
pip install -r requirements.txt
cd scripts/utils/evaluation/latex2sympy && pip install -e .
Install Fast Hadamard Transform:
git clone https://github.com/Dao-AILab/fast-hadamard-transform.git
cd fast-hadamard-transform && git checkout v1.0.4
python setup.py install && cd ..
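As a quick sanity check of the install (assuming the package exposes `hadamard_transform(x, scale)` on CUDA tensors, as in v1.0.4):

```python
# Sanity check for fast-hadamard-transform; requires a CUDA GPU.
import torch
from fast_hadamard_transform import hadamard_transform

x = torch.randn(8, 128, device="cuda", dtype=torch.float16)
y = hadamard_transform(x, scale=1.0 / 128 ** 0.5)  # scale makes the transform orthonormal
print(y.shape)  # torch.Size([8, 128])
```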
Compile CUDA Extensions:
make bindings
Calibrate Key Cache Channel Scales:
python scripts/utils/calibration.py --model_path /PATH/TO/MODEL --output_dir ./scales
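The sketch below shows one common way such per-channel scales can be derived (an abs-max statistic over keys captured from calibration prompts). The actual `calibration.py` may compute a different statistic and handles per-layer/per-head bookkeeping, so treat the names and file layout here as assumptions.

```python
# Hypothetical illustration of per-channel key-scale calibration.
import os
import torch


def channel_scales(key_states: list[torch.Tensor], eps: float = 1e-5) -> torch.Tensor:
    """key_states: (num_tokens, head_dim) key tensors captured from calibration
    prompts; returns a (head_dim,) per-channel scale."""
    stacked = torch.cat(key_states, dim=0)
    return stacked.abs().amax(dim=0).clamp(min=eps)


# Toy usage: random tensors stand in for keys captured via forward hooks.
fake_keys = [torch.randn(512, 128) for _ in range(4)]
scales = channel_scales(fake_keys)
os.makedirs("./scales", exist_ok=True)
torch.save(scales, "./scales/layer0_head0.pt")  # hypothetical file layout
print(scales.shape)  # torch.Size([128])
```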
Configure Model Parameters:
Edit a configuration file under `scripts/modeldb/configs/` with the following structure:
{
  "model_name": "llama-3.1-8b-instruct",
  "max_length": 127500,
  "d": 128,
  "n_heads": 32,
  "n_layers": 32,
  "scales_path": "/PATH/TO/SCALES",
  "folder": "/PATH/TO/MODEL"
}
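Before launching a run, it can help to sanity-check the config. The snippet below only verifies the fields shown in the example above; the filename is hypothetical.

```python
# Minimal config validation; field names follow the example above.
import json

REQUIRED = {"model_name", "max_length", "d", "n_heads", "n_layers", "scales_path", "folder"}

with open("scripts/modeldb/configs/llama-3.1-8b-instruct.json") as f:  # hypothetical filename
    cfg = json.load(f)

missing = REQUIRED - cfg.keys()
assert not missing, f"missing config fields: {missing}"
print(cfg["model_name"], cfg["d"], cfg["n_heads"], cfg["n_layers"])
```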
We provide a script to generate the codebooks for the KV caches of different models:
make training
Main arguments explanation:
- `--dataset`: Name of the dataset used to build the clustering codebook.
- `-f`: Path to the model configuration file.
- `-M_key` / `-M_value`: Each sub-vector has a length of `128 / M`.
- `--nbits_key` / `--nbits_value`: Each codebook contains `2 ** nbits` sub-vectors.
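For intuition, these arguments roughly map onto standard product-quantization codebook training: each 128-dimensional head vector is split into M sub-vectors of length 128 / M, and k-means with 2 ** nbits centroids builds one codebook per sub-vector position. The sketch below (using scikit-learn's KMeans for brevity) is a simplified stand-in for the actual training script.

```python
# Simplified product-quantization codebook training, illustrating -M_* and --nbits_*.
import torch
from sklearn.cluster import KMeans

d, M, nbits = 128, 32, 8             # sub-vector length = d / M = 4; 2**8 = 256 codewords
vectors = torch.randn(20000, d)      # toy key/value vectors drawn from a calibration set

codebooks = []
for m in range(M):
    sub = vectors[:, m * (d // M):(m + 1) * (d // M)].numpy()
    km = KMeans(n_clusters=2 ** nbits, n_init=4, random_state=0).fit(sub)
    codebooks.append(torch.tensor(km.cluster_centers_))  # (2**nbits, d // M)

print([tuple(cb.shape) for cb in codebooks])
```

Under the usual product-quantization accounting (ignoring codebook storage), each sub-vector is stored with nbits bits, i.e. nbits / (128 / M) bits per element; for example, nbits = 8 with sub-vectors of length 4 corresponds to 2-bit KV cache quantization.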
Tip
We also release codebooks for models from the Llama, Mistral, and Qwen families on Hugging Face; they are generated using the Qasper dataset.
| Base Model | HF Link |
|---|---|
| Llama-3.1-8B-Instruct | Link |
| Mistral-7B-Instruct-v0.3 | Link |
| Qwen2.5-14B-Instruct | Link |
| Qwen3-8B | Link |
| DeepSeek-R1-Distill-Llama-8B | Link |
| DeepSeek-R1-Distill-Qwen-14B | Link |
Download the datasets:
python scripts/utils/download.py --dataset longbench
Run LongBench with VecInfer:
make longbench
Main arguments explanation:
- `--dataset`: Name of the dataset used for evaluation.
- `--cluster_dataset`: Name of the dataset used to build the clustering codebook.
- `-f`: Path to the model configuration file.
- `-M_key` / `-M_value`: Each sub-vector has a length of `128 / M`.
- `--nbits_key` / `--nbits_value`: Each codebook contains `2 ** nbits` sub-vectors.
Note
Results will be saved to `scripts/modeldb/results.jsonl` for further analysis.
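For example, to skim the last few runs (the field names in `results.jsonl` are an assumption; adjust to the actual schema):

```python
# Skim scripts/modeldb/results.jsonl; field names here are illustrative.
import json

with open("scripts/modeldb/results.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

for record in records[-5:]:
    print({k: record[k] for k in ("dataset", "score") if k in record})
```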
To run the MATH benchmark, simply run:
make math_evaluation
- `--dataset`: Name of the dataset used for evaluation.
- `--cluster_dataset`: Name of the dataset used to build the clustering codebook.
- `-f`: Path to the model configuration file.
- `-M_key` / `-M_value`: Each sub-vector has a length of `128 / M`.
- `--nbits_key` / `--nbits_value`: Each codebook contains `2 ** nbits` sub-vectors.
- `--save_path`: Path to the output directory.
To evaluate the benchmark results, run:
cd scripts/utils && bash eval_math.sh
Note
Results will be saved to `./math_result` for further analysis.
End-to-end Efficiency:
make e2e
Latency Breakdown:
make breakdown
Kernel-level Efficiency:
cd scripts/modeldb/bindings
bash run_tests.sh
Note
Alternatively, you can profile the kernels with NVIDIA Nsight Compute:
ncu --target-processes all --set full --export result.ncu-rep python debug.py
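If you only need coarse timings rather than a full Nsight profile, the generic CUDA-event pattern below measures average kernel latency; it is not tied to the repository's benchmark scripts, and the matmul is just a stand-in workload.

```python
# Generic CUDA-event timing for micro-benchmarking a kernel or fused op.
import torch


def time_cuda(fn, warmup: int = 10, iters: int = 100) -> float:
    """Average latency of fn() in milliseconds."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters


a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
print(f"fp16 matmul: {time_cuda(lambda: a @ b):.3f} ms")
```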
@misc{yao2025vecinferefficientllminference,
      title={VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization},
      author={Dingyu Yao and Chenxu Yang and Zhengyang Tong and Zheng Lin and Wei Liu and Jian Luan and Weiping Wang},
      year={2025},
      eprint={2510.06175},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.06175},
}