Vision-Token-Reduction-Survey

A Comprehensive Survey and Resource Collection of Vision Token Reduction Techniques for Multimodal Large Models

📌 Project Overview

With the explosive growth of multimodal large models (such as LLaVA, Flamingo, BLIP-2, and Qwen-VL), efficiently reducing the number of visual tokens has become a key technique for cutting computational cost and speeding up inference. This repository systematically collects, analyzes, and compares cutting-edge methods and advances in the field of visual token compression.

Introduction

Current multimodal large language models (MLLMs) typically consist of a visual encoder, a connector, and a large language model. In MLLMs, more visual tokens provide richer visual information and significantly improve model performance. However, due to the quadratic complexity of the transformer's self-attention, a large number of visual tokens results in significant computational and memory consumption.
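To make the quadratic-cost point concrete, here is a small back-of-the-envelope calculation (illustrative numbers, not from any particular paper): with a standard transformer layer, pruning visual tokens from 576 to 144 shrinks the quadratic attention-map cost 16×, while the linear projection cost shrinks only 4×.

```python
# Toy arithmetic (illustrative): only the n^2 attention-map term of a
# transformer layer shrinks quadratically when visual tokens are pruned;
# the QKV/output projection term shrinks linearly.
def proj_flops(n: int, d: int) -> int:
    return 4 * n * d * d          # QKV + output projections: O(n)

def attn_map_flops(n: int, d: int) -> int:
    return 2 * n * n * d          # QK^T and attention-weighted V: O(n^2)

d = 4096                          # hidden size of a 7B-scale LLM
# Dropping 576 -> 144 visual tokens: 4x fewer projection FLOPs...
print(proj_flops(576, d) // proj_flops(144, d))          # 4
# ...but 16x fewer attention-map FLOPs.
print(attn_map_flops(576, d) // attn_map_flops(144, d))  # 16
```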

Core Value: Enable researchers to quickly grasp the progress in the field.

🗂️ Repository structure

Vision-Token-Reduction-Survey
├── papers_summaries/ 
├── methods_comparison/ 
├── datasets/ 
├── tech_reports_blogposts/ 
├── resources/ 
├── CONTRIBUTING.md
└── README.md 
📄 Paper List

| Paper Title | One-sentence Abstract | Date | Venue |
| --- | --- | --- | --- |
| GreedyPrune: Retenting Critical Visual Token Set for Large Vision Language Models | We propose GreedyPrune, a training-free visual token pruning method that jointly optimizes semantic saliency and visual diversity through a combinatorial optimization framework, achieving state-of-the-art accuracy and reduced inference latency across multimodal tasks and models. | 202506 | arXiv (preprint) |
| SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration [PDF] | We propose SP-VLA, a unified framework for accelerating Vision-Language-Action (VLA) models through joint model scheduling and token pruning, effectively reducing both temporal redundancy in sequential action generation and spatial redundancy in visual input while maintaining high accuracy, achieving up to 1.5× acceleration with less than 3% accuracy drop across multiple tasks. | 202506 | arXiv (preprint) |
| Diversity-Guided MLP Reduction for Efficient Large Vision Transformers | This paper proposes a Diversity-Guided MLP Reduction (DGMR) method to significantly compress large vision transformers by pruning redundant neurons in MLP modules while preserving weight diversity, achieving over 57.0% parameter and FLOPs reduction with near-lossless performance across multiple state-of-the-art models, including a 71.5% reduction for EVA-CLIP-E without performance degradation. | 202506 | arXiv (preprint) |
| Learning Compact Vision Tokens for Efficient Large Multimodal Models [PDF] [Github] | This paper proposes a Spatial Token Fusion (STF) method and a Multi-Block Token Fusion (MBTF) module to reduce vision token sequences and enhance multi-granularity feature representation, achieving significant inference acceleration with minimal performance loss in large multimodal models. | 202506 | arXiv (preprint) |
| Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration [PDF] | This paper proposes a many-to-many Token Transforming framework for vision transformers, unifying existing token reduction methods into an explicit matrix transformation form, which minimizes information loss and enables training-free acceleration, achieving significant FLOPs reduction, inference speedup, and improved performance across various vision tasks such as segmentation, object detection, depth estimation, and language model generation. | 202506 | arXiv (preprint) |
| Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning [PDF] | This paper introduces LLaVA-Meteor, a novel visual instruction tuning framework that achieves significant visual token compression (75%–95%) and improved efficiency while maintaining or enhancing performance across 12 vision-language benchmarks through a Top-Down Compression paradigm, Flash Global Fusion module, and Visual-Native Selection mechanism. | 202505 | arXiv (preprint) |
| VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models [PDF] | This work proposes VScan, a two-stage visual token reduction framework for large vision-language models (LVLMs), achieving significant inference acceleration (2.91× speedup in prefilling, 10× FLOPs reduction) with minimal performance loss (95.4% retention) through complementary global/local token merging and intermediate-layer pruning. | 202505 | arXiv (preprint) |
| PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models [PDF] [Github] | We introduce PACT, a method that reduces inference time and memory usage in visual language models by pruning irrelevant tokens and merging visually redundant ones early in the model using a novel importance metric and Distance Bounded Density Peak Clustering. | 202504 | CVPR 2025 |
| Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs [PDF] [Github] | TRIM (Token Reduction using CLIP Metric) enhances the efficiency of Multimodal Large Language Models (MLLMs) by reducing image tokens without performance loss, validated across 12 datasets, advancing sustainable high-performance model development. | 202409 | COLING 2025 |
| TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model [PDF] | We propose TopV, a training-free token pruning method for Vision-Language Models that formulates pruning as an optimization problem using a visual-aware cost function, achieving efficient inference with reduced memory and computational cost while maintaining performance. | 202503 | CVPR 2025 |
| DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models [PDF] [Github] | We propose DivPrune, a token pruning method for Large Multimodal Models that formulates pruning as a Max-Min Diversity Problem to maximize diversity among selected visual tokens, achieving state-of-the-art accuracy with reduced latency and memory usage across 16 image- and video-language datasets. | 202503 | CVPR 2025 |
| InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression [PDF] [Github] | We propose InternVL-X, a vision-language model that improves performance and efficiency through three visual token compression techniques (PVTC, LVTC, and RVTC), enabling state-of-the-art results with significantly reduced computational cost by using 20% or fewer visual tokens. | 202503 | arXiv (preprint) |
| An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [PDF] [Github] | We propose FastV, a plug-and-play method for optimizing computational efficiency in Large Vision-Language Models (LVLMs) by learning adaptive attention patterns and pruning visual tokens, achieving significant reductions in FLOPs (e.g., 45% for LLaVA-1.5-13B) while maintaining strong performance across image and video understanding tasks, making it highly suitable for edge deployment and commercial applications. | 202403 | ECCV 2024 (Oral) |
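Several of the training-free methods above (FastV in particular) score visual tokens by the attention they receive and drop the low-scoring ones. A minimal sketch of that idea, assuming a simplified text-to-vision attention matrix; `prune_visual_tokens` and all shapes here are hypothetical, not any paper's actual code:

```python
import numpy as np

# Hedged toy sketch of attention-score token pruning in the spirit of
# FastV: rank visual tokens by the attention they receive from the text
# tokens and keep only the top-k.
def prune_visual_tokens(attn: np.ndarray, keep: int) -> np.ndarray:
    """attn: (num_text_tokens, num_visual_tokens) attention weights.
    Returns the indices of the `keep` highest-scoring visual tokens,
    sorted so the kept tokens stay in their original positional order."""
    scores = attn.mean(axis=0)        # average attention per visual token
    top = np.argsort(scores)[-keep:]  # indices of the top-k scores
    return np.sort(top)

rng = np.random.default_rng(0)
attn = rng.random((8, 576))                 # 8 text tokens attend to 576 visual tokens
kept = prune_visual_tokens(attn, keep=288)  # retain half, as in "1/2 tokens"
print(kept.shape)                           # (288,)
```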

How do LMMs work?

An LMM typically processes a pair of inputs, denoted as $(T, V)$, where $T$ is the text input and $V$ is the visual input, such as an image or a video. The text input is mapped to $N$ textual tokens $E_t=\{t_1, \dots, t_N\}$ using a text encoder. Similarly, the visual input is processed by a corresponding vision encoder: it takes the visual information $V$ as input and outputs image features, which are further converted into $M$ (generally $M \gg N$) vision tokens $E_v=\{v_1, \dots, v_M\}$ by a projector layer.

The textual tokens and visual tokens are then concatenated and fed to an LLM, which generates the prediction in an autoregressive manner. Specifically, $\hat N$ output tokens $Y=\{y_1,\dots, y_{\hat N}\}$ are generated as follows:

$$ P(y_1,\dots,y_{\hat N} \mid E_t, E_v)=\prod_{i=1}^{\hat N} P(y_i \mid y_{<i}, E_t, E_v) $$
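The factorization above can be sketched as a toy greedy-decoding loop. All names and shapes here are illustrative; the "LLM" is a stand-in mean-pool-plus-linear head, not a real model:

```python
import numpy as np

# Toy sketch of autoregressive generation over concatenated vision + text tokens.
rng = np.random.default_rng(0)
d, vocab = 16, 32
W = rng.standard_normal((d, vocab))    # stand-in language-model head
emb = rng.standard_normal((vocab, d))  # stand-in output-token embeddings

E_t = rng.standard_normal((4, d))      # N = 4 textual tokens t_1..t_N
E_v = rng.standard_normal((64, d))     # M = 64 vision tokens (M >> N)

def next_token(context: np.ndarray) -> int:
    """Greedy decoding step: score the vocabulary given all prior tokens."""
    logits = context.mean(axis=0) @ W
    return int(np.argmax(logits))

context = np.vstack([E_v, E_t])        # the LLM consumes vision + text tokens
tokens = []
for _ in range(5):                     # generate N_hat = 5 output tokens
    y = next_token(context)            # y_i ~ P(y_i | y_<i, E_t, E_v)
    tokens.append(y)
    context = np.vstack([context, emb[y][None]])  # feed y_i back in
print(tokens)
```

Token reduction methods shrink the $M$ rows contributed by `E_v`, which is where most of the sequence length (and hence most of the attention cost) comes from.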

TFLOP ratio

The TFLOP ratio is the TFLOPs of the model with pruned tokens relative to the original model's TFLOPs with no pruning:

$$ \frac{K \times (4\mu d^2 + 2\mu^2 d + 2\mu d m) + (T-K) \times (4\widetilde{\mu} d^2 + 2\widetilde{\mu}^2 d + 2\widetilde{\mu} d m)}{T \times (4\mu d^2 + 2\mu^2 d + 2\mu d m)} $$

where $T$ is the total number of transformer decoder layers and $K$ is the number of layers computed on the full (unpruned) sequence. $\mu = N+M$ is the total sequence length before pruning, $\widetilde{\mu}$ is the sequence length after pruning ($\widetilde{\mu} < \mu$), $d$ is the hidden state size of the layer, and $m$ is the intermediate size of the feed-forward network module.
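The ratio can be computed directly, assuming the common per-layer FLOPs estimate $4\mu d^2 + 2\mu^2 d + 2\mu d m$ (attention projections, attention maps, and FFN, as used by FastV); the example shapes below are illustrative, not measured:

```python
# Sketch of the TFLOP-ratio formula above.
def layer_flops(seq_len: int, d: int, m: int) -> int:
    """Approximate FLOPs of one transformer decoder layer."""
    return 4 * seq_len * d**2 + 2 * seq_len**2 * d + 2 * seq_len * d * m

def tflop_ratio(T: int, K: int, mu: int, mu_pruned: int, d: int, m: int) -> float:
    """FLOPs with pruning relative to FLOPs without: the first K layers
    see the full sequence mu, the remaining T-K layers see mu_pruned tokens."""
    pruned = K * layer_flops(mu, d, m) + (T - K) * layer_flops(mu_pruned, d, m)
    return pruned / (T * layer_flops(mu, d, m))

# Example with 7B-scale shapes: 32 layers, d=4096, m=11008; halve the 576
# visual tokens (plus 35 text tokens, an assumed prompt length) after layer 2.
ratio = tflop_ratio(T=32, K=2, mu=576 + 35, mu_pruned=288 + 35, d=4096, m=11008)
print(f"{ratio:.3f}")
```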

Comparison of performance and speed of different methods

Performance comparisons on LLaVA-1.5-7B (results as reported by VScan)

| Method | GQA | MMB | $MMB^{CN}$ | MME | POPE | $SQA^{IMG}$ | $VQA^{v2}$ | $VQA^{Text}$ | VizWiz | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Upper Bound, 576 Tokens (100%), 3,817 TFLOPs* | | | | | | | | | | |
| LLaVA-1.5-7B | 61.9 | 64.7 | 58.1 | 1862 | 85.9 | 69.5 | 78.5 | 58.2 | 50.0 | 100.0% |
| *Retain 192 Tokens on Average (↓ 66.6%), ~1,253 TFLOPs* | | | | | | | | | | |
| ToMe [7] | 54.3 | 60.5 | - | 1563 | 72.4 | 65.2 | 68.0 | 52.1 | - | 88.5% |
| FastV [12] | 52.7 | 61.2 | 57.0 | 1612 | 64.8 | 67.3 | 67.1 | 52.5 | 50.8 | 90.4% |
| SparseVLM [69] | 57.6 | 62.5 | 53.7 | 1721 | 83.6 | 69.1 | 75.6 | 56.1 | 50.5 | 96.1% |
| PyramidDrop [60] | 57.3 | 63.3 | 56.8 | 1797 | 82.3 | 69.0 | 75.1 | 56.5 | 51.1 | 97.2% |
| VisionZip | 59.3 | 63.0 | - | 1783 | 85.3 | 68.9 | 77.4 | 57.3 | - | 97.8% |
| VScan (Ours) | 60.6 | 63.9 | 57.4 | 1806 | 86.2 | 68.6 | 77.8 | 57.7 | 50.4 | 99.0% |
| *Retain 128 Tokens on Average (↓ 77.8%), ~833 TFLOPs* | | | | | | | | | | |
| ToMe | 52.4 | 53.3 | - | 1343 | 62.8 | 59.6 | 63.0 | 49.1 | - | 80.4% |
| FastV | 49.6 | 56.1 | 56.4 | 1490 | 59.6 | 60.2 | 61.8 | 50.6 | 51.3 | 85.4% |
| SparseVLM | 56.0 | 60.0 | 51.1 | 1696 | 80.5 | 67.1 | 73.8 | 54.9 | 51.4 | 93.7% |
| PyramidDrop | 57.1 | 61.6 | 56.6 | 1761 | 82.3 | 68.4 | 72.9 | 56.6 | 51.0 | 96.2% |
| VisionZip | 57.6 | 62.0 | - | 1763 | 83.2 | 68.9 | 75.6 | 56.8 | - | 96.2% |
| VScan (Ours) | 59.8 | 63.0 | 58.0 | 1792 | 86.1 | 68.9 | 77.1 | 57.3 | 51.7 | 98.8% |
| *Retain 64 Tokens on Average (↓ 88.9%), ~415 TFLOPs* | | | | | | | | | | |
| ToMe | 48.6 | 43.7 | - | 1138 | 52.5 | 50.0 | 57.1 | 45.3 | - | 70.1% |
| FastV | 46.1 | 48.0 | 52.7 | 1256 | 48.0 | 51.1 | 55.0 | 47.8 | 50.8 | 76.7% |
| SparseVLM | 52.7 | 56.2 | 46.1 | 1505 | 75.1 | 62.2 | 68.2 | 51.8 | 50.1 | 87.2% |
| PyramidDrop | 47.5 | 58.8 | 50.5 | 1561 | 55.9 | 69.2 | 69.2 | 50.6 | 50.7 | 86.6% |
| VisionZip | 55.1 | 60.1 | - | 1690 | 77.0 | 69.0 | 72.4 | 55.5 | - | 92.7% |
| VScan (Ours) | 58.3 | 62.1 | 55.7 | 1698 | 85.0 | 69.1 | 75.4 | 55.6 | 51.8 | 96.7% |

Comparison of models with different training settings (results as reported by LLaVA-Meteor; PT/IT = number of pre-training / instruction-tuning samples)

| Model | LLM | PT/IT | Token | $VQA^T$ | $VQA^D$ | $QA^C$ | $VQA^I$ | GQA | $VQA^{v2}$ | VizWiz | MMB | MMVet | MMMU | POPE | SEED | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MobileVLM V2 | Mobilellama-2.7B | 1.2M/3.6M | 144 | 52.1 | - | - | - | 59.3 | - | - | - | - | - | 84.3 | - | - |
| BLIP-2 | Vicuna-13B | 129M/- | 32 | 42.5 | - | - | - | 41.0 | 65.0 | 19.6 | - | - | - | 85.3 | 49.7 | - |
| Instruct-BLIP | Vicuna-7B | 129M/1.2M | 64 | 50.1 | - | - | - | 49.5 | 34.5 | - | 26.3 | - | - | - | - | - |
| QwenVL | Qwen-7B | 1.4B/50M | 256 | 63.8 | 65.1 | 65.7 | - | 59.3 | 78.8 | 35.2 | - | - | - | 62.3 | - | - |
| VILA | Llama2-7B | 50M/1M | 576 | 64.4 | - | 58.6 | - | 62.3 | 79.9 | 57.8 | 68.9 | 34.9 | - | 85.5 | - | - |
| MobileVLM V2 | Vicuna-7B | 1.2M/3.6M | 144 | 62.3 | - | - | - | 62.6 | - | - | - | - | - | 85.3 | - | - |
| Mini-Gemini | Vicuna-7B | 1.2M/1.5M | 576 | 65.9 | - | - | - | - | - | 68.5 | 46.0 | 38.1 | - | - | - | - |
| LLaVA-1.5 | Vicuna-7B | 558K/665K | 576 | 58.2 | 28.1 | - | 25.8 | 63.3 | 78.5 | 50.0 | 64.3 | 31.1 | 35.3 | 85.9 | 66.1 | - |
| TokenPacker | Vicuna-7B | 558K/665K | 144 | - | 26.9 | 18.1 | 21.8 | 61.9 | 77.9 | 52.0 | 65.1 | 33.0 | - | 87.0 | - | - |
| InternVL2 | Internlm2.5-7B | 558K/665K | 256 | 49.7 | - | - | - | 63.0 | 77.8 | 50.6 | 70.9 | 34.1 | 39.2 | 86.8 | 71.1 | 50.8 |
| *High-resolution LLMs* | | | | | | | | | | | | | | | | |
| Monkey | Qwen-7B | -/1.44M | ~1024 | 67.7 | 66.5 | 36.1 | - | 60.7 | 80.3 | 61.2 | - | - | - | - | - | - |
| TokenPacker-HD | Vicuna-7B | 1.2M/1.5M | ~954 | 68.0 | 60.2 | - | - | - | 81.2 | 54.7 | 67.4 | - | 35.4 | - | - | - |
| Mini-Gemini-HD | Vicuna-7B | 1.2M/1.5M | 2880 | 68.4 | 65.0 | - | - | - | 80.3 | 54.6 | 65.8 | 41.3 | 36.8 | 86.8 | - | - |
| FastVITHD | Qwen-2-7B | 558K/1.1M | 256 | 64.4 | - | - | - | - | 63.1 | - | - | - | - | 88.1 | - | - |
| LLaVA-UHD | Vicuna-13B | 595K/665K | ~256 | 67.7 | 62.6 | 56.3 | 36.8 | 63.8 | 81.7 | 56.1 | 68.0 | 42.1 | 35.5 | 89.1 | 65.6 | 60.4 |
| LLaVA-NeXT | Vicuna-7B | 558K/765K | ~2880 | 64.9 | 74.4 | 54.8 | 37.1 | 64.2 | 81.8 | 57.6 | 68.1 | 43.9 | 35.8 | 86.5 | 68.2 | 61.4 |
| InternVL2-HD | Internlm2.5-7B | 558K/770K | ~1282 | 65.6 | 72.6 | 69.8 | 30.9 | 63.2 | 78.9 | 56.3 | 72.1 | 35.7 | 39.9 | 87.3 | 73.4 | 62.1 |
| *Ours* | | | | | | | | | | | | | | | | |
| LLaVA-Meteor | Vicuna-13B | 595K/665K | ~256 (100%) | 69.9 (+2.2) | 64.2 (+1.6) | 59.0 | 39.2 (+2.4) | 64.9 (+1.1) | 82.4 (+0.7) | 59.3 (+3.2) | 69.4 (+1.4) | 44.7 (+2.6) | 37.5 (+2.0) | 89.9 (+0.8) | 67.7 (+2.1) | 62.4 (+2.0) |
| LLaVA-Meteor | Vicuna-13B | 595K/665K | ~114 (44.5%) | 68.3 (+0.6) | 63.1 (+0.5) | 58.6 (+2.3) | 37.7 | 64.6 (+0.8) | 81.8 (+0.1) | 57.1 (+1.0) | 68.4 (+0.4) | 42.7 (+0.6) | 34.6 (-0.8) | 88.7 (-0.5) | 66.9 (+1.3) | 61.0 (+0.6) |
| LLaVA-Meteor | Vicuna-13B | 595K/665K | ~56 (21.8%) | 65.0 (-2.7) | 58.4 (-4.2) | 56.5 (+0.2) | 37.1 (+0.3) | 62.4 (-1.4) | 81.2 (-0.5) | 55.3 | 68.0 (+0.0) | 41.6 (-0.5) | 34.2 (-1.3) | 87.2 (-1.9) | 64.8 (-0.8) | 59.3 (-1.1) |

For the LLaVA-Meteor rows, values in parentheses are differences relative to LLaVA-UHD, and the percentage in the Token column is the retained token count relative to LLaVA-UHD's ~256 tokens.
