Vision-Token-Reduction-Survey

A Comprehensive Survey and Resource Collection of Vision Token Reduction Techniques for Multimodal Large Models

📌 Project Overview

With the explosive growth of multimodal large models (such as LLaVA, Flamingo, BLIP-2, and Qwen-VL), efficiently reducing the number of visual tokens has become a key technique for cutting computational cost and speeding up inference. This repository systematically collects, analyzes, and compares cutting-edge methods and advances in the field of visual token compression.

Introduction

Current multimodal large language models (MLLMs) typically consist of a visual encoder, a connector, and a large language model. In MLLMs, more visual tokens provide richer visual information and significantly improve model performance. However, due to the quadratic complexity of the transformer's self-attention, a large number of visual tokens results in significant computational and memory consumption.
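To make the quadratic-cost point concrete, here is a small back-of-the-envelope calculation (illustrative numbers, not from any particular paper): with a standard transformer layer, pruning visual tokens from 576 to 144 shrinks the quadratic attention-map cost 16×, while the linear projection cost shrinks only 4×.

```python
# Toy arithmetic (illustrative): only the n^2 attention-map term of a
# transformer layer shrinks quadratically when visual tokens are pruned;
# the QKV/output projection term shrinks linearly.
def proj_flops(n: int, d: int) -> int:
    return 4 * n * d * d          # QKV + output projections: O(n)

def attn_map_flops(n: int, d: int) -> int:
    return 2 * n * n * d          # QK^T and attention-weighted V: O(n^2)

d = 4096                          # hidden size of a 7B-scale LLM
# Dropping 576 -> 144 visual tokens: 4x fewer projection FLOPs...
print(proj_flops(576, d) // proj_flops(144, d))          # 4
# ...but 16x fewer attention-map FLOPs.
print(attn_map_flops(576, d) // attn_map_flops(144, d))  # 16
```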

Core Value: Enable researchers to quickly grasp the progress in the field.

🗂️ Repository structure

Vision-Token-Reduction-Survey
├── papers_summaries/ 
├── methods_comparison/ 
├── datasets/ 
├── tech_reports_blogposts/ 
├── resources/ 
├── CONTRIBUTING.md
└── README.md 
📄 Paper List

| Paper Title | One-sentence Abstract | Date | Venue |
| --- | --- | --- | --- |
| GreedyPrune: Retenting Critical Visual Token Set for Large Vision Language Models | We propose GreedyPrune, a training-free visual token pruning method that jointly optimizes semantic saliency and visual diversity through a combinatorial optimization framework, achieving state-of-the-art accuracy and reduced inference latency across multimodal tasks and models. | 202506 | arXiv (preprint) |
| SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration [PDF] | We propose SP-VLA, a unified framework for accelerating Vision-Language-Action (VLA) models through joint model scheduling and token pruning, effectively reducing both temporal redundancy in sequential action generation and spatial redundancy in visual input while maintaining high accuracy, achieving up to 1.5× acceleration with less than 3% accuracy drop across multiple tasks. | 202506 | arXiv (preprint) |
| Diversity-Guided MLP Reduction for Efficient Large Vision Transformers | This paper proposes a Diversity-Guided MLP Reduction (DGMR) method to significantly compress large vision transformers by pruning redundant neurons in MLP modules while preserving weight diversity, achieving over 57.0% parameter and FLOPs reduction with near-lossless performance across multiple state-of-the-art models, including a 71.5% reduction for EVA-CLIP-E without performance degradation. | 202506 | arXiv (preprint) |
| Learning Compact Vision Tokens for Efficient Large Multimodal Models [PDF] [Github] | This paper proposes a Spatial Token Fusion (STF) method and a Multi-Block Token Fusion (MBTF) module to reduce vision token sequences and enhance multi-granularity feature representation, achieving significant inference acceleration with minimal performance loss in large multimodal models. | 202506 | arXiv (preprint) |
| Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration [PDF] | This paper proposes a many-to-many Token Transforming framework for vision transformers, unifying existing token reduction methods into an explicit matrix transformation form, which minimizes information loss and enables training-free acceleration, achieving significant FLOPs reduction, inference speedup, and improved performance across various vision tasks such as segmentation, object detection, depth estimation, and language model generation. | 202506 | arXiv (preprint) |
| Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning [PDF] | This paper introduces LLaVA-Meteor, a novel visual instruction tuning framework that achieves significant visual token compression (75%–95%) and improved efficiency while maintaining or enhancing performance across 12 vision-language benchmarks through a Top-Down Compression paradigm, Flash Global Fusion module, and Visual-Native Selection mechanism. | 202505 | arXiv (preprint) |
| VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models [PDF] | This work proposes VScan, a two-stage visual token reduction framework for large vision-language models (LVLMs), achieving significant inference acceleration (2.91× speedup in prefilling, 10× FLOPs reduction) with minimal performance loss (95.4% retention) through complementary global/local token merging and intermediate-layer pruning. | 202505 | arXiv (preprint) |
| PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models [PDF] [Github] | We introduce PACT, a method that reduces inference time and memory usage in visual language models by pruning irrelevant tokens and merging visually redundant ones early in the model using a novel importance metric and Distance Bounded Density Peak Clustering. | 202504 | CVPR 2025 |
| Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs [PDF] [Github] | TRIM (Token Reduction using CLIP Metric) enhances the efficiency of Multimodal Large Language Models (MLLMs) by reducing image tokens without performance loss, validated across 12 datasets, advancing sustainable high-performance model development. | 202409 | COLING 2025 |
| TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model [PDF] | We propose TopV, a training-free token pruning method for Vision-Language Models that formulates pruning as an optimization problem using a visual-aware cost function, achieving efficient inference with reduced memory and computational cost while maintaining performance. | 202503 | CVPR 2025 |
| DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models [PDF] [Github] | We propose DivPrune, a token pruning method for Large Multimodal Models that formulates pruning as a Max-Min Diversity Problem to maximize diversity among selected visual tokens, achieving state-of-the-art accuracy with reduced latency and memory usage across 16 image- and video-language datasets. | 202503 | CVPR 2025 |
| InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression [PDF] [Github] | We propose InternVL-X, a vision-language model that improves performance and efficiency through three visual token compression techniques (PVTC, LVTC, and RVTC), enabling state-of-the-art results with significantly reduced computational cost by using 20% or fewer visual tokens. | 202503 | arXiv (preprint) |
| An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [PDF] [Github] | We propose FastV, a plug-and-play method for optimizing computational efficiency in Large Vision-Language Models (LVLMs) by learning adaptive attention patterns and pruning visual tokens, achieving significant reductions in FLOPs (e.g., 45% for LLaVA-1.5-13B) while maintaining strong performance across image and video understanding tasks, making it highly suitable for edge deployment and commercial applications. | 202403 | ECCV 2024 (Oral) |
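Several of the training-free methods above (FastV in particular) score visual tokens by the attention they receive and drop the low-scoring ones. A minimal sketch of that idea, assuming a simplified text-to-vision attention matrix; `prune_visual_tokens` and all shapes here are hypothetical, not any paper's actual code:

```python
import numpy as np

# Hedged toy sketch of attention-score token pruning in the spirit of
# FastV: rank visual tokens by the attention they receive from the text
# tokens and keep only the top-k.
def prune_visual_tokens(attn: np.ndarray, keep: int) -> np.ndarray:
    """attn: (num_text_tokens, num_visual_tokens) attention weights.
    Returns the indices of the `keep` highest-scoring visual tokens,
    sorted so the kept tokens stay in their original positional order."""
    scores = attn.mean(axis=0)        # average attention per visual token
    top = np.argsort(scores)[-keep:]  # indices of the top-k scores
    return np.sort(top)

rng = np.random.default_rng(0)
attn = rng.random((8, 576))                 # 8 text tokens attend to 576 visual tokens
kept = prune_visual_tokens(attn, keep=288)  # retain half, as in "1/2 tokens"
print(kept.shape)                           # (288,)
```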

How do LMMs work?

An LMM typically processes a pair of inputs, denoted as $(T, V)$, where $T$ is the text input and $V$ is the visual input, such as an image or a video. The text input is mapped to $N$ textual tokens $E_t=\{t_1, \dots, t_N\}$ using a text encoder. Similarly, the visual input is processed by a corresponding vision encoder: it takes the visual information $V$ as input and outputs image features, which are further converted into $M$ (generally $M \gg N$) vision tokens $E_v=\{v_1, \dots, v_M\}$ by a projector layer.

The textual tokens and visual tokens are then concatenated and fed to an LLM, which generates the prediction in an autoregressive manner. Specifically, $\hat N$ output tokens $Y=\{y_1,\dots, y_{\hat N}\}$ are generated as follows:

$$ P(y_1,\dots,y_{\hat N} \mid E_t, E_v)=\prod_{i=1}^{\hat N} P(y_i \mid y_{<i}, E_t, E_v) $$
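The factorization above can be sketched as a toy greedy-decoding loop. All names and shapes here are illustrative; the "LLM" is a stand-in mean-pool-plus-linear head, not a real model:

```python
import numpy as np

# Toy sketch of autoregressive generation over concatenated vision + text tokens.
rng = np.random.default_rng(0)
d, vocab = 16, 32
W = rng.standard_normal((d, vocab))    # stand-in language-model head
emb = rng.standard_normal((vocab, d))  # stand-in output-token embeddings

E_t = rng.standard_normal((4, d))      # N = 4 textual tokens t_1..t_N
E_v = rng.standard_normal((64, d))     # M = 64 vision tokens (M >> N)

def next_token(context: np.ndarray) -> int:
    """Greedy decoding step: score the vocabulary given all prior tokens."""
    logits = context.mean(axis=0) @ W
    return int(np.argmax(logits))

context = np.vstack([E_v, E_t])        # the LLM consumes vision + text tokens
tokens = []
for _ in range(5):                     # generate N_hat = 5 output tokens
    y = next_token(context)            # y_i ~ P(y_i | y_<i, E_t, E_v)
    tokens.append(y)
    context = np.vstack([context, emb[y][None]])  # feed y_i back in
print(tokens)
```

Token reduction methods shrink the $M$ rows contributed by `E_v`, which is where most of the sequence length (and hence most of the attention cost) comes from.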

TFLOP ratio

The TFLOP ratio is the TFLOPs of the model with pruned tokens relative to the original model's TFLOPs with no pruning:

$$ \frac{K \times (4\mu d^2 + 2\mu^2 d + 2\mu d m) + (T-K) \times (4\widetilde{\mu} d^2 + 2\widetilde{\mu}^2 d + 2\widetilde{\mu} d m)}{T \times (4\mu d^2 + 2\mu^2 d + 2\mu d m)} $$

where $T$ is the total number of transformer decoder layers and $K$ is the number of layers computed on the full (unpruned) sequence. $\mu = N+M$ is the total sequence length before pruning, $\widetilde{\mu}$ is the sequence length after pruning ($\widetilde{\mu} < \mu$), $d$ is the hidden state size of the layer, and $m$ is the intermediate size of the feed-forward network module.
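The ratio can be computed directly, assuming the common per-layer FLOPs estimate $4\mu d^2 + 2\mu^2 d + 2\mu d m$ (attention projections, attention maps, and FFN, as used by FastV); the example shapes below are illustrative, not measured:

```python
# Sketch of the TFLOP-ratio formula above.
def layer_flops(seq_len: int, d: int, m: int) -> int:
    """Approximate FLOPs of one transformer decoder layer."""
    return 4 * seq_len * d**2 + 2 * seq_len**2 * d + 2 * seq_len * d * m

def tflop_ratio(T: int, K: int, mu: int, mu_pruned: int, d: int, m: int) -> float:
    """FLOPs with pruning relative to FLOPs without: the first K layers
    see the full sequence mu, the remaining T-K layers see mu_pruned tokens."""
    pruned = K * layer_flops(mu, d, m) + (T - K) * layer_flops(mu_pruned, d, m)
    return pruned / (T * layer_flops(mu, d, m))

# Example with 7B-scale shapes: 32 layers, d=4096, m=11008; halve the 576
# visual tokens (plus 35 text tokens, an assumed prompt length) after layer 2.
ratio = tflop_ratio(T=32, K=2, mu=576 + 35, mu_pruned=288 + 35, d=4096, m=11008)
print(f"{ratio:.3f}")
```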

Comparison of performance and speed of different methods

Performance comparisons on LLaVA-1.5-7B (results as reported by VScan)

| Method | GQA | MMB | $MMB^{CN}$ | MME | POPE | $SQA^{IMG}$ | $VQA^{v2}$ | $VQA^{Text}$ | VizWiz | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Upper Bound, 576 Tokens (100%), 3,817 TFLOPs* | | | | | | | | | | |
| LLaVA-1.5-7B | 61.9 | 64.7 | 58.1 | 1862 | 85.9 | 69.5 | 78.5 | 58.2 | 50.0 | 100.0% |
| *Retain 192 Tokens on Average (↓ 66.6%), ~1,253 TFLOPs* | | | | | | | | | | |
| ToMe [7] | 54.3 | 60.5 | - | 1563 | 72.4 | 65.2 | 68.0 | 52.1 | - | 88.5% |
| FastV [12] | 52.7 | 61.2 | 57.0 | 1612 | 64.8 | 67.3 | 67.1 | 52.5 | 50.8 | 90.4% |
| SparseVLM [69] | 57.6 | 62.5 | 53.7 | 1721 | 83.6 | 69.1 | 75.6 | 56.1 | 50.5 | 96.1% |
| PyramidDrop [60] | 57.3 | 63.3 | 56.8 | 1797 | 82.3 | 69.0 | 75.1 | 56.5 | 51.1 | 97.2% |
| VisionZip | 59.3 | 63.0 | - | 1783 | 85.3 | 68.9 | 77.4 | 57.3 | - | 97.8% |
| VScan (Ours) | 60.6 | 63.9 | 57.4 | 1806 | 86.2 | 68.6 | 77.8 | 57.7 | 50.4 | 99.0% |
| *Retain 128 Tokens on Average (↓ 77.8%), ~833 TFLOPs* | | | | | | | | | | |
| ToMe | 52.4 | 53.3 | - | 1343 | 62.8 | 59.6 | 63.0 | 49.1 | - | 80.4% |
| FastV | 49.6 | 56.1 | 56.4 | 1490 | 59.6 | 60.2 | 61.8 | 50.6 | 51.3 | 85.4% |
| SparseVLM | 56.0 | 60.0 | 51.1 | 1696 | 80.5 | 67.1 | 73.8 | 54.9 | 51.4 | 93.7% |
| PyramidDrop | 57.1 | 61.6 | 56.6 | 1761 | 82.3 | 68.4 | 72.9 | 56.6 | 51.0 | 96.2% |
| VisionZip | 57.6 | 62.0 | - | 1763 | 83.2 | 68.9 | 75.6 | 56.8 | - | 96.2% |
| VScan (Ours) | 59.8 | 63.0 | 58.0 | 1792 | 86.1 | 68.9 | 77.1 | 57.3 | 51.7 | 98.8% |
| *Retain 64 Tokens on Average (↓ 88.9%), ~415 TFLOPs* | | | | | | | | | | |
| ToMe | 48.6 | 43.7 | - | 1138 | 52.5 | 50.0 | 57.1 | 45.3 | - | 70.1% |
| FastV | 46.1 | 48.0 | 52.7 | 1256 | 48.0 | 51.1 | 55.0 | 47.8 | 50.8 | 76.7% |
| SparseVLM | 52.7 | 56.2 | 46.1 | 1505 | 75.1 | 62.2 | 68.2 | 51.8 | 50.1 | 87.2% |
| PyramidDrop | 47.5 | 58.8 | 50.5 | 1561 | 55.9 | 69.2 | 69.2 | 50.6 | 50.7 | 86.6% |
| VisionZip | 55.1 | 60.1 | - | 1690 | 77.0 | 69.0 | 72.4 | 55.5 | - | 92.7% |
| VScan (Ours) | 58.3 | 62.1 | 55.7 | 1698 | 85.0 | 69.1 | 75.4 | 55.6 | 51.8 | 96.7% |

Comparison of models with different training settings (results as reported by LLaVA-Meteor; PT/IT = number of pre-training / instruction-tuning samples)

| Model | LLM | PT/IT | Token | $VQA^T$ | $VQA^D$ | $QA^C$ | $VQA^I$ | GQA | $VQA^{v2}$ | VizWiz | MMB | MMVet | MMMU | POPE | SEED | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MobileVLM V2 | Mobilellama-2.7B | 1.2M/3.6M | 144 | 52.1 | - | - | - | 59.3 | - | - | - | - | - | 84.3 | - | - |
| BLIP-2 | Vicuna-13B | 129M/- | 32 | 42.5 | - | - | - | 41.0 | 65.0 | 19.6 | - | - | - | 85.3 | 49.7 | - |
| Instruct-BLIP | Vicuna-7B | 129M/1.2M | 64 | 50.1 | - | - | - | 49.5 | 34.5 | - | 26.3 | - | - | - | - | - |
| QwenVL | Qwen-7B | 1.4B/50M | 256 | 63.8 | 65.1 | 65.7 | - | 59.3 | 78.8 | 35.2 | - | - | - | 62.3 | - | - |
| VILA | Llama2-7B | 50M/1M | 576 | 64.4 | - | 58.6 | - | 62.3 | 79.9 | 57.8 | 68.9 | 34.9 | - | 85.5 | - | - |
| MobileVLM V2 | Vicuna-7B | 1.2M/3.6M | 144 | 62.3 | - | - | - | 62.6 | - | - | - | - | - | 85.3 | - | - |
| Mini-Gemini | Vicuna-7B | 1.2M/1.5M | 576 | 65.9 | - | - | - | - | - | 68.5 | 46.0 | 38.1 | - | - | - | - |
| LLaVA-1.5 | Vicuna-7B | 558K/665K | 576 | 58.2 | 28.1 | - | 25.8 | 63.3 | 78.5 | 50.0 | 64.3 | 31.1 | 35.3 | 85.9 | 66.1 | - |
| TokenPacker | Vicuna-7B | 558K/665K | 144 | - | 26.9 | 18.1 | 21.8 | 61.9 | 77.9 | 52.0 | 65.1 | 33.0 | - | 87.0 | - | - |
| InternVL2 | Internlm2.5-7B | 558K/665K | 256 | 49.7 | - | - | - | 63.0 | 77.8 | 50.6 | 70.9 | 34.1 | 39.2 | 86.8 | 71.1 | 50.8 |
| *High-resolution LLMs* | | | | | | | | | | | | | | | | |
| Monkey | Qwen-7B | -/1.44M | ~1024 | 67.7 | 66.5 | 36.1 | - | 60.7 | 80.3 | 61.2 | - | - | - | - | - | - |
| TokenPacker-HD | Vicuna-7B | 1.2M/1.5M | ~954 | 68.0 | 60.2 | - | - | - | 81.2 | 54.7 | 67.4 | - | 35.4 | - | - | - |
| Mini-Gemini-HD | Vicuna-7B | 1.2M/1.5M | 2880 | 68.4 | 65.0 | - | - | - | 80.3 | 54.6 | 65.8 | 41.3 | 36.8 | 86.8 | - | - |
| FastVITHD | Qwen-2-7B | 558K/1.1M | 256 | 64.4 | - | - | - | - | 63.1 | - | - | - | - | 88.1 | - | - |
| LLaVA-UHD | Vicuna-13B | 595K/665K | ~256 | 67.7 | 62.6 | 56.3 | 36.8 | 63.8 | 81.7 | 56.1 | 68.0 | 42.1 | 35.5 | 89.1 | 65.6 | 60.4 |
| LLaVA-NeXT | Vicuna-7B | 558K/765K | ~2880 | 64.9 | 74.4 | 54.8 | 37.1 | 64.2 | 81.8 | 57.6 | 68.1 | 43.9 | 35.8 | 86.5 | 68.2 | 61.4 |
| InternVL2-HD | Internlm2.5-7B | 558K/770K | ~1282 | 65.6 | 72.6 | 69.8 | 30.9 | 63.2 | 78.9 | 56.3 | 72.1 | 35.7 | 39.9 | 87.3 | 73.4 | 62.1 |
| *Ours* | | | | | | | | | | | | | | | | |
| LLaVA-Meteor | Vicuna-13B | 595K/665K | ~256 (100%) | 69.9 (+2.2) | 64.2 (+1.6) | 59.0 | 39.2 (+2.4) | 64.9 (+1.1) | 82.4 (+0.7) | 59.3 (+3.2) | 69.4 (+1.4) | 44.7 (+2.6) | 37.5 (+2.0) | 89.9 (+0.8) | 67.7 (+2.1) | 62.4 (+2.0) |
| LLaVA-Meteor | Vicuna-13B | 595K/665K | ~114 (44.5%) | 68.3 (+0.6) | 63.1 (+0.5) | 58.6 (+2.3) | 37.7 | 64.6 (+0.8) | 81.8 (+0.1) | 57.1 (+1.0) | 68.4 (+0.4) | 42.7 (+0.6) | 34.6 (-0.8) | 88.7 (-0.5) | 66.9 (+1.3) | 61.0 (+0.6) |
| LLaVA-Meteor | Vicuna-13B | 595K/665K | ~56 (21.8%) | 65.0 (-2.7) | 58.4 (-4.2) | 56.5 (+0.2) | 37.1 (+0.3) | 62.4 (-1.4) | 81.2 (-0.5) | 55.3 | 68.0 (+0.0) | 41.6 (-0.5) | 34.2 (-1.3) | 87.2 (-1.9) | 64.8 (-0.8) | 59.3 (-1.1) |

For the LLaVA-Meteor rows, values in parentheses are differences relative to LLaVA-UHD, and the percentage in the Token column is the retained token count relative to LLaVA-UHD's ~256 tokens.
