A Comprehensive Survey and Resource Collection of Vision Token Reduction Techniques for Multimodal Large Models
With the explosive growth of multimodal large models (such as LLaVA, Flamingo, BLIP-2, Qwen-VL), efficient reduction of visual tokens has become a key technology for reducing computational costs and enhancing inference speed. This repository systematically collects, analyzes, and compares the cutting-edge methods and advancements in the field of visual token compression.
Current multimodal large language models (MLLMs) consist of a visual encoder, a connector, and a large language model. In MLLMs, more visual tokens provide richer visual information and significantly improve model performance. However, because self-attention in the transformer scales quadratically with sequence length, a large number of visual tokens results in significant computational and memory costs.
Core Value: Enable researchers to quickly grasp the progress in the field.
Vision-Token-Reduction-Survey
├── papers_summaries/
├── methods_comparison/
├── datasets/
├── tech_reports_blogposts/
├── resources/
├── CONTRIBUTING.md
└── README.md
| Paper Title | One-sentence Abstract | Training-Free | Date | Conference |
|---|---|---|---|---|
| GreedyPrune: Retenting Critical Visual Token Set for Large Vision Language Models | We propose GreedyPrune, a training-free visual token pruning method that jointly optimizes semantic saliency and visual diversity through a combinatorial optimization framework, achieving state-of-the-art accuracy and reduced inference latency across multimodal tasks and models. | ✔ | 202506 | arXiv (preprint) |
| SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration [PDF] | We propose SP-VLA, a unified framework for accelerating Vision-Language-Action (VLA) models through joint model scheduling and token pruning, effectively reducing both temporal redundancy in sequential action generation and spatial redundancy in visual input while maintaining high accuracy, achieving up to 1.5× acceleration with less than 3% accuracy drop across multiple tasks. | ✗ | 202506 | arXiv (preprint) |
| Diversity-Guided MLP Reduction for Efficient Large Vision Transformers | This paper proposes a Diversity-Guided MLP Reduction (DGMR) method to significantly compress large vision transformers by pruning redundant neurons in MLP modules while preserving weight diversity, achieving over 57.0% parameter and FLOPs reduction with near-lossless performance across multiple state-of-the-art models, including a 71.5% reduction for EVA-CLIP-E without performance degradation. | ✗ | 202506 | arXiv (preprint) |
| Learning Compact Vision Tokens for Efficient Large Multimodal Models [PDF] [Github] | This paper proposes a Spatial Token Fusion (STF) method and a Multi-Block Token Fusion (MBTF) module to reduce vision token sequences and enhance multi-granularity feature representation, achieving significant inference acceleration with minimal performance loss in large multimodal models. | ✗ | 202506 | arXiv (preprint) |
| Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration [PDF] | This paper proposes a many-to-many Token Transforming framework for vision transformers, unifying existing token reduction methods into an explicit matrix transformation form, which minimizes information loss and enables training-free acceleration, achieving significant FLOPs reduction, inference speedup, and improved performance across various vision tasks such as segmentation, object detection, depth estimation, and language model generation. | ✔ | 202506 | arXiv (preprint) |
| Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning [PDF] | This paper introduces LLaVA-Meteor, a novel visual instruction tuning framework that achieves significant visual token compression (75%–95%) and improved efficiency while maintaining or enhancing performance across 12 vision-language benchmarks through a Top-Down Compression paradigm, Flash Global Fusion module, and Visual-Native Selection mechanism. | ✗ | 202505 | arXiv (preprint) |
| VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models [PDF] | This work proposes VScan, a two-stage visual token reduction framework for large vision-language models (LVLMs), achieving significant inference acceleration (2.91× speedup in prefilling, 10× FLOPs reduction) with minimal performance loss (95.4% retention) through complementary global/local token merging and intermediate-layer pruning. | ✔ | 202505 | arXiv (preprint) |
| PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models [PDF] [Github] | We introduce PACT, a method that reduces inference time and memory usage in visual language models by pruning irrelevant tokens and merging visually redundant ones early in the model using a novel importance metric and Distance Bounded Density Peak Clustering. | ✔ | 202504 | CVPR 2025 |
| Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs [PDF] [Github] | TRIM (Token Reduction using CLIP Metric) enhances Multimodal Large Language Models (MLLMs) efficiency by reducing image tokens without performance loss, validated across 12 datasets, advancing sustainable high-performance model development. | ✗ | 202409 | COLING 2025 |
| TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model [PDF] | We propose TopV, a training-free token pruning method for Vision-Language Models that formulates pruning as an optimization problem using a visual-aware cost function, achieving efficient inference with reduced memory and computational cost while maintaining performance. | ✔ | 202503 | CVPR 2025 |
| DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models [PDF][Github] | We propose DivPrune, a token pruning method for Large Multimodal Models that formulates pruning as a Max-Min Diversity Problem to maximize diversity among selected visual tokens, achieving state-of-the-art accuracy with reduced latency and memory usage across 16 image- and video-language datasets. | ✔ | 202503 | CVPR 2025 |
| InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression [PDF][Github] | We propose InternVL-X, a vision-language model that improves performance and efficiency through three visual token compression techniques—PVTC, LVTC, and RVTC—enabling state-of-the-art results with significantly reduced computational cost by using 20% or fewer visual tokens. | ✗ | 202503 | arXiv (preprint) |
| An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [PDF] [Github] | We propose FastV, a plug-and-play method for optimizing computational efficiency in Large Vision-Language Models (LVLMs) by learning adaptive attention patterns and pruning visual tokens, achieving significant reductions in FLOPs (e.g., 45% for LLaVA-1.5-13B) while maintaining strong performance across image and video understanding tasks, making it highly suitable for edge deployment and commercial applications. | ✔ | 202403 | ECCV 2024 (Oral) |
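
Two selection ideas recur in the table above: ranking visual tokens by the attention they receive (the kind of criterion FastV is described as using) and choosing a maximally diverse subset (the Max-Min Diversity framing of DivPrune). The sketch below illustrates both in simplified form. It is not the authors' implementation: the function names, the `attn_to_visual` input, and the greedy farthest-point heuristic are assumptions made for illustration.

```python
import numpy as np


def prune_by_attention(visual_tokens, attn_to_visual, keep_ratio=0.5):
    """Keep the visual tokens that receive the most attention.

    visual_tokens:  (M, d) array of visual token embeddings.
    attn_to_visual: (M,) array with the average attention each visual token
                    receives (hypothetical input; how it is collected depends
                    on the model and layer).
    """
    num_keep = max(1, int(round(len(visual_tokens) * keep_ratio)))
    # Take the highest-scoring indices, then restore the original token order.
    keep = np.sort(np.argsort(attn_to_visual)[-num_keep:])
    return visual_tokens[keep], keep


def select_diverse(visual_tokens, num_keep):
    """Greedy max-min selection: repeatedly add the token that is farthest
    (in cosine distance) from everything already selected."""
    normed = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    dist = 1.0 - normed @ normed.T      # pairwise cosine distance, shape (M, M)
    selected = [0]                      # arbitrary seed token (assumption)
    min_dist = dist[0].copy()           # distance of each token to the selected set
    while len(selected) < num_keep:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])
    return np.sort(np.array(selected))


rng = np.random.default_rng(0)
tokens = rng.normal(size=(576, 64))     # 576 visual tokens, toy embedding size
scores = rng.random(576)                # stand-in attention scores
kept, kept_idx = prune_by_attention(tokens, scores, keep_ratio=0.5)
diverse_idx = select_diverse(tokens, num_keep=64)
print(kept.shape, diverse_idx.shape)
```

Attention ranking is cheap but can keep many near-duplicate tokens; the diversity criterion avoids that at the price of a pairwise distance matrix, which is quadratic in the number of visual tokens.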
An LMM typically processes a pair of inputs, denoted as $(T, V)$, where $T$ is the text input and $V$ is the visual input, such as an image or video. The text input is mapped to $N$ textual tokens $E_t = \{t_1, \dots, t_N\}$ using a text encoder. Similarly, the visual input is processed by a corresponding vision encoder: it takes the visual information $V$ as input and outputs image features, which are further converted to $M$ vision tokens $E_v = \{v_1, \dots, v_M\}$ (generally $M \gg N$) using a projector layer.
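
To give a feel for the size of $M$, the sketch below counts patch tokens for a ViT-style vision encoder. It assumes a square input split into non-overlapping square patches and ignores any class token; the function name is hypothetical. With a 336-pixel input and 14-pixel patches (the configuration of the CLIP ViT-L/14 encoder used by LLaVA-1.5) it gives the 576 tokens quoted for LLaVA-1.5 in the comparison table.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of visual tokens for a square image cut into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size     # patches along one edge
    return side * side


print(num_patch_tokens(336, 14))  # 24 * 24 = 576
```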
The textual and visual tokens are then combined and fed to an LLM, which generates the prediction in an autoregressive manner.
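Autoregressive generation of this kind is commonly written as a product of next-token probabilities. The form below is a standard factorization, not a formula recovered from this document; the output sequence $Y = \{y_1, \dots, y_K\}$ and its length $K$ are symbols introduced here for illustration.

$$
p(Y \mid E_v, E_t) = \prod_{j=1}^{K} p\left(y_j \mid E_v, E_t, y_{1:j-1}\right)
$$

Every generated token conditions on all $M$ vision tokens in $E_v$, which is why reducing $M$ lowers the cost of both prefilling and each decoding step.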
The TFLOP ratio is the TFLOPs of the model with pruned tokens divided by the TFLOPs of the original model with no pruning.
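
A rough way to estimate this ratio is sketched below. It relies on a commonly used per-layer approximation for a transformer decoder, $4nd^2 + 2n^2d + 2ndm$, where $n$ is the sequence length, $d$ the hidden size, and $m$ the FFN intermediate size. That formula, the function names, and the default sizes (4096 and 11008, as in a LLaMA-7B-style decoder) are assumptions for illustration, not values defined in this document. The sketch also assumes pruning happens before the first layer, so the layer count cancels; methods that prune at intermediate layers would need a per-layer sum instead.

```python
def layer_flops(n: int, d: int, m: int) -> int:
    """Approximate FLOPs of one decoder layer for a sequence of n tokens.

    4*n*d^2 : query/key/value/output projections
    2*n^2*d : attention scores and the weighted sum over values
    2*n*d*m : feed-forward network (assumed estimate)
    """
    return 4 * n * d**2 + 2 * n**2 * d + 2 * n * d * m


def tflop_ratio(n_full: int, n_pruned: int, d: int = 4096, m: int = 11008) -> float:
    """FLOPs with pruned tokens relative to FLOPs with all tokens."""
    return layer_flops(n_pruned, d, m) / layer_flops(n_full, d, m)


# Keeping 192 of 576 tokens (text tokens ignored for simplicity)
print(f"{tflop_ratio(576, 192):.3f}")
```

Because the $n^2$ term shrinks faster than the linear terms, the ratio for this setting comes out slightly below the one third of tokens that are kept.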
| Method | Venue | GQA | MMB | MMB-CN | MME | POPE | SQA-IMG | VQAv2 | VQA-Text | VizWiz | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Upper Bound, 576 Tokens (100%), 3,817 TFLOPs | | | | | | | | | | | |
| LLaVA-1.5-7B | | 61.9 | 64.7 | 58.1 | 1862 | 85.9 | 69.5 | 78.5 | 58.2 | 50.0 | 100.0% |
| Retain 192 Tokens in Average (↓ 66.6%), ~1,253 TFLOPs | | | | | | | | | | | |
| ToMe [7] | | 54.3 | 60.5 | - | 1563 | 72.4 | 65.2 | 68.0 | 52.1 | - | 88.5% |
| FastV [12] | | 52.7 | 61.2 | 57.0 | 1612 | 64.8 | 67.3 | 67.1 | 52.5 | 50.8 | 90.4% |
| SparseVLM [69] | | 57.6 | 62.5 | 53.7 | 1721 | 83.6 | 69.1 | 75.6 | 56.1 | 50.5 | 96.1% |
| PyramidDrop [60] | | 57.3 | 63.3 | 56.8 | 1797 | 82.3 | 69.0 | 75.1 | 56.5 | 51.1 | 97.2% |
| VisionZip | | 59.3 | 63.0 | - | 1783 | 85.3 | 68.9 | 77.4 | 57.3 | - | 97.8% |
| VScan (Ours) | - | 60.6 | 63.9 | 57.4 | 1806 | 86.2 | 68.6 | 77.8 | 57.7 | 50.4 | 99.0% |
| Retain 128 Tokens in Average (↓ 77.8%), ~833 TFLOPs | | | | | | | | | | | |
| ToMe | | 52.4 | 53.3 | - | 1343 | 62.8 | 59.6 | 63.0 | 49.1 | - | 80.4% |
| FastV | | 49.6 | 56.1 | 56.4 | 1490 | 59.6 | 60.2 | 61.8 | 50.6 | 51.3 | 85.4% |
| SparseVLM | | 56.0 | 60.0 | 51.1 | 1696 | 80.5 | 67.1 | 73.8 | 54.9 | 51.4 | 93.7% |
| PyramidDrop | | 57.1 | 61.6 | 56.6 | 1761 | 82.3 | 68.4 | 72.9 | 56.6 | 51.0 | 96.2% |
| VisionZip | | 57.6 | 62.0 | - | 1763 | 83.2 | 68.9 | 75.6 | 56.8 | - | 96.2% |
| VScan (Ours) | - | 59.8 | 63.0 | 58.0 | 1792 | 86.1 | 68.9 | 77.1 | 57.3 | 51.7 | 98.8% |
| Retain 64 Tokens in Average (↓ 88.9%), ~415 TFLOPs | | | | | | | | | | | |
| ToMe | | 48.6 | 43.7 | - | 1138 | 52.5 | 50.0 | 57.1 | 45.3 | - | 70.1% |
| FastV | | 46.1 | 48.0 | 52.7 | 1256 | 48.0 | 51.1 | 55.0 | 47.8 | 50.8 | 76.7% |
| SparseVLM | | 52.7 | 56.2 | 46.1 | 1505 | 75.1 | 62.2 | 68.2 | 51.8 | 50.1 | 87.2% |
| PyramidDrop | | 47.5 | 58.8 | 50.5 | 1561 | 55.9 | 69.2 | 69.2 | 50.6 | 50.7 | 86.6% |
| VisionZip | | 55.1 | 60.1 | - | 1690 | 77.0 | 69.0 | 72.4 | 55.5 | - | 92.7% |
| VScan (Ours) | - | 58.3 | 62.1 | 55.7 | 1698 | 85.0 | 69.1 | 75.4 | 55.6 | 51.8 | 96.7% |
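
The Average column is a relative score. One plausible reading, assumed here rather than stated in this document, is the mean over benchmarks of each score divided by the unpruned LLaVA-1.5-7B score. The sketch below applies that reading to the VScan row at 192 retained tokens; the function name is hypothetical, and the numbers are copied from the table.

```python
def relative_average(scores, baseline):
    """Mean of per-benchmark score / baseline score, as a percentage."""
    ratios = [s / b for s, b in zip(scores, baseline)]
    return 100.0 * sum(ratios) / len(ratios)


# Order: GQA, MMB, MMB-CN, MME, POPE, SQA-IMG, VQAv2, VQA-Text, VizWiz
baseline = [61.9, 64.7, 58.1, 1862, 85.9, 69.5, 78.5, 58.2, 50.0]
vscan_192 = [60.6, 63.9, 57.4, 1806, 86.2, 68.6, 77.8, 57.7, 50.4]

print(f"{relative_average(vscan_192, baseline):.1f}%")
```

Under this reading the result is about 99.0%, in line with the value listed for that row. Normalizing per benchmark keeps MME, whose scores are in the thousands, from dominating the mean.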
| Model | LLM | PT/IT | Token | TextVQA | DocVQA | ChartQA | InfoVQA | GQA | VQAv2 | VizWiz | MMB | MMVet | MMMU | POPE | SEED | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MobileVLM V2 | Mobilellama-2.7B | 1.2M/3.6M | 144 | 52.1 | - | - | - | 59.3 | - | - | - | - | - | 84.3 | - | - |
| BLIP-2 | Vicuna-13B | 129M/- | 32 | 42.5 | - | - | - | 41.0 | 65.0 | 19.6 | - | - | - | 85.3 | 49.7 | - |
| Instruct-BLIP | Vicuna-7B | 129M/1.2M | 64 | 50.1 | - | - | - | 49.5 | 34.5 | - | 26.3 | - | - | - | - | - |
| QwenVL | Qwen-7B | 1.4B/50M | 256 | 63.8 | 65.1 | 65.7 | - | 59.3 | 78.8 | 35.2 | - | - | - | 62.3 | - | - |
| VILA | Llama2-7B | 50M/1M | 576 | 64.4 | - | 58.6 | - | 62.3 | 79.9 | 57.8 | 68.9 | 34.9 | - | 85.5 | - | - |
| MobileVLM V2 | Vicuna-7B | 1.2M/3.6M | 144 | 62.3 | - | - | - | 62.6 | - | - | - | - | - | 85.3 | - | - |
| Mini-Gemini | Vicuna-7B | 1.2M/1.5M | 576 | 65.9 | - | - | - | - | - | 68.5 | 46.0 | 38.1 | - | - | - | - |
| LLaVA-1.5 | Vicuna-7B | 558K/665K | 576 | 58.2 | 28.1 | - | 25.8 | 63.3 | 78.5 | 50.0 | 64.3 | 31.1 | 35.3 | 85.9 | 66.1 | - |
| TokenPacker | Vicuna-7B | 558K/665K | 144 | - | 26.9 | 18.1 | 21.8 | 61.9 | 77.9 | 52.0 | 65.1 | 33.0 | - | 87.0 | - | - |
| InternVL2 | Internlm2.5-7B | 558K/665K | 256 | 49.7 | - | - | - | 63.0 | 77.8 | 50.6 | 70.9 | 34.1 | 39.2 | 86.8 | 71.1 | 50.8 |
| High-resolution LLMs | ||||||||||||||||
| Monkey | Qwen-7B | -/1.44M | ~1024 | 67.7 | 66.5 | 36.1 | - | 60.7 | 80.3 | 61.2 | - | - | - | - | - | - |
| TokenPacker-HD | Vicuna-7B | 1.2M/1.5M | ~954 | 68.0 | 60.2 | - | - | - | 81.2 | 54.7 | 67.4 | - | 35.4 | - | - | - |
| Mini-Gemini-HD | Vicuna-7B | 1.2M/1.5M | 2880 | 68.4 | 65.0 | - | - | - | 80.3 | 54.6 | 65.8 | 41.3 | 36.8 | 86.8 | - | - |
| FastVITHD | Qwen-2-7B | 558K/1.1M | 256 | 64.4 | - | - | - | - | 63.1 | - | - | - | - | 88.1 | - | - |
| LLaVA-UHD | Vicuna-13B | 595K/665K | ~256 | 67.7 | 62.6 | 56.3 | 36.8 | 63.8 | 81.7 | 56.1 | 68.0 | 42.1 | 35.5 | 89.1 | 65.6 | 60.4 |
| LLaVA-NeXT | Vicuna-7B | 558K/765K | ~2880 | 64.9 | 74.4 | 54.8 | 37.1 | 64.2 | 81.8 | 57.6 | 68.1 | 43.9 | 35.8 | 86.5 | 68.2 | 61.4 |
| InternVL2-HD | Internlm2.5-7B | 558K/770K | ~1282 | 65.6 | 72.6 | 69.8 | 30.9 | 63.2 | 78.9 | 56.3 | 72.1 | 35.7 | 39.9 | 87.3 | 73.4 | 62.1 |
| Ours | ||||||||||||||||
| LLaVA-Meteor | Vicuna-13B | 595K/665K | ~256 (100%) | 69.9 | 64.2 | 59.0 | 39.2 | 64.9 | 82.4 | 59.3 | 69.4 | 44.7 | 37.5 | 89.9 | 67.7 | 62.4 |
| Δ vs. LLaVA-UHD | | | | +2.2 | +1.6 | - | +2.4 | +1.1 | +0.7 | +3.2 | +1.4 | +2.6 | +2.0 | +0.8 | +2.1 | +2.0 |
| LLaVA-Meteor | Vicuna-13B | 595K/665K | ~114 (44.5%) | 68.3 | 63.1 | 58.6 | 37.7 | 64.6 | 81.8 | 57.1 | 68.4 | 42.7 | 34.6 | 88.7 | 66.9 | 61.0 |
| Δ vs. LLaVA-UHD | | | | +0.6 | +0.5 | +2.3 | - | +0.8 | +0.1 | +1.0 | +0.4 | +0.6 | -0.8 | -0.5 | +1.3 | +0.6 |
| LLaVA-Meteor | Vicuna-13B | 595K/665K | ~56 (21.8%) | 65.0 | 58.4 | 56.5 | 37.1 | 62.4 | 81.2 | 55.3 | 68.0 | 41.6 | 34.2 | 87.2 | 64.8 | 59.3 |
| Δ vs. LLaVA-UHD | | | | -2.7 | -4.2 | +0.2 | +0.3 | -1.4 | -0.5 | - | +0.0 | -0.5 | -1.3 | -1.9 | -0.8 | -1.1 |