Vision-language models (VLMs) have emerged as powerful tools for learning unified embedding spaces that integrate vision and language. Inspired by large language models (LLMs), which have demonstrated remarkable reasoning and multi-task capabilities, visual large language models (VLLMs) are attracting significant attention as the foundation for both general-purpose and specialized vision-language applications.
In this repository, we provide a comprehensive summary of the current literature from an application-oriented perspective. We hope this resource serves as a valuable reference for the VLLM research community.
If you are interested in this project, feel free to contribute to this repo by submitting pull requests 😊😊😊
🚀 What's New in This Update:
- [2025.3.10]: 🔥 Adding three papers on complex reasoning, efficiency, and face understanding!
- [2025.3.6]: 🔥 Adding one paper on complex reasoning!
- [2025.3.2]: 🔥 Adding two projects on complex reasoning: R1-V and VLM-R1!
- [2025.2.23]: 🔥 Adding one video-to-action paper and one vision-to-text paper!
- [2025.2.1]: 🔥 Adding four video-to-text papers!
- [2025.1.22]: 🔥 Adding one video-to-text paper!
- [2025.1.17]: 🔥 Adding three video-to-text papers; thanks to Enxin for the contributions!
- [2025.1.14]: 🔥 Adding two complex reasoning papers and one video-to-text paper!
- [2025.1.13]: 🔥 Adding one VFM survey paper!
- [2025.1.12]: 🔥 Adding one efficient MLLM paper!
- [2025.1.9]: 🔥🔥🔥 Adding one efficient MLLM survey!
- [2025.1.7]: 🔥🔥🔥 Our survey paper is released! Please check this link for more information. We also added more tool management papers to our paper list.
- [2025.1.6]: 🔥 We added one OS Agent survey paper to our paper list, along with a new category: complex reasoning!
- [2025.1.4]: 🔥 We updated the general-domain and egocentric video papers in our paper list; thanks to Wentao for the contributions!
- [2025.1.2]: 🔥 We added more interpretation papers to our paper list; thanks to Ruoyu for the contributions!
- [2024.12.15]: 🔥 We release our VLLM application paper list repo!
- Visual Large Language Models for Generalized and Specialized Applications
Title | Venue | Date | Code | Project |
---|---|---|---|---|
Foundation Models Defining a New Era in Vision: A Survey and Outlook | T-PAMI | 2025-1-9 | Github | Project |
Vision-Language Models for Vision Tasks: A Survey | T-PAMI | 2024-8-8 | Github | Project |
Vision + Language Applications: A Survey | CVPRW | 2023-5-24 | Github | Project |
Vision-and-Language Pretrained Models: A Survey | IJCAI (survey track) | 2022-5-3 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
EchoSight | EchoSight: Advancing Visual-Language Models with Wiki Knowledge | EMNLP | 2024-07-17 | Github | Project |
FROMAGe | Grounding Language Models to Images for Multimodal Inputs and Outputs | ICML | 2024-01-31 | Github | Project |
Wiki-LLaVA | Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs | CVPR | 2023-04-23 | Github | Project |
UniMuR | Unified Embeddings for Multimodal Retrieval via Frozen LLMs | ICML | 2019-05-08 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
Graphist | Graphic Design with Large Multimodal Model | ArXiv | 2024-04-22 | Github | Project |
Ferret-UI | Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs | ECCV | 2024-04-08 | Github | Project |
CogAgent | CogAgent: A Visual Language Model for GUI Agents | CVPR | 2023-12-21 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
FinTral | FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models | ACL | 2024-06-14 | Github | Project |
FinVis-GPT | FinVis-GPT: A Multimodal Large Language Model for Financial Chart Analysis | ArXiv | 2023-07-31 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
Video-LLaVA | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | EMNLP | 2024-10-01 | Github | Project |
BT-Adapter | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | CVPR | 2024-06-27 | Github | Project |
VideoGPT+ | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | arXiv | 2024-06-13 | Github | Project |
Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | ACL | 2024-06-10 | Github | Project |
MVBench | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | CVPR | 2024-05-23 | Github | Project |
LVChat | LVCHAT: Facilitating Long Video Comprehension | ArXiv | 2024-02-19 | Github | Project |
VideoChat | VideoChat: Chat-Centric Video Understanding | ArXiv | 2024-01-04 | Github | Project |
Valley | Valley: Video Assistant with Large Language model Enhanced abilitY | ArXiv | 2023-10-08 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
PALM | PALM: Predicting Actions through Language Models | CVPR Workshop | 2024-07-18 | Github | Project |
GPT4Ego | GPT4Ego: Unleashing the Potential of Pre-trained Models for Zero-Shot Egocentric Action Recognition | ArXiv | 2024-05-11 | Github | Project |
AntGPT | AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos? | ICLR | 2024-04-01 | Github | Project |
LEAP | LEAP: LLM-Generation of Egocentric Action Programs | ArXiv | 2023-11-29 | Github | Project |
LLM-Inner-Speech | Egocentric Video Comprehension via Large Language Model Inner Speech | CVPR Workshop | 2023-06-18 | Github | Project |
LLM-Brain | LLM as A Robotic Brain: Unifying Egocentric Memory and Control | ArXiv | 2023-04-25 | Github | Project |
LaViLa | Learning Video Representations from Large Language Models | CVPR | 2022-12-08 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
DriveLM | DriveLM: Driving with Graph Visual Question Answering | ECCV | 2024-7-17 | Github | Project |
Talk2BEV | Talk2BEV: Language-enhanced Bird’s-eye View Maps for Autonomous Driving | ICRA | 2024-5-13 | Github | Project |
NuScenes-QA | NuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving Scenario | AAAI | 2024-3-24 | Github | Project |
DriveMLM | DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving | ArXiv | 2023-12-25 | Github | Project |
LiDAR-LLM | LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding | CoRR | 2023-12-21 | Github | Project |
Dolphins | Dolphins: Multimodal Language Model for Driving | ArXiv | 2023-12-1 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
DriveGPT4 | DriveGPT4: Interpretable End-to-End Autonomous Driving Via Large Language Model | RAL | 2024-8-7 | Github | Project |
SurrealDriver | SurrealDriver: Designing LLM-powered Generative Driver Agent Framework based on Human Drivers’ Driving-thinking Data | ArXiv | 2024-7-22 | Github | Project |
DriveVLM | DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models | CoRL | 2024-6-25 | Github | Project |
DiLu | DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models | ICLR | 2024-2-22 | Github | Project |
LMDrive | LMDrive: Closed-Loop End-to-End Driving with Large Language Models | CVPR | 2023-12-21 | Github | Project |
GPT-Driver | GPT-Driver: Learning to Drive with GPT | NeurIPS Workshop | 2023-12-5 | Github | Project |
ADriver-I | ADriver-I: A General World Model for Autonomous Driving | ArXiv | 2023-11-22 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
Senna | Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving | ArXiv | 2024-10-29 | Github | Project |
BEV-InMLLM | Holistic Autonomous Driving Understanding by Bird’s-Eye-View Injected Multi-Modal Large Model | CVPR | 2024-1-2 | Github | Project |
Prompt4Driving | Language Prompt for Autonomous Driving | ArXiv | 2023-9-8 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
Wonderful-Team | Wonderful Team: Zero-Shot Physical Task Planning with Visual LLMs | ArXiv | 2024-12-4 | Github | Project |
AffordanceLLM | AffordanceLLM: Grounding Affordance from Vision Language Models | CVPR | 2024-4-17 | Github | Project |
3DVisProg | Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding | CVPR | 2024-3-23 | Github | Project |
RePLan | RePLan: Robotic Replanning with Perception and Language Models | ArXiv | 2024-2-20 | Github | Project |
PaLM-E | PaLM-E: An Embodied Multimodal Language Model | ICML | 2023-3-6 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
OpenVLA | OpenVLA: An Open-Source Vision-Language-Action Model | ArXiv | 2024-9-5 | Github | Project |
LLARVA | LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning | CoRL | 2024-6-17 | Github | Project |
RT-X | Open X-Embodiment: Robotic Learning Datasets and RT-X Models | ArXiv | 2024-6-1 | Github | Project |
RoboFlamingo | Vision-Language Foundation Models as Effective Robot Imitators | ICLR | 2024-2-5 | Github | Project |
VoxPoser | VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models | CoRL | 2023-11-2 | Github | Project |
ManipLLM | ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation | CVPR | 2023-12-24 | Github | Project |
RT-2 | RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | ArXiv | 2023-7-28 | Github | Project |
Instruct2Act | Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model | ArXiv | 2023-5-24 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
LLaRP | Large Language Models as Generalizable Policies for Embodied Tasks | ICLR | 2024-4-16 | Github | Project |
MP5 | MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception | CVPR | 2024-3-24 | Github | Project |
LL3DA | LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | CVPR | 2023-11-30 | Github | Project |
EmbodiedGPT | EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | NeurIPS | 2023-11-2 | Github | Project |
ELLM | Guiding Pretraining in Reinforcement Learning with Large Language Models | ICML | 2023-9-15 | Github | Project |
3D-LLM | 3D-LLM: Injecting the 3D World into Large Language Models | NeurIPS | 2023-7-24 | Github | Project |
NLMap | Open-vocabulary Queryable Scene Representations for Real World Planning | ICRA | 2023-7-4 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
ConceptGraphs | ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning | ICRA | 2024-5-13 | Github | Project |
RILA | RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation | CVPR | 2024-4-27 | Github | Project |
EMMA | Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld | CVPR | 2024-3-29 | Github | Project |
VLN-VER | Volumetric Environment Representation for Vision-Language Navigation | CVPR | 2024-3-24 | Github | Project |
MultiPLY | MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World | CVPR | 2024-1-16 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
3DGPT | 3D-GPT: Procedural 3D Modeling with Large Language Models | ArXiv | 2024-5-29 | GitHub | Project |
Holodeck | Holodeck: Language Guided Generation of 3D Embodied AI Environments | CVPR | 2024-4-22 | GitHub | Project |
LLMR | LLMR: Real-time Prompting of Interactive Worlds using Large Language Models | ACM CHI | 2024-3-22 | GitHub | Project |
GPT4Point | GPT4Point: A Unified Framework for Point-Language Understanding and Generation | ArXiv | 2023-12-1 | GitHub | Project |
ShapeGPT | ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model | ArXiv | 2023-12-1 | GitHub | Project |
MeshGPT | MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers | ArXiv | 2023-11-27 | GitHub | Project |
LI3D | Towards Language-guided Interactive 3D Generation: LLMs as Layout Interpreter with Generative Feedback | NeurIPS | 2023-5-26 | GitHub | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
Emotion-LLaMA | Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning | arXiv | 2024-11-2 | Github | Project |
Face-MLLM | Face-MLLM: A Large Face Perception Model | arXiv | 2024-10-28 | Github | Project |
ExpLLM | ExpLLM: Towards Chain of Thought for Facial Expression Recognition | arXiv | 2024-9-4 | Github | Project |
EMO-LLaMA | EMO-LLaMA: Enhancing Facial Emotion Understanding with Instruction Tuning | arXiv | 2024-8-21 | Github | Project |
EmoLA | Facial Affective Behavior Analysis with Instruction Tuning | ECCV | 2024-7-12 | Github | Project |
EmoLLM | EmoLLM: Multimodal Emotional Understanding Meets Large Language Models | ArXiv | 2024-6-29 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
HAWK | HAWK: Learning to Understand Open-World Video Anomalies | NeurIPS | 2024-5-27 | Github | Project |
CUVA | Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly | CVPR | 2024-5-6 | Github | Project |
LAVAD | Harnessing Large Language Models for Training-free Video Anomaly Detection | CVPR | 2024-4-1 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
SynthVLM | SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models | ArXiv | 2024-8-10 | Github | Project |
WolfMLLM | The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative | ArXiv | 2024-6-3 | Github | Project |
AttackMLLM | Synthvlm: High-efficiency and high-quality synthetic data for vision language models | ICLRW | 2024-5-16 | Github | Project |
OODCV | How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | ECCV | 2023-11-27 | Github | Project |
InjectMLLM | (Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs | ArXiv | 2023-10-3 | Github | Project |
AdvMLLM | On the Adversarial Robustness of Multi-Modal Foundation Models | ICCVW | 2023-8-21 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
MM-EUREKA | MM-EUREKA: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning | Github | 2025-3-7 | Github | Project |
Visual-RFT | Visual-RFT: Visual Reinforcement Fine-Tuning | ArXiv | 2025-3-3 | Github | Project |
VLM-R1 | VLM-R1: A stable and generalizable R1-style Large Vision-Language Model | None | 2025-2-15 | Github | Project |
R1-V | R1-V: Reinforcing Super Generalization Ability in Vision Language Models with Less Than $3 | Blog | 2025-2-3 | Github | Project |
LlamaV-o1 | LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs | ArXiv | 2025-1-10 | Github | Project |
Virgo | Virgo: A Preliminary Exploration on Reproducing o1-like MLLM | ArXiv | 2025-1-3 | Github | Project |
Mulberry | Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search | ArXiv | 2024-12-31 | Github | Project |
LLaVA-CoT | LLaVA-CoT: Let Vision Language Models Reason Step-by-Step | ArXiv | 2024-11-25 | Github | Project |
Thanks to all the contributors!