Vision-language models (VLMs) have emerged as powerful tools for learning unified embedding spaces that integrate vision and language. Inspired by large language models (LLMs), which have demonstrated remarkable reasoning and multi-task capabilities, visual large language models (VLLMs) are attracting significant attention as the foundation for both general-purpose and specialized vision-language applications.
In this repository, we provide a comprehensive summary of the current literature from an application-oriented perspective. We hope this resource serves as a valuable reference for the VLLM research community.
If you are interested in this project, feel free to contribute to this repo by submitting pull requests 😊😊😊
🚀 What's New in This Update:
- [2025.3.10]: 🔥 Adding three papers on complex reasoning, efficiency, and face understanding!
- [2025.3.6]: 🔥 Adding one paper on complex reasoning!
- [2025.3.2]: 🔥 Adding two projects on complex reasoning: R1-V and VLM-R1!
- [2025.2.23]: 🔥 Adding one video-to-action paper and one vision-to-text paper!
- [2025.2.1]: 🔥 Adding four video-to-text papers!
- [2025.1.22]: 🔥 Adding one video-to-text paper!
- [2025.1.17]: 🔥 Adding three video-to-text papers; thanks to Enxin for the contributions!
- [2025.1.14]: 🔥 Adding two complex reasoning papers and one video-to-text paper!
- [2025.1.13]: 🔥 Adding one VFM survey paper!
- [2025.1.12]: 🔥 Adding one efficient MLLM paper!
- [2025.1.9]: 🔥🔥🔥 Adding one efficient MLLM survey!
- [2025.1.7]: 🔥🔥🔥 Our survey paper is released! Please check this link for more information. We also added more tool management papers to our paper list.
- [2025.1.6]: 🔥 We added one OS Agent survey paper to our paper list, along with a new category: complex reasoning!
- [2025.1.4]: 🔥 We updated the general-domain and egocentric video papers in our paper list; thanks to Wentao for the contributions!
- [2025.1.2]: 🔥 We added more interpretation papers to our paper list; thanks to Ruoyu for the contributions!
- [2024.12.15]: 🔥 We release our VLLM application paper list repo!
- Visual Large Language Models for Generalized and Specialized Applications
Title | Venue | Date | Code | Project |
---|---|---|---|---|
Foundation Models Defining a New Era in Vision: A Survey and Outlook | T-PAMI | 2025-1-9 | Github | Project |
Vision-Language Models for Vision Tasks: A Survey | T-PAMI | 2024-8-8 | Github | Project |
Vision + Language Applications: A Survey | CVPRW | 2023-5-24 | Github | Project |
Vision-and-Language Pretrained Models: A Survey | IJCAI (survey track) | 2022-5-3 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
EchoSight | EchoSight: Advancing Visual-Language Models with Wiki Knowledge | EMNLP | 2024-07-17 | Github | Project |
FROMAGe | Grounding Language Models to Images for Multimodal Inputs and Outputs | ICML | 2024-01-31 | Github | Project |
Wiki-LLaVA | Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs | CVPR | 2023-04-23 | Github | Project |
UniMuR | Unified Embeddings for Multimodal Retrieval via Frozen LLMs | ICML | 2019-05-08 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
Graphist | Graphic Design with Large Multimodal Model | ArXiv | 2024-04-22 | Github | Project |
Ferret-UI | Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs | ECCV | 2024-04-08 | Github | Project |
CogAgent | CogAgent: A Visual Language Model for GUI Agents | CVPR | 2023-12-21 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
FinTral | FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models | ACL | 2024-06-14 | Github | Project |
FinVis-GPT | FinVis-GPT: A Multimodal Large Language Model for Financial Chart Analysis | ArXiv | 2023-07-31 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
Video-LLaVA | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | EMNLP | 2024-10-01 | Github | Project |
BT-Adapter | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | CVPR | 2024-06-27 | Github | Project |
VideoGPT+ | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | arXiv | 2024-06-13 | Github | Project |
Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | ACL | 2024-06-10 | Github | Project |
MVBench | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | CVPR | 2024-05-23 | Github | Project |
LVChat | LVCHAT: Facilitating Long Video Comprehension | ArXiv | 2024-02-19 | Github | Project |
VideoChat | VideoChat: Chat-Centric Video Understanding | ArXiv | 2024-01-04 | Github | Project |
Valley | Valley: Video Assistant with Large Language model Enhanced abilitY | ArXiv | 2023-10-08 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
PALM | PALM: Predicting Actions through Language Models | CVPR Workshop | 2024-07-18 | Github | Project |
GPT4Ego | GPT4Ego: Unleashing the Potential of Pre-trained Models for Zero-Shot Egocentric Action Recognition | ArXiv | 2024-05-11 | Github | Project |
AntGPT | AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos? | ICLR | 2024-04-01 | Github | Project |
LEAP | LEAP: LLM-Generation of Egocentric Action Programs | ArXiv | 2023-11-29 | Github | Project |
LLM-Inner-Speech | Egocentric Video Comprehension via Large Language Model Inner Speech | CVPR Workshop | 2023-06-18 | Github | Project |
LLM-Brain | LLM as A Robotic Brain: Unifying Egocentric Memory and Control | ArXiv | 2023-04-25 | Github | Project |
LaViLa | Learning Video Representations from Large Language Models | CVPR | 2022-12-08 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
DriveLM | DriveLM: Driving with Graph Visual Question Answering | ECCV | 2024-7-17 | Github | Project |
Talk2BEV | Talk2BEV: Language-enhanced Bird’s-eye View Maps for Autonomous Driving | ICRA | 2024-5-13 | Github | Project |
NuScenes-QA | NuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving Scenario | AAAI | 2024-3-24 | Github | Project |
DriveMLM | DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving | ArXiv | 2023-12-25 | Github | Project |
LiDAR-LLM | LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding | CoRR | 2023-12-21 | Github | Project |
Dolphins | Dolphins: Multimodal Language Model for Driving | ArXiv | 2023-12-1 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
DriveGPT4 | DriveGPT4: Interpretable End-to-End Autonomous Driving Via Large Language Model | RAL | 2024-8-7 | Github | Project |
SurrealDriver | SurrealDriver: Designing LLM-powered Generative Driver Agent Framework based on Human Drivers’ Driving-thinking Data | ArXiv | 2024-7-22 | Github | Project |
DriveVLM | DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models | CoRL | 2024-6-25 | Github | Project |
DiLu | DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models | ICLR | 2024-2-22 | Github | Project |
LMDrive | LMDrive: Closed-Loop End-to-End Driving with Large Language Models | CVPR | 2023-12-21 | Github | Project |
GPT-Driver | GPT-Driver: Learning to Drive with GPT | NeurIPS Workshop | 2023-12-5 | Github | Project |
ADriver-I | ADriver-I: A General World Model for Autonomous Driving | ArXiv | 2023-11-22 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
Senna | Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving | ArXiv | 2024-10-29 | Github | Project |
BEV-InMLLM | Holistic Autonomous Driving Understanding by Bird’s-Eye-View Injected Multi-Modal Large Model | CVPR | 2024-1-2 | Github | Project |
Prompt4Driving | Language Prompt for Autonomous Driving | ArXiv | 2023-9-8 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
Wonderful-Team | Wonderful Team: Zero-Shot Physical Task Planning with Visual LLMs | ArXiv | 2024-12-4 | Github | Project |
AffordanceLLM | AffordanceLLM: Grounding Affordance from Vision Language Models | CVPR | 2024-4-17 | Github | Project |
3DVisProg | Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding | CVPR | 2024-3-23 | Github | Project |
RePLan | RePLan: Robotic Replanning with Perception and Language Models | ArXiv | 2024-2-20 | Github | Project |
PaLM-E | PaLM-E: An Embodied Multimodal Language Model | ICML | 2023-3-6 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
OpenVLA | OpenVLA: An Open-Source Vision-Language-Action Model | ArXiv | 2024-9-5 | Github | Project |
LLARVA | LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning | CoRL | 2024-6-17 | Github | Project |
RT-X | Open X-Embodiment: Robotic Learning Datasets and RT-X Models | ArXiv | 2024-6-1 | Github | Project |
RoboFlamingo | Vision-Language Foundation Models as Effective Robot Imitators | ICLR | 2024-2-5 | Github | Project |
VoxPoser | VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models | CoRL | 2023-11-2 | Github | Project |
ManipLLM | ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation | CVPR | 2023-12-24 | Github | Project |
RT-2 | RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | ArXiv | 2023-7-28 | Github | Project |
Instruct2Act | Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model | ArXiv | 2023-5-24 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
LLaRP | Large Language Models as Generalizable Policies for Embodied Tasks | ICLR | 2024-4-16 | Github | Project |
MP5 | MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception | CVPR | 2024-3-24 | Github | Project |
LL3DA | LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | CVPR | 2023-11-30 | Github | Project |
EmbodiedGPT | EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | NeurIPS | 2023-11-2 | Github | Project |
ELLM | Guiding Pretraining in Reinforcement Learning with Large Language Models | ICML | 2023-9-15 | Github | Project |
3D-LLM | 3D-LLM: Injecting the 3D World into Large Language Models | NeurIPS | 2023-7-24 | Github | Project |
NLMap | Open-vocabulary Queryable Scene Representations for Real World Planning | ICRA | 2023-7-4 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
ConceptGraphs | ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning | ICRA | 2024-5-13 | Github | Project |
RILA | RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation | CVPR | 2024-4-27 | Github | Project |
EMMA | Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld | CVPR | 2024-3-29 | Github | Project |
VLN-VER | Volumetric Environment Representation for Vision-Language Navigation | CVPR | 2024-3-24 | Github | Project |
MultiPLY | MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World | CVPR | 2024-1-16 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
3DGPT | 3D-GPT: Procedural 3D Modeling with Large Language Models | ArXiv | 2024-5-29 | GitHub | Project |
Holodeck | Holodeck: Language Guided Generation of 3D Embodied AI Environments | CVPR | 2024-4-22 | GitHub | Project |
LLMR | LLMR: Real-time Prompting of Interactive Worlds using Large Language Models | ACM CHI | 2024-3-22 | GitHub | Project |
GPT4Point | GPT4Point: A Unified Framework for Point-Language Understanding and Generation | ArXiv | 2023-12-1 | GitHub | Project |
ShapeGPT | ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model | ArXiv | 2023-12-1 | GitHub | Project |
MeshGPT | MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers | ArXiv | 2023-11-27 | GitHub | Project |
LI3D | Towards Language-guided Interactive 3D Generation: LLMs as Layout Interpreter with Generative Feedback | NeurIPS | 2023-5-26 | GitHub | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
Emotion-LLaMA | Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning | arXiv | 2024-11-2 | Github | Project |
Face-MLLM | Face-MLLM: A Large Face Perception Model | arXiv | 2024-10-28 | Github | Project |
ExpLLM | ExpLLM: Towards Chain of Thought for Facial Expression Recognition | arXiv | 2024-9-4 | Github | Project |
EMO-LLaMA | EMO-LLaMA: Enhancing Facial Emotion Understanding with Instruction Tuning | arXiv | 2024-8-21 | Github | Project |
EmoLA | Facial Affective Behavior Analysis with Instruction Tuning | ECCV | 2024-7-12 | Github | Project |
EmoLLM | EmoLLM: Multimodal Emotional Understanding Meets Large Language Models | ArXiv | 2024-6-29 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
HAWK | HAWK: Learning to Understand Open-World Video Anomalies | NeurIPS | 2024-5-27 | Github | Project |
CUVA | Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly | CVPR | 2024-5-6 | Github | Project |
LAVAD | Harnessing Large Language Models for Training-free Video Anomaly Detection | CVPR | 2024-4-1 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
SynthVLM | SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models | ArXiv | 2024-8-10 | Github | Project |
WolfMLLM | The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative | ArXiv | 2024-6-3 | Github | Project |
AttackMLLM | Synthvlm: High-efficiency and high-quality synthetic data for vision language models | ICLRW | 2024-5-16 | Github | Project |
OODCV | How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | ECCV | 2023-11-27 | Github | Project |
InjectMLLM | (Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs | ArXiv | 2023-10-3 | Github | Project |
AdvMLLM | On the Adversarial Robustness of Multi-Modal Foundation Models | ICCVW | 2023-8-21 | Github | Project |
Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
MM-EUREKA | MM-EUREKA: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning | Github | 2025-3-7 | Github | Project |
Visual-RFT | Visual-RFT: Visual Reinforcement Fine-Tuning | ArXiv | 2025-3-3 | Github | Project |
VLM-R1 | VLM-R1: A stable and generalizable R1-style Large Vision-Language Model | None | 2025-2-15 | Github | Project |
R1-V | R1-V: Reinforcing Super Generalization Ability in Vision Language Models with Less Than $3 | Blog | 2025-2-3 | Github | Project |
LlamaV-o1 | LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs | ArXiv | 2025-1-10 | Github | Project |
Virgo | Virgo: A Preliminary Exploration on Reproducing o1-like MLLM | ArXiv | 2025-1-3 | Github | Project |
Mulberry | Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search | ArXiv | 2024-12-31 | Github | Project |
LLaVA-CoT | LLaVA-CoT: Let Vision Language Models Reason Step-by-Step | ArXiv | 2024-11-25 | Github | Project |
Thanks to all the contributors!