| Title | Venue | Date | Code | Demo |
| :---- | :---: | :--: | :--: | :--: |
| MiniCPM-o 4.5 | Blog | 2026-02-06 | Github | Demo |
| DeepSeek-OCR 2: Visual Causal Flow | DeepSeek | 2026-01-27 | Github | - |
| Seed1.8 Model Card: Towards Generalized Real-World Agency | Bytedance Seed | 2025-12-18 | - | - |
| Introducing GPT-5.2 | OpenAI | 2025-12-11 | - | - |
| Introducing Mistral 3 | Blog | 2025-12-02 | Huggingface | - |
| Qwen3-VL Technical Report | arXiv | 2025-11-26 | Github | Demo |
| Emu3.5: Native Multimodal Models are World Learners | arXiv | 2025-10-30 | Github | - |
| VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting | arXiv | 2025-10-21 | Github | Local Demo |
| DeepSeek-OCR: Contexts Optical Compression | arXiv | 2025-10-21 | Github | - |
| OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM | arXiv | 2025-10-17 | Github | - |
| NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching | arXiv | 2025-10-16 | - | - |
| InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue | arXiv | 2025-10-15 | Github | - |
| VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation | arXiv | 2025-10-10 | Github | - |
| LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training | arXiv | 2025-10-09 | Github | Demo |
| Qwen3-Omni Technical Report | arXiv | 2025-09-22 | Github | Demo |
| InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency | arXiv | 2025-08-27 | Github | Demo |
| MiniCPM-V 4.5: A GPT-4o Level MLLM for Single Image, Multi Image and Video Understanding on Your Phone | - | 2025-08-26 | Github | Demo |
| Thyme: Think Beyond Images | arXiv | 2025-08-18 | Github | Demo |
| Introducing GPT-5 | OpenAI | 2025-08-07 | - | - |
| dots.vlm1 | rednote-hilab | 2025-08-06 | Github | Demo |
| Step3: Cost-Effective Multimodal Intelligence | StepFun | 2025-07-31 | Github | Demo |
| GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | arXiv | 2025-07-02 | Github | Demo |
| DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World | arXiv | 2025-06-30 | Github | - |
| Qwen VLo: From "Understanding" the World to "Depicting" It | Qwen | 2025-06-26 | - | Demo |
| MMSearch-R1: Incentivizing LMMs to Search | arXiv | 2025-06-25 | Github | - |
| Show-o2: Improved Native Unified Multimodal Models | arXiv | 2025-06-18 | Github | - |
| Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities | Google | 2025-06-17 | - | - |
| Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning | arXiv | 2025-06-16 | Github | - |
| MiMo-VL Technical Report | arXiv | 2025-06-04 | Github | - |
| OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation | arXiv | 2025-05-29 | Github | - |
| Emerging Properties in Unified Multimodal Pretraining | arXiv | 2025-05-23 | Github | Demo |
| MMaDA: Multimodal Large Diffusion Language Models | arXiv | 2025-05-21 | Github | Demo |
| UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation | arXiv | 2025-05-20 | - | - |
| BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset | arXiv | 2025-05-14 | Github | Local Demo |
| Seed1.5-VL Technical Report | arXiv | 2025-05-11 | - | - |
| Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models | arXiv | 2025-05-08 | Github | - |
| VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model | arXiv | 2025-05-06 | Github | Local Demo |
| Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning | arXiv | 2025-04-23 | Github | - |
| Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | arXiv | 2025-04-21 | Github | - |
| An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes | arXiv | 2025-04-21 | Github | - |
| InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models | arXiv | 2025-04-14 | Github | Demo |
| Introducing GPT-4.1 in the API | OpenAI | 2025-04-14 | - | - |
| Kimi-VL Technical Report | arXiv | 2025-04-10 | Github | Demo |
| The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation | Meta | 2025-04-05 | Hugging Face | - |
| Qwen2.5-Omni Technical Report | Qwen | 2025-03-26 | Github | Demo |
| Addendum to GPT-4o System Card: Native image generation | OpenAI | 2025-03-25 | - | - |
| Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation | arXiv | 2025-03-17 | Github | - |
| Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision | arXiv | 2025-03-07 | - | - |
| Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs | arXiv | 2025-03-03 | Hugging Face | Demo |
| Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy | arXiv | 2025-02-19 | Github | - |
| Qwen2.5-VL Technical Report | arXiv | 2025-02-19 | Github | Demo |
| Baichuan-Omni-1.5 Technical Report | Tech Report | 2025-01-26 | Github | Local Demo |
| LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs | arXiv | 2025-01-10 | Github | - |
| VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction | arXiv | 2025-01-03 | Github | - |
| QVQ: To See the World with Wisdom | Qwen | 2024-12-25 | Github | Demo |
| DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | arXiv | 2024-12-13 | Github | - |
| Apollo: An Exploration of Video Understanding in Large Multimodal Models | arXiv | 2024-12-13 | - | - |
| InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions | arXiv | 2024-12-12 | Github | Local Demo |
| StreamChat: Chatting with Streaming Video | arXiv | 2024-12-11 | Coming soon | - |
| CompCap: Improving Multimodal Large Language Models with Composite Captions | arXiv | 2024-12-06 | - | - |
| LinVT: Empower Your Image-level Large Language Model to Understand Videos | arXiv | 2024-12-06 | Github | - |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | arXiv | 2024-12-06 | Github | Demo |
| NVILA: Efficient Frontier Visual Language Models | arXiv | 2024-12-05 | Github | Demo |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | arXiv | 2024-12-04 | Github | - |
| TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability | arXiv | 2024-11-27 | Github | - |
| ChatRex: Taming Multimodal LLM for Joint Perception and Understanding | arXiv | 2024-11-27 | Github | Local Demo |
| LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | arXiv | 2024-10-22 | Github | Demo |
| Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate | arXiv | 2024-10-09 | Github | - |
| AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark | arXiv | 2024-10-04 | Github | Local Demo |
| EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions | CVPR | 2024-09-26 | Github | Demo |
| Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models | arXiv | 2024-09-25 | Huggingface | Demo |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | arXiv | 2024-09-18 | Github | Demo |
| ChartMoE: Mixture of Expert Connector for Advanced Chart Understanding | ICLR | 2024-09-05 | Github | Local Demo |
| LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture | arXiv | 2024-09-04 | Github | - |
| EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders | arXiv | 2024-08-28 | Github | Demo |
| LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation | arXiv | 2024-08-28 | Github | - |
| mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | arXiv | 2024-08-09 | Github | - |
| VITA: Towards Open-Source Interactive Omni Multimodal LLM | arXiv | 2024-08-09 | Github | - |
| LLaVA-OneVision: Easy Visual Task Transfer | arXiv | 2024-08-06 | Github | Demo |
| MiniCPM-V: A GPT-4V Level MLLM on Your Phone | arXiv | 2024-08-03 | Github | Demo |
| VILA^2: VILA Augmented VILA | arXiv | 2024-07-24 | - | - |
| SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | arXiv | 2024-07-22 | - | - |
| EVLM: An Efficient Vision-Language Model for Visual Understanding | arXiv | 2024-07-19 | - | - |
| IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model | arXiv | 2024-07-10 | Github | - |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | arXiv | 2024-07-03 | Github | Demo |
| OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding | arXiv | 2024-06-27 | Github | Local Demo |
| DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming | AAAI | 2024-06-27 | Github | - |
| Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | arXiv | 2024-06-24 | Github | Local Demo |
| Long Context Transfer from Language to Vision | arXiv | 2024-06-24 | Github | Local Demo |
| video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models | ICML | 2024-06-22 | Github | - |
| TroL: Traversal of Layers for Large Language and Vision Models | EMNLP | 2024-06-18 | Github | Local Demo |
| Unveiling Encoder-Free Vision-Language Models | arXiv | 2024-06-17 | Github | Local Demo |
| VideoLLM-online: Online Video Large Language Model for Streaming Video | CVPR | 2024-06-17 | Github | Local Demo |
| RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics | CoRL | 2024-06-15 | Github | Demo |
| Comparison Visual Instruction Tuning | arXiv | 2024-06-13 | Github | Local Demo |
| Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models | arXiv | 2024-06-12 | Github | - |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | arXiv | 2024-06-11 | Github | Local Demo |
| Parrot: Multilingual Visual Instruction Tuning | arXiv | 2024-06-04 | Github | - |
| Ovis: Structural Embedding Alignment for Multimodal Large Language Model | arXiv | 2024-05-31 | Github | - |
| Matryoshka Query Transformer for Large Vision-Language Models | arXiv | 2024-05-29 | Github | Demo |
| ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | arXiv | 2024-05-24 | Github | - |
| Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models | arXiv | 2024-05-24 | Github | Demo |
| Libra: Building Decoupled Vision System on Large Language Models | ICML | 2024-05-16 | Github | Local Demo |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | arXiv | 2024-05-09 | Github | Local Demo |
| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | arXiv | 2024-04-25 | Github | Demo |
| Graphic Design with Large Multimodal Model | arXiv | 2024-04-22 | Github | - |
| BRAVE: Broadening the visual encoding of vision-language models | ECCV | 2024-04-10 | - | - |
| InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD | arXiv | 2024-04-09 | Github | Demo |
| Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs | arXiv | 2024-04-08 | - | - |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | CVPR | 2024-04-08 | Github | - |
| VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing | NeurIPS | 2024-04-04 | Github | Local Demo |
| TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model | ACM TKDD | 2024-03-28 | - | - |
| LITA: Language Instructed Temporal-Localization Assistant | arXiv | 2024-03-27 | Github | Local Demo |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | arXiv | 2024-03-27 | Github | Demo |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | arXiv | 2024-03-14 | - | - |
| MoAI: Mixture of All Intelligence for Large Language and Vision Models | arXiv | 2024-03-12 | Github | Local Demo |
| DeepSeek-VL: Towards Real-World Vision-Language Understanding | arXiv | 2024-03-08 | Github | Demo |
| TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | arXiv | 2024-03-07 | Github | Demo |
| The All-Seeing Project V2: Towards General Relation Comprehension of the Open World | arXiv | 2024-02-29 | Github | - |
| GROUNDHOG: Grounding Large Language Models to Holistic Segmentation | CVPR | 2024-02-26 | Coming soon | Coming soon |
| AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | arXiv | 2024-02-19 | Github | - |
| Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning | arXiv | 2024-02-18 | Github | - |
| ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | arXiv | 2024-02-18 | Github | Demo |
| CoLLaVO: Crayon Large Language and Vision mOdel | arXiv | 2024-02-17 | Github | - |
| Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models | ICML | 2024-02-12 | Github | - |
| CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations | arXiv | 2024-02-06 | Github | - |
| MobileVLM V2: Faster and Stronger Baseline for Vision Language Model | arXiv | 2024-02-06 | Github | - |
| GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning | NeurIPS | 2024-02-03 | Github | - |
| Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study | arXiv | 2024-01-31 | Coming soon | - |
| LLaVA-NeXT: Improved reasoning, OCR, and world knowledge | Blog | 2024-01-30 | Github | Demo |
| MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | arXiv | 2024-01-29 | Github | Demo |
| InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | arXiv | 2024-01-29 | Github | Demo |
| Yi-VL | - | 2024-01-23 | Github | Local Demo |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | arXiv | 2024-01-22 | - | - |
| ChartAssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning | ACL | 2024-01-04 | Github | Local Demo |
| MobileVLM : A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices | arXiv | 2023-12-28 | Github | - |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | CVPR | 2023-12-21 | Github | Demo |
| Osprey: Pixel Understanding with Visual Instruction Tuning | CVPR | 2023-12-15 | Github | Demo |
| CogAgent: A Visual Language Model for GUI Agents | arXiv | 2023-12-14 | Github | Coming soon |
| Pixel Aligned Language Models | arXiv | 2023-12-14 | Coming soon | - |
| VILA: On Pre-training for Visual Language Models | CVPR | 2023-12-13 | Github | Local Demo |
| See, Say, and Segment: Teaching LMMs to Overcome False Premises | arXiv | 2023-12-13 | Coming soon | - |
| Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | ECCV | 2023-12-11 | Github | Demo |
| Honeybee: Locality-enhanced Projector for Multimodal LLM | CVPR | 2023-12-11 | Github | - |
| Gemini: A Family of Highly Capable Multimodal Models | Google | 2023-12-06 | - | - |
| OneLLM: One Framework to Align All Modalities with Language | arXiv | 2023-12-06 | Github | Demo |
| Lenna: Language Enhanced Reasoning Detection Assistant | arXiv | 2023-12-05 | Github | - |
| VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding | arXiv | 2023-12-04 | - | - |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | arXiv | 2023-12-04 | Github | Local Demo |
| Making Large Multimodal Models Understand Arbitrary Visual Prompts | CVPR | 2023-12-01 | Github | Demo |
| Dolphins: Multimodal Language Model for Driving | arXiv | 2023-12-01 | Github | - |
| LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | arXiv | 2023-11-30 | Github | Coming soon |
| VTimeLLM: Empower LLM to Grasp Video Moments | arXiv | 2023-11-30 | Github | Local Demo |
| mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | arXiv | 2023-11-30 | Github | - |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | arXiv | 2023-11-28 | Github | Coming soon |
| LLMGA: Multimodal Large Language Model based Generation Assistant | arXiv | 2023-11-27 | Github | Demo |
| ChartLlama: A Multimodal LLM for Chart Understanding and Generation | arXiv | 2023-11-27 | Github | - |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | arXiv | 2023-11-21 | Github | Demo |
| LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge | arXiv | 2023-11-20 | Github | - |
| An Embodied Generalist Agent in 3D World | arXiv | 2023-11-18 | Github | Demo |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | arXiv | 2023-11-16 | Github | Demo |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | CVPR | 2023-11-14 | Github | - |
| To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | arXiv | 2023-11-13 | Github | - |
| SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | arXiv | 2023-11-13 | Github | Demo |
| Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | CVPR | 2023-11-11 | Github | Demo |
| LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | arXiv | 2023-11-09 | Github | Demo |
| NExT-Chat: An LMM for Chat, Detection and Segmentation | arXiv | 2023-11-08 | Github | Local Demo |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | arXiv | 2023-11-07 | Github | Demo |
| OtterHD: A High-Resolution Multi-modality Model | arXiv | 2023-11-07 | Github | - |
| CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding | arXiv | 2023-11-06 | Coming soon | - |
| GLaMM: Pixel Grounding Large Multimodal Model | CVPR | 2023-11-06 | Github | Demo |
| What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | arXiv | 2023-11-02 | Github | - |
| SALMONN: Towards Generic Hearing Abilities for Large Language Models | ICLR | 2023-10-20 | Github | - |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | arXiv | 2023-10-14 | Github | Local Demo |
| Ferret: Refer and Ground Anything Anywhere at Any Granularity | arXiv | 2023-10-11 | Github | - |
| CogVLM: Visual Expert For Large Language Models | arXiv | 2023-10-09 | Github | Demo |
| Improved Baselines with Visual Instruction Tuning | arXiv | 2023-10-05 | Github | Demo |
| LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | ICLR | 2023-10-03 | Github | Demo |
| Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs | arXiv | 2023-10-01 | Github | - |
| Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants | arXiv | 2023-10-01 | Github | Local Demo |
| AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model | arXiv | 2023-09-27 | - | - |
| InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | arXiv | 2023-09-26 | Github | Local Demo |
| DreamLLM: Synergistic Multimodal Comprehension and Creation | ICLR | 2023-09-20 | Github | Coming soon |
| An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models | arXiv | 2023-09-18 | Coming soon | - |
| TextBind: Multi-turn Interleaved Multimodal Instruction-following | arXiv | 2023-09-14 | Github | Demo |
| Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics | arXiv | 2023-09-13 | Github | - |
| NExT-GPT: Any-to-Any Multimodal LLM | arXiv | 2023-09-11 | Github | Demo |
| ImageBind-LLM: Multi-modality Instruction Tuning | arXiv | 2023-09-07 | Github | Demo |
| Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning | arXiv | 2023-09-05 | - | - |
| PointLLM: Empowering Large Language Models to Understand Point Clouds | arXiv | 2023-08-31 | Github | Demo |
| ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | arXiv | 2023-08-31 | Github | Local Demo |
| MLLM-DataEngine: An Iterative Refinement Approach for MLLM | arXiv | 2023-08-25 | Github | - |
| Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models | arXiv | 2023-08-25 | Github | Demo |
| Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities | arXiv | 2023-08-24 | Github | Demo |
| Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages | ICLR | 2023-08-23 | Github | Demo |
| StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | arXiv | 2023-08-20 | Github | - |
| BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions | arXiv | 2023-08-19 | Github | Demo |
| Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions | arXiv | 2023-08-08 | Github | - |
| The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | ICLR | 2023-08-03 | Github | Demo |
| LISA: Reasoning Segmentation via Large Language Model | arXiv | 2023-08-01 | Github | Demo |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | arXiv | 2023-07-31 | Github | Local Demo |
| 3D-LLM: Injecting the 3D World into Large Language Models | arXiv | 2023-07-24 | Github | - |
| ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | arXiv | 2023-07-18 | - | Demo |
| BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | arXiv | 2023-07-17 | Github | Demo |
| SVIT: Scaling up Visual Instruction Tuning | arXiv | 2023-07-09 | Github | - |
| GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | arXiv | 2023-07-07 | Github | Demo |
| What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | arXiv | 2023-07-05 | Github | - |
| mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | arXiv | 2023-07-04 | Github | Demo |
| Visual Instruction Tuning with Polite Flamingo | arXiv | 2023-07-03 | Github | Demo |
| LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | arXiv | 2023-06-29 | Github | Demo |
| Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | arXiv | 2023-06-27 | Github | Demo |
| MotionGPT: Human Motion as a Foreign Language | arXiv | 2023-06-26 | Github | - |
| Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | arXiv | 2023-06-15 | Github | Coming soon |
| LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | arXiv | 2023-06-11 | Github | Demo |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | arXiv | 2023-06-08 | Github | Demo |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | arXiv | 2023-06-08 | Github | Demo |
| M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | arXiv | 2023-06-07 | - | - |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | arXiv | 2023-06-05 | Github | Demo |
| LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | arXiv | 2023-06-01 | Github | - |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | arXiv | 2023-05-30 | Github | Demo |
| PandaGPT: One Model To Instruction-Follow Them All | arXiv | 2023-05-25 | Github | Demo |
| ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | arXiv | 2023-05-25 | Github | - |
| Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models | arXiv | 2023-05-24 | Github | Local Demo |
| DetGPT: Detect What You Need via Reasoning | arXiv | 2023-05-23 | Github | Demo |
| Pengi: An Audio Language Model for Audio Tasks | NeurIPS | 2023-05-19 | Github | - |
| VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks | arXiv | 2023-05-18 | Github | - |
| Listen, Think, and Understand | arXiv | 2023-05-18 | Github | Demo |
| VisualGLM-6B | - | 2023-05-17 | Github | Local Demo |
| PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | arXiv | 2023-05-17 | Github | - |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | arXiv | 2023-05-11 | Github | Local Demo |
| VideoChat: Chat-Centric Video Understanding | arXiv | 2023-05-10 | Github | Demo |
| MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | arXiv | 2023-05-08 | Github | Demo |
| X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | arXiv | 2023-05-07 | Github | - |
| LMEye: An Interactive Perception Network for Large Language Models | arXiv | 2023-05-05 | Github | Local Demo |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | arXiv | 2023-04-28 | Github | Demo |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | arXiv | 2023-04-27 | Github | Demo |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | arXiv | 2023-04-20 | Github | - |
| Visual Instruction Tuning | NeurIPS | 2023-04-17 | Github | Demo |
| LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention | ICLR | 2023-03-28 | Github | Demo |
| MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | ACL | 2022-12-21 | Github | - |