A foundation model is a large-scale pretrained model (e.g., BERT, DALL-E, GPT-3) that can be adapted to a wide range of downstream applications. This term was first popularized by the Stanford Institute for Human-Centered Artificial Intelligence. This repository maintains a curated list of foundation models for vision and language tasks. Research papers without code are not included.
- Image Segmentation in Foundation Model Era: A Survey (from Beijing Institute of Technology)
- Towards Vision-Language Geo-Foundation Model: A Survey (from Nanyang Technological University)
- An Introduction to Vision-Language Modeling (from Meta)
- The Evolution of Multimodal Model Architectures (from Purdue University)
- Efficient Multimodal Large Language Models: A Survey (from Tencent)
- Foundation Models for Video Understanding: A Survey (from Aalborg University)
- Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond (from GigaAI)
- Prospective Role of Foundation Models in Advancing Autonomous Vehicles (from Tongji University)
- Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey (from Northeastern University)
- A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models (from Lehigh)
- Large Multimodal Agents: A Survey (from CUHK)
- The Uncanny Valley: A Comprehensive Analysis of Diffusion Models (from Mila)
- Real-World Robot Applications of Foundation Models: A Review (from University of Tokyo)
- From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities (from Shanghai AI Lab)
- Towards the Unification of Generative and Discriminative Visual Foundation Model: A Survey (from JHU)
- Foundational Models in Medical Imaging: A Comprehensive Survey and Future Vision (from SDSU)
- Multimodal Foundation Models: From Specialists to General-Purpose Assistants (from Microsoft)
- Towards Generalist Foundation Model for Radiology (from SJTU)
- Foundational Models Defining a New Era in Vision: A Survey and Outlook (from MBZ University of AI)
- Towards Generalist Biomedical AI (from Google)
- A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models (from Oxford)
- Large Multimodal Models: Notes on CVPR 2023 Tutorial (from Chunyuan Li, Microsoft)
- A Survey on Multimodal Large Language Models (from USTC and Tencent)
- Vision-Language Models for Vision Tasks: A Survey (from Nanyang Technological University)
- Foundation Models for Generalist Medical Artificial Intelligence (from Stanford)
- A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT
- A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT
- Vision-language pre-training: Basics, recent advances, and future trends
- On the Opportunities and Risks of Foundation Models (this survey first popularized the concept of foundation models; from Stanford)
- [10/30] Reward Centering (from Richard Sutton, University of Alberta)
- [10/21] Long Term Memory: The Foundation of AI Self-Evolution (from Tianqiao and Chrissy Chen Institute)
- [10/10] Scaling Up Your Kernels: Large Kernel Design in ConvNets towards Universal Representations (from CUHK)
- [10/04] Movie Gen: A Cast of Media Foundation Models (from Meta)
- [10/02] Were RNNs All We Needed? (from Mila)
- [10/01] nGPT: Normalized Transformer with Representation Learning on the Hypersphere (from Nvidia)
- [09/30] MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning (from Apple)
- [09/27] Emu3: Next-Token Prediction is All You Need (from BAAI)
- [09/25] Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models (from Allen AI)
- [09/18] Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution (from Alibaba)
- [09/18] Moshi: a speech-text foundation model for real-time dialogue (from Kyutai)
- [08/27] Diffusion Models Are Real-Time Game Engines (from Google)
- [08/22] Sapiens: Foundation for Human Vision Models (from Meta)
- [08/14] Imagen 3 (from Google DeepMind)
- [07/31] The Llama 3 Herd of Models (from Meta)
- [07/29] SAM 2: Segment Anything in Images and Videos (from Meta)
- [07/24] PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects (from HUST and ByteDance)
- [07/17] EVE: Unveiling Encoder-Free Vision-Language Models (from BAAI)
- [07/12] Transformer Layers as Painters (from Sakana AI)
- [06/24] Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs (from NYU)
- [06/13] 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities (from EPFL and Apple)
- [06/10] Merlin: A Vision Language Foundation Model for 3D Computed Tomography (from Stanford. Code will be available.)
- [06/06] Vision-LSTM: xLSTM as Generic Vision Backbone (from LSTM authors)
- [05/31] MeshXL: Neural Coordinate Field for Generative 3D Foundation Models (from Fudan)
- [05/25] MoEUT: Mixture-of-Experts Universal Transformers (from Stanford)
- [05/22] Attention as an RNN (from Mila & Borealis AI)
- [05/22] GigaPath: A whole-slide foundation model for digital pathology from real-world data (published in Nature)
- [05/21] BiomedParse: a biomedical foundation model for biomedical image parsing (from Microsoft)
- [05/20] Octo: An Open-Source Generalist Robot Policy (from UC Berkeley)
- [05/17] Observational Scaling Laws and the Predictability of Language Model Performance (from Stanford)
- [05/14] Understanding the performance gap between online and offline alignment algorithms (from Google)
- [05/09] Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers (from Shanghai AI Lab)
- [05/08] You Only Cache Once: Decoder-Decoder Architectures for Language Models
- [05/06] Advancing Multimodal Medical Capabilities of Gemini (from Google)
- [05/07] xLSTM: Extended Long Short-Term Memory (from Sepp Hochreiter, the author of LSTM.)
- [05/03] Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models
- [04/30] KAN: Kolmogorov-Arnold Networks (a promising alternative to MLPs; from MIT)
- [04/26] How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites (InternVL 1.5. from Shanghai AI Lab)
- [04/14] TransformerFAM: Feedback attention is working memory (from Google. Efficient attention.)
- [04/10] Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention (from Google)
- [04/02] Octopus v2: On-device language model for super agent (from Stanford)
- [04/02] Mixture-of-Depths: Dynamically allocating compute in transformer-based language models (from Google)
- [03/22] InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding (from Shanghai AI Lab)
- [03/18] Arc2Face: A Foundation Model of Human Faces (from Imperial College London)
- [03/14] MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training (30B parameters. from Apple)
- [03/09] uniGradICON: A Foundation Model for Medical Image Registration (from UNC-Chapel Hill)
- [03/05] Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (Stable Diffusion 3. from Stability AI)
- [03/01] Learning and Leveraging World Models in Visual Representation Learning (from Meta)
- [03/01] VisionLLaMA: A Unified LLaMA Interface for Vision Tasks (from Meituan)
- [02/28] CLLMs: Consistency Large Language Models (from SJTU)
- [02/27] Transparent Image Layer Diffusion using Latent Transparency (from Stanford)
- [02/22] MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases (from Meta)
- [02/21] Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping (from Meta)
- [02/20] Neural Network Diffusion (Generating network parameters via diffusion models. from NUS)
- [02/20] VideoPrism: A Foundational Visual Encoder for Video Understanding (from Google)
- [02/19] FiT: Flexible Vision Transformer for Diffusion Model (from Shanghai AI Lab)
- [02/06] MobileVLM V2: Faster and Stronger Baseline for Vision Language Model (from Meituan)
- [01/30] YOLO-World: Real-Time Open-Vocabulary Object Detection (from Tencent and HUST)
- [01/23] Lumiere: A Space-Time Diffusion Model for Video Generation (from Google)
- [01/22] CheXagent: Towards a Foundation Model for Chest X-Ray Interpretation (from Stanford)
- [01/19] Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data (from TikTok)
- [01/16] SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers (from NYU)
- [01/15] InstantID: Zero-shot Identity-Preserving Generation in Seconds (from Xiaohongshu)
- BioCLIP: A Vision Foundation Model for the Tree of Life (CVPR 2024 best student paper)
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Mamba appears to outperform similarly-sized Transformers while scaling linearly with sequence length. from CMU)
- FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects (from NVIDIA)
- Tracking Everything Everywhere All at Once (from Cornell, ICCV 2023 best student paper)
- Foundation Models for Generalist Geospatial Artificial Intelligence (from IBM and NASA)
- LLaMA 2: Open Foundation and Fine-Tuned Chat Models (from Meta)
- InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition (from Shanghai AI Lab)
- The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World (from Shanghai AI Lab)
- Meta-Transformer: A Unified Framework for Multimodal Learning (from CUHK and Shanghai AI Lab)
- Retentive Network: A Successor to Transformer for Large Language Models (from Microsoft and Tsinghua University)
- Neural World Models for Computer Vision (PhD Thesis of Anthony Hu from University of Cambridge)
- Recognize Anything: A Strong Image Tagging Model (a strong foundation model for image tagging. from OPPO)
- Towards Visual Foundation Models of Physical Scenes (describes a first step towards learning general-purpose visual representations of physical scenes using only image prediction as a training criterion; from AWS)
- LIMA: Less Is More for Alignment (65B parameters, from Meta)
- PaLM 2 Technical Report (from Google)
- IMAGEBIND: One Embedding Space To Bind Them All (from Meta)
- Visual Instruction Tuning (LLaVA, from U of Wisconsin-Madison and Microsoft)
- SEEM: Segment Everything Everywhere All at Once (from University of Wisconsin-Madison, HKUST, and Microsoft)
- SAM: Segment Anything (the first foundation model for image segmentation; from Meta)
- SegGPT: Segmenting Everything In Context (from BAAI, ZJU, and PKU)
- Images Speak in Images: A Generalist Painter for In-Context Visual Learning (from BAAI, ZJU, and PKU)
- UniDetector: Detecting Everything in the Open World: Towards Universal Object Detection (CVPR, from Tsinghua and BNRist)
- Unmasked Teacher: Towards Training-Efficient Video Foundation Models (from Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai AI Laboratory)
- Visual Prompt Multi-Modal Tracking (from Dalian University of Technology and Peng Cheng Laboratory)
- Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks (from ByteDance)
- EVA-CLIP: Improved Training Techniques for CLIP at Scale (from BAAI and HUST)
- EVA-02: A Visual Representation for Neon Genesis (from BAAI and HUST)
- EVA-01: Exploring the Limits of Masked Visual Representation Learning at Scale (CVPR, from BAAI and HUST)
- LLaMA: Open and Efficient Foundation Language Models (A collection of foundation language models ranging from 7B to 65B parameters; from Meta)
- The effectiveness of MAE pre-pretraining for billion-scale pretraining (from Meta)
- BloombergGPT: A Large Language Model for Finance (50 billion parameters; from Bloomberg)
- BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (this work was coordinated by BigScience whose goal is to democratize LLMs.)
- FLIP: Scaling Language-Image Pre-training via Masking (from Meta)
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (from Salesforce Research)
- GPT-4 Technical Report (from OpenAI)
- Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models (from Microsoft Research Asia)
- UNINEXT: Universal Instance Perception as Object Discovery and Retrieval (a unified model for 10 instance perception tasks; CVPR, from ByteDance)
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning (from Shanghai AI Lab)
- InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions (CVPR, from Shanghai AI Lab)
- BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning (from Harbin Institute of Technology and Microsoft Research Asia)
- BEVT: BERT Pretraining of Video Transformers (CVPR, from Shanghai Key Lab of Intelligent Information Processing)
- Foundation Transformers (from Microsoft)
- A Generalist Agent (known as Gato, a multi-modal, multi-task, multi-embodiment generalist agent; from DeepMind)
- FIBER: Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone (from Microsoft, UCLA, and New York University)
- Flamingo: a Visual Language Model for Few-Shot Learning (from DeepMind)
- MetaLM: Language Models are General-Purpose Interfaces (from Microsoft)
- Point-E: A System for Generating 3D Point Clouds from Complex Prompts (efficient 3D object generation using a text-to-image diffusion model; from OpenAI)
- Image Segmentation Using Text and Image Prompts (CVPR, from University of Göttingen)
- Unifying Flow, Stereo and Depth Estimation (A unified model for three motion and 3D perception tasks; from ETH Zurich)
- PaLI: A Jointly-Scaled Multilingual Language-Image Model (from Google)
- VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training (NeurIPS, from Nanjing University, Tencent, and Shanghai AI Lab)
- SLIP: Self-supervision meets Language-Image Pre-training (ECCV, from UC Berkeley and Meta)
- GLIPv2: Unifying Localization and VL Understanding (NeurIPS'22, from UW, Meta, Microsoft, and UCLA)
- GLIP: Grounded Language-Image Pre-training (CVPR, from UCLA and Microsoft)
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (from Salesforce Research)
- NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis (from Microsoft)
- PaLM: Scaling Language Modeling with Pathways (from Google)
- CoCa: Contrastive Captioners are Image-Text Foundation Models (from Google)
- Parti: Scaling Autoregressive Models for Content-Rich Text-to-Image Generation (from Google)
- A Unified Sequence Interface for Vision Tasks (from Google Research, Brain Team)
- Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (from Google)
- Stable Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models (CVPR, from Stability and Runway)
- Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models (BIG-Bench: a 204-task extremely difficult and diverse benchmark for LLMs, 444 authors from 132 institutions)
- CRIS: CLIP-Driven Referring Image Segmentation (from University of Sydney and OPPO)
- Masked Autoencoders As Spatiotemporal Learners (extension of MAE to videos; NeurIPS, from Meta)
- Masked Autoencoders Are Scalable Vision Learners (CVPR 2022, from FAIR)
- InstructGPT: Training language models to follow instructions with human feedback (trained with humans in the loop; from OpenAI)
- A Unified Sequence Interface for Vision Tasks (NeurIPS 2022, from Google)
- DALL-E2: Hierarchical Text-Conditional Image Generation with CLIP Latents (from OpenAI)
- Robust and Efficient Medical Imaging with Self-Supervision (from Google, Georgia Tech, and Northwestern University)
- Video Swin Transformer (CVPR, from Microsoft Research Asia)
- OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework (ICML 2022. from Alibaba.)
- Mask2Former: Masked-attention Mask Transformer for Universal Image Segmentation (CVPR 2022, from FAIR and UIUC)
- FLAVA: A Foundational Language And Vision Alignment Model (CVPR, from Facebook AI Research)
- Towards artificial general intelligence via a multimodal foundation model (Nature Communications, from Renmin University of China)
- FILIP: Fine-Grained Interactive Language-Image Pre-Training (ICLR, from Huawei and HKUST)
- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision (ICLR, from CMU and Google)
- GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models (from OpenAI)
- Unifying Vision-and-Language Tasks via Text Generation (from UNC-Chapel Hill)
- ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision (PMLR, from Google)
- UniT: Multimodal Multitask Learning with a Unified Transformer (ICCV, from FAIR)
- WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training (This paper presents the first large-scale Chinese multimodal pre-training model called BriVL; from Renmin University of China)
- Codex: Evaluating Large Language Models Trained on Code (a GPT language model finetuned on public code from GitHub, from OpenAI and Anthropic AI)
- Florence: A New Foundation Model for Computer Vision (from Microsoft)
- DALL-E: Zero-Shot Text-to-Image Generation (from OpenAI)
- CLIP: Learning Transferable Visual Models From Natural Language Supervision (from OpenAI)
- Multimodal Few-Shot Learning with Frozen Language Models (NeurIPS, from DeepMind)
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV, from Microsoft Research Asia)
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (the first Vision Transformer with pure self-attention blocks; ICLR, from Google)
- GPT-3: Language Models are Few-Shot Learners (175B parameters; permits in-context learning compared with GPT-2; from OpenAI)
- UNITER: UNiversal Image-TExt Representation Learning (from Microsoft)
- T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (from Google)
- GPT-2: Language Models are Unsupervised Multitask Learners (1.5B parameters; from OpenAI)
- LXMERT: Learning Cross-Modality Encoder Representations from Transformers (EMNLP, from UNC-Chapel Hill)
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (from Google AI Language)
- GPT: Improving Language Understanding by Generative Pre-Training (from OpenAI)
- Attention Is All You Need (NeurIPS, from Google and University of Toronto)
- LLaVA: Visual Instruction Tuning (from University of Wisconsin-Madison)
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models (from KAUST)
- GPT-4 Technical Report (from OpenAI)
- GPT-3: Language Models are Few-Shot Learners (175B parameters; permits in-context learning compared with GPT-2; from OpenAI)
- GPT-2: Language Models are Unsupervised Multitask Learners (1.5B parameters; from OpenAI)
- GPT: Improving Language Understanding by Generative Pre-Training (from OpenAI)
- LLaMA 2: Open Foundation and Fine-Tuned Chat Models (from Meta)
- LLaMA: Open and Efficient Foundation Language Models (models ranging from 7B to 65B parameters; from Meta)
- T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (from Google)
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI (from Shanghai AI Lab, 2024)
- BLINK: Multimodal Large Language Models Can See but Not Perceive (multimodal benchmark. from University of Pennsylvania, 2024)
- CAD-Estate: Large-scale CAD Model Annotation in RGB Videos (RGB videos with CAD annotation. from Google, 2023)
- ImageNet: A Large-Scale Hierarchical Image Database (vision benchmark. from Stanford, 2009)
- FLIP: Scaling Language-Image Pre-training via Masking (from Meta)
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (proposes a generic and efficient VLP strategy based on off-the-shelf frozen vision and language models. from Salesforce Research)
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (from Salesforce Research)
- SLIP: Self-supervision meets Language-Image Pre-training (ECCV, from UC Berkeley and Meta)
- GLIP: Grounded Language-Image Pre-training (CVPR, from UCLA and Microsoft)
- ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision (PMLR, from Google)
- RegionCLIP: Region-Based Language-Image Pretraining
- CLIP: Learning Transferable Visual Models From Natural Language Supervision (from OpenAI)
- SAM 2: Segment Anything in Images and Videos (from Meta)
- FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects (from NVIDIA)
- SEEM: Segment Everything Everywhere All at Once (from University of Wisconsin-Madison, HKUST, and Microsoft)
- SAM: Segment Anything (the first foundation model for image segmentation; from Meta)
- SegGPT: Segmenting Everything In Context (from BAAI, ZJU, and PKU)
- Green AI (introduces the concept of Red AI vs Green AI)
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks (the lottery ticket hypothesis, from MIT)
- Bounding the probability of harm from an AI to create a guardrail (blog from Yoshua Bengio)
- Managing Extreme AI Risks amid Rapid Progress (published in Science, May 2024)