Methods Summary of Large Multi-Modality Models

Catalogue

*Large Language Model*

(arXiv2018_GPT) Improving Language Understanding by Generative Pre-Training.
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever.
[paper] [code]

(NAACL2019_BERT) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova.
[paper] [code]

(arXiv2019_GPT-2) Language Models are Unsupervised Multitask Learners.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever.
[paper] [code]

(NeurIPS2019_UniLM) Unified Language Model Pre-training for Natural Language Understanding and Generation.
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon.
[paper] [code]

(NeurIPS2019_XLNet) XLNet: Generalized Autoregressive Pretraining for Language Understanding.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
[paper] [code]

(ICML2020_UniLMv2) UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training.
Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Songhao Piao, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon.
[paper] [code]

(arXiv2020_GPT-3) Language Models are Few-Shot Learners.
OpenAI Team.
[paper] [code]

(arXiv2022_PaLM) PaLM: Scaling Language Modeling with Pathways.
Google Research.
[paper] [code]

(arXiv2023_LLaMA) LLaMA: Open and Efficient Foundation Language Models.
Meta Team.
[paper] [code]

(arXiv2023_RWKV) RWKV: Reinventing RNNs for the Transformer Era.
RWKV Team.
[paper] [code]

(arXiv2023_LLM-Judge) Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica.
[paper] [code]

(arXiv2023_RETNET) Retentive Network: A Successor to Transformer for Large Language Models.
Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, Furu Wei.
[paper] [code]

(arXiv2023_Llama 2) Llama 2: Open Foundation and Fine-Tuned Chat Models.
Meta Team.
[paper] [code]

(arXiv2023_InternLM) InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities.
InternLM Team.
[paper] [code]

(arXiv2023_Qwen) Qwen Technical Report.
Qwen Team.
[paper] [code]

(arXiv2023_LightSeq) LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers.
Dacheng Li, Rulin Shao, Anze Xie, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, Xuezhe Ma, Hao Zhang.
[paper] [code]

(arXiv2023_Mamba) Mamba: Linear-Time Sequence Modeling with Selective State Spaces.
Albert Gu, Tri Dao.
[paper] [code]

(arXiv2024_Mixtral) Mixtral of Experts.
Mistral.AI.
[paper] [code]

(arXiv2024_Phi-3) Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone.
Microsoft Team.
[paper]

*Large Vision Model*

(ICLR2021_ViT) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
[paper] [code]

(ICCV2021_ViViT) ViViT: A Video Vision Transformer.
Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
[paper] [code]

(arXiv2021_MLP-Mixer) MLP-Mixer: An all-MLP Architecture for Vision.
Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy.
[paper] [code]

(ICLR2022_BEiT) BEiT: BERT Pre-Training of Image Transformers.
Hangbo Bao, Li Dong, Songhao Piao, Furu Wei.
[paper] [code]

(CVPR2022_MAE) Masked Autoencoders Are Scalable Vision Learners.
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
[paper] [code]

(ECCV2022_MVP) MVP: Multimodality-guided Visual Pre-training.
Longhui Wei, Lingxi Xie, Wengang Zhou, Houqiang Li, Qi Tian.
[paper]

(arXiv2022_BEiTv2) BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers.
Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, Furu Wei.
[paper] [code]

(ICLR2023_ToME) Token Merging: Your ViT But Faster.
Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman.
[paper] [code]

(CVPR2023_EVA) EVA: Exploring the Limits of Masked Visual Representation Learning at Scale.
Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, Yue Cao.
[paper] [code]

(CVPR2023_Painter) Images Speak in Images: A Generalist Painter for In-Context Visual Learning.
Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, Tiejun Huang.
[paper] [code]

(CVPR2023_MAGVIT) MAGVIT: Masked Generative Video Transformer.
Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang.
[paper] [code]

(ICML2023_ViT-22B) Scaling Vision Transformers to 22 Billion Parameters.
Google Research.
[paper]

(arXiv2023_EVA-02) EVA-02: A Visual Representation for Neon Genesis.
Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, Yue Cao.
[paper] [code]

(arXiv2023_EVA-CLIP) EVA-CLIP: Improved Training Techniques for CLIP at Scale.
Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, Yue Cao.
[paper] [code]

(CVPR2023_VideoMAEv2) VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking.
Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, Yu Qiao.
[paper] [code]

(ICCV2023_SegGPT) SegGPT: Segmenting Everything In Context.
Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, Tiejun Huang.
[paper] [code]

(ICLR2024_MAGVITv2) Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation.
Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A. Ross, Lu Jiang.
[paper]

(CVPR2024_LVM) Sequential Modeling Enables Scalable Learning for Large Vision Models.
Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, Alexei A Efros.
[paper] [code]

(arXiv2024_AIM) Scalable Pre-training of Large Autoregressive Image Models.
Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Alexander Toshev, Vaishaal Shankar, Joshua M Susskind, Armand Joulin.
[paper] [code]

(arXiv2024_VIM) Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model.
Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, Xinggang Wang.
[paper] [code]

(arXiv2024_EVA-CLIP-18B) EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters.
Quan Sun, Jinsheng Wang, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Xinlong Wang.
[paper] [code]

(arXiv2024_VisionLLaMA) VisionLLaMA: A Unified LLaMA Interface for Vision Tasks.
Xiangxiang Chu, Jianlin Su, Bo Zhang, Chunhua Shen.
[paper] [code]

(arXiv2024_Vision-RWKV) Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures.
Yuchen Duan, Weiyun Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Hongsheng Li, Jifeng Dai, Wenhai Wang.
[paper] [code]

(arXiv2024_VideoMamba) VideoMamba: State Space Model for Efficient Video Understanding.
Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, Yu Qiao.
[paper] [code]

(arXiv2024_VAR) Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction.
Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, Liwei Wang.
[paper] [code]

(arXiv2024_Ctrl-Adapter) Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model.
Han Lin, Jaemin Cho, Abhay Zala, Mohit Bansal.
[paper] [code]

*Large Region Multimodal Model*

(NeurIPS2023_VisionLLM) VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks.
Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, Jifeng Dai.
[paper] [code]

(NeurIPS2023_RECODE) Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language Models.
Lin Li, Jun Xiao, Guikun Chen, Jian Shao, Yueting Zhuang, Long Chen.
[paper] [code]

(arXiv2023_DetGPT) DetGPT: Detect What You Need via Reasoning.
Renjie Pi, Jiahui Gao, Shizhe Diao, Rui Pan, Hanze Dong, Jipeng Zhang, Lewei Yao, Jianhua Han, Hang Xu, Lingpeng Kong, Tong Zhang.
[paper] [code]

(arXiv2023_DAC) Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models.
Sivan Doveh, Assaf Arbelle, Sivan Harary, Roei Herzig, Donghyun Kim, Paola Cascante-bonilla, Amit Alfassy, Rameswar Panda, Raja Giryes, Rogerio Feris, Shimon Ullman, Leonid Karlinsky.
[paper]

(arXiv2023_Kosmos-2) Kosmos-2: Grounding Multimodal Large Language Models to the World.
Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei.
[paper] [code]

(arXiv2023_BuboGPT) BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs.
Yang Zhao, Zhijie Lin, Daquan Zhou, Zilong Huang, Jiashi Feng, Bingyi Kang.
[paper] [code]

(arXiv2023_ChatSpot) ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning.
Liang Zhao, En Yu, Zheng Ge, Jinrong Yang, Haoran Wei, Hongyu Zhou, Jianjian Sun, Yuang Peng, Runpei Dong, Chunrui Han, Xiangyu Zhang.
[paper]

(CVPR2024_LISA) LISA: Reasoning Segmentation via Large Language Model.
Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, Jiaya Jia.
[paper] [code]

(CVPR2024_LLM4SGG) LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation.
Kibum Kim, Kanghoon Yoon, Jaehyeong Jeon, Yeonjun In, Jinyoung Moon, Donghyun Kim, Chanyoung Park.
[paper] [code]

(arXiv2023_SoM) Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V.
Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, Jianfeng Gao.
[paper] [code]

(CVPR2024_GLaMM) GLaMM: Pixel Grounding Large Multimodal Model.
Hanoona Rasheed, Muhammad Maaz, Sahal Shaji Mullappilly, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, Fahad S. Khan.
[paper] [code]

(CVPR2024_LION) LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge.
Gongwei Chen, Leyang Shen, Rui Shao, Xiang Deng, Liqiang Nie.
[paper] [code]

(arXiv2023_PG-Video-LLaVA) PG-Video-LLaVA: Pixel Grounding Large Video-Language Models.
Shehan Munasinghe, Rusiru Thushara, Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Mubarak Shah, Fahad Khan.
[paper] [code]

(arXiv2023_DINOv) Visual In-Context Prompting.
Feng Li, Qing Jiang, Hao Zhang, Tianhe Ren, Shilong Liu, Xueyan Zou, Huaizhe Xu, Hongyang Li, Chunyuan Li, Jianwei Yang, Lei Zhang, Jianfeng Gao.
[paper] [code]

(arXiv2023_TAP) Tokenize Anything via Prompting.
Ting Pan, Lulu Tang, Xinlong Wang, Shiguang Shan.
[paper] [code]

(CVPR2024_Emu2) Generative Multimodal Models are In-Context Learners.
Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, Xinlong Wang.
[paper] [code]

(arXiv2024_VisionLLMv2) VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks.
Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Wenhai Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Ping Luo, Yu Qiao, Jifeng Dai.
[paper] [code]

*Large Image Multimodal Model*

(NeurIPS2022_Flamingo) Flamingo: a Visual Language Model for Few-Shot Learning.
DeepMind Team.
[paper] [code]

(CVPR2023_BEiTv3) Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks.
Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, Furu Wei.
[paper] [code]

(ICCV2023_DiT) Scalable Diffusion Models with Transformers.
William Peebles, Saining Xie.
[paper] [code]

(ICML2023_mPLUG-2) mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video.
Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang Li, Bin Bi, Qi Qian, Wei Wang, Guohai Xu, Ji Zhang, Songfang Huang, Fei Huang, Jingren Zhou.
[paper] [code]

(ICCV2023_ControlNet) Adding Conditional Control to Text-to-Image Diffusion Models.
Lvmin Zhang, Anyi Rao, Maneesh Agrawala.
[paper] [code]

(arXiv2023_Kosmos-1) Language Is Not All You Need: Aligning Perception with Language Models.
Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, Furu Wei.
[paper] [code]

(arXiv2023_PaLM-E) PaLM-E: An Embodied Multimodal Language Model.
Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, Pete Florence.
[paper] [code]

(arXiv2023_Visual-ChatGPT) Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models.
Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, Nan Duan.
[paper] [code]

(CVPR2023_GigaGAN) Scaling up GANs for Text-to-Image Synthesis.
Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, Taesung Park.
[paper] [code]

(ICCV2023_ViperGPT) ViperGPT: Visual Inference via Python Execution for Reasoning.
Dídac Surís, Sachit Menon, Carl Vondrick.
[paper] [code]

(arXiv2023_MM-REACT) MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action.
Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, Lijuan Wang.
[paper] [code]

(arXiv2023_LLaMA-Adapter) LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention.
Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Yu Qiao.
[paper] [code]

(NeurIPS2023_HuggingGPT) HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face.
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, Yueting Zhuang.
[paper] [code]

(NeurIPS2023_LLaVA) Visual Instruction Tuning.
Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee.
[paper] [code]

(arXiv2023_MiniGPT-4) MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models.
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny.
[paper] [code]

(arXiv2023_mPLUG-Owl) mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality.
Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou.
[paper] [code]

(arXiv2023_LLaMA-AdapterV2) LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model.
Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, Yu Qiao.
[paper] [code]

(arXiv2023_Otter) Otter: A Multi-Modal Model with In-Context Instruction Tuning.
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, Ziwei Liu.
[paper] [code]

(arXiv2023_MultiModal-GPT) MultiModal-GPT: A Vision and Language Model for Dialogue with Humans.
Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, Kai Chen.
[paper] [code]

(arXiv2023_InternGPT) InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language.
Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, Jiashuo Yu, Kunchang Li, Zhe Chen, Xue Yang, Xizhou Zhu, Yali Wang, Limin Wang, Ping Luo, Jifeng Dai, Yu Qiao.
[paper] [code]

(NeurIPS2023_InstructBLIP) InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning.
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi.
[paper] [code]

(EMNLP2023_IdealGPT) IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models.
Haoxuan You, Rui Sun, Zhecan Wang, Long Chen, Gengyu Wang, Hammad A. Ayyubi, Kai-Wei Chang, Shih-Fu Chang.
[paper] [code]

(NeurIPS2023_LaVIN) Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models.
Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, Rongrong Ji.
[paper] [code]

(arXiv2023_PandaGPT) PandaGPT: One Model To Instruction-Follow Them All.
Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, Deng Cai.
[paper] [code]

(NeurIPS2023_GILL) Generating Images with Multimodal Language Models.
Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov.
[paper] [code]

(NeurIPS2023_GPT4Tools) GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction.
Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, Ying Shan.
[paper] [code]

(arXiv2023_MIMIC-IT) MIMIC-IT: Multi-Modal In-Context Instruction Tuning.
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, Ziwei Liu.
[paper] [code]

(AAAI2024_MotionGPT) MotionGPT: Finetuned LLMs Are General-Purpose Motion Generators.
Yaqi Zhang, Di Huang, Bin Liu, Shixiang Tang, Yan Lu, Lu Chen, Lei Bai, Qi Chu, Nenghai Yu, Wanli Ouyang.
[paper] [code]

(ICLR2024_Emu) Generative Pretraining in Multimodality.
Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang.
[paper] [code]

(Blog2023_IDEFICS) Introducing IDEFICS: An Open Reproduction of State-of-the-Art Visual Language Model.
Hugo Laurençon, Daniel van Strien, Stas Bekman, Leo Tronchon, Lucile Saulnier, Thomas Wang, Siddharth Karamcheti, Amanpreet Singh, Giada Pistilli, Yacine Jernite, Victor Sanh.
[blog]

(AAAI2024_BLIVA) BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions.
Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, Zhuowen Tu.
[paper] [code]

(arXiv2023_Qwen-VL) Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond.
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou.
[paper] [code]

(ICML2024_NExT-GPT) NExT-GPT: Any-to-Any Multimodal LLM.
Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, Tat-Seng Chua.
[paper] [code]

(ACL2024_TextBind) TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild.
Huayang Li, Siheng Li, Deng Cai, Longyue Wang, Lemao Liu, Taro Watanabe, Yujiu Yang, Shuming Shi.
[paper] [code]

(arXiv2023_Kosmos-2.5) Kosmos-2.5: A Multimodal Literate Model.
Tengchao Lv, Yupan Huang, Jingye Chen, Lei Cui, Shuming Ma, Yaoyao Chang, Shaohan Huang, Wenhui Wang, Li Dong, Weiyao Luo, Shaoxiang Wu, Guoxin Wang, Cha Zhang, Furu Wei.
[paper] [code]

(ICLR2024_DreamLLM) DreamLLM: Synergistic Multimodal Comprehension and Creation.
Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, Kaisheng Ma, Li Yi.
[paper] [code]

(arXiv2023_InternLM-XComposer) InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition.
Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Haodong Duan, Songyang Zhang, Shuangrui Ding, Wenwei Zhang, Hang Yan, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang.
[paper] [code]

(CVPR2024_LLaVA1.5) Improved Baselines with Visual Instruction Tuning.
Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee.
[paper] [code]

(arXiv2023_OpenLEAF) OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation.
Jie An, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Lijuan Wang, Jiebo Luo.
[paper]

(arXiv2023_COMM) From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models.
Dongsheng Jiang, Yuchen Liu, Songlin Liu, Jin'e Zhao, Hao Zhang, Zhen Gao, Xiaopeng Zhang, Jin Li, Hongkai Xiong.
[paper] [code]

(arXiv2023_Open X-Embodiment) Open X-Embodiment: Robotic Learning Datasets and RT-X Models.
Open X-Embodiment Collaboration.
[paper] [code]

(arXiv2023_MiniGPT-v2) MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-Task Learning.
Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, Mohamed Elhoseiny.
[paper] [code]

(arXiv2023_Woodpecker) Woodpecker: Hallucination Correction for Multimodal Large Language Models.
Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, Enhong Chen.
[paper] [code]

(CVPR2024_CapsFusion) CapsFusion: Rethinking Image-Text Data at Scale.
Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Yue Cao, Xinlong Wang, Jingjing Liu.
[paper] [code]

(Blog2023_Fuyu-8B) Fuyu-8B: A Multimodal Architecture for AI Agents.
Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar.
[blog]

(Blog2024_Fuyu-Heavy) Fuyu-Heavy: A New Multimodal Model.
Adept Team.
[blog]

(arXiv2023_CogVLM) CogVLM: Visual Expert for Pretrained Language Models.
Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, Jie Tang.
[paper] [code]

(arXiv2023_OtterHD) OtterHD: A High-Resolution Multi-modality Model.
Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, Ziwei Liu.
[paper] [code]

(arXiv2023_mPLUG-Owl2) mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration.
Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou.
[paper] [code]

(CVPR2024_Monkey) Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models.
Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, Xiang Bai.
[paper] [code]

(arXiv2023_LVIS-Instruct4V) To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning.
Junke Wang, Lingchen Meng, Zejia Weng, Bo He, Zuxuan Wu, Yu-Gang Jiang.
[paper] [code]

(arXiv2023_ShareGPT4V) ShareGPT4V: Improving Large Multi-Modal Models with Better Captions.
Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, Dahua Lin.
[paper] [code]

(CVPR2024_Powers-of-Ten) Generative Powers of Ten.
Xiaojuan Wang, Janne Kontkanen, Brian Curless, Steve Seitz, Ira Kemelmacher-Shlizerman, Ben Mildenhall, Pratul Srinivasan, Dor Verbin, Aleksander Holynski.
[paper] [code]

(CVPR2024_OneLLM) OneLLM: One Framework to Align All Modalities with Language.
Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, Xiangyu Yue.
[paper] [code]

(CVPR2024_Honeybee) Honeybee: Locality-enhanced Projector for Multimodal LLM.
Junbum Cha, Wooyoung Kang, Jonghwan Mun, Byungseok Roh.
[paper] [code]

(arXiv2023_CogAgent) CogAgent: A Visual Language Model for GUI Agents.
Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, Jie Tang.
[paper] [code]

(CVPR2024_InternVL) InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks.
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai.
[paper] [code]

(arXiv2023_MobileVLM) MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices.
Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, Chunhua Shen.
[paper] [code]

(arXiv2024_InternLM-XComposer2) InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model.
Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang.
[paper] [code]

(arXiv2024_CoBSAT) Can MLLMs Perform Text-to-Image In-Context Learning?
Yuchen Zeng, Wonjun Kang, Yicong Chen, Hyung Il Koo, Kangwook Lee.
[paper] [code]

(arXiv2024_MobileVLM-V2) MobileVLM V2: Faster and Stronger Baseline for Vision Language Model.
Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, Chunhua Shen.
[paper] [code]

(ICML2024_Prismatic-VLMs) Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models.
Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, Dorsa Sadigh.
[paper] [code]

(arXiv2024_Bunny) Efficient Multimodal Learning from Data-centric Perspective.
Muyang He, Yexin Liu, Boya Wu, Jianhao Yuan, Yueze Wang, Tiejun Huang, Bo Zhao.
[paper] [code]

(arXiv2024_DeepSeek-VL) DeepSeek-VL: Towards Real-World Vision-Language Understanding.
Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, Chong Ruan.
[paper] [code]

(arXiv2024_LLaVA-UHD) LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images.
Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, Gao Huang.
[paper] [code]

(arXiv2024_S2) When Do We Not Need Larger Vision Models?
Baifeng Shi, Ziyang Wu, Maolin Mao, Xin Wang, Trevor Darrell.
[paper] [code]

(arXiv2024_LLaVA-PruMerge) LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models.
Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan.
[paper] [code]

(arXiv2024_Mini-Gemini) Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models.
Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, Jiaya Jia.
[paper] [code]

(Blog2024_Idefics2) Introducing Idefics2: A Powerful 8B Vision-Language Model for the community.
Leo Tronchon, Hugo Laurençon, Victor Sanh.
[blog]

(arXiv2024_InternLM-XComposer2-4KHD) InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD.
Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Zhe Chen, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Kai Chen, Conghui He, Xingcheng Zhang, Jifeng Dai, Yu Qiao, Dahua Lin, Jiaqi Wang.
[paper] [code]

(arXiv2024_BRAVE) BRAVE: Broadening the visual encoding of vision-language models.
Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, Federico Tombari.
[paper] [code]

(arXiv2024_InternVL1.5) How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites.
Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang.
[paper] [code]

(arXiv2024_MANTIS) MANTIS: Interleaved Multi-Image Instruction Tuning.
Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, Wenhu Chen.
[paper] [code]

(arXiv2024_Lumina-T2X) Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers.
Peng Gao, Le Zhuo, Dongyang Liu, Ruoyi Du, Xu Luo, Longtian Qiu, Yuhang Zhang, Chen Lin, Rongjie Huang, Shijie Geng, Renrui Zhang, Junlin Xi, Wenqi Shao, Zhengkai Jiang, Tianshuo Yang, Weicai Ye, He Tong, Jingwen He, Yu Qiao, Hongsheng Li.
[paper] [code]

(arXiv2024_Cambrian-1) Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs.
Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fergus, Yann LeCun, Saining Xie.
[paper] [code]

(Blog2024_LLaVA-NeXT) LLaVA-NeXT-series.
[blog]

(Blog2024_LMMS-Eval) Accelerating the Development of Large Multimodal Models with LMMs-Eval.
[blog]

(arXiv2024_LlamaGen) Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation.
Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, Zehuan Yuan.
[paper] [code]

(arXiv2024_MAR) Autoregressive Image Generation without Vector Quantization.
Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, Kaiming He.
[paper]

(arXiv2024_EVE) Unveiling Encoder-Free Vision-Language Models.
Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, Xinlong Wang.
[paper] [code]

*Large Video Multimodal Model*

(arXiv2022_InternVideo) InternVideo: General Video Foundation Models via Generative and Discriminative Learning.
Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, Yu Qiao.
[paper] [code]

(arXiv2022_VideoCoCa) VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners.
Shen Yan, Tao Zhu, Zirui Wang, Yuan Cao, Mi Zhang, Soham Ghosh, Yonghui Wu, Jiahui Yu.
[paper]

(arXiv2023_VideoChat) VideoChat: Chat-Centric Video Understanding.
KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, Yu Qiao.
[paper] [code]

(EMNLP2023_Video-LLaMA) Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding.
Hang Zhang, Xin Li, Lidong Bing.
[paper] [code]

(arXiv2023_Video-ChatGPT) Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models.
Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan.
[paper] [code]

(arXiv2023_Valley) Valley: Video Assistant with Large Language model Enhanced abilitY.
Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Da Li, Pengcheng Lu, Tao Wang, Linmei Hu, Minghui Qiu, Zhongyu Wei.
[paper] [code]

(CVPR2024_MovieChat) MovieChat: From Dense Token to Sparse Memory for Long Video Understanding.
Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, Gaoang Wang.
[paper] [code]

(EMNLP2023_TESTA) TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding.
Shuhuai Ren, Sishuo Chen, Shicheng Li, Xu Sun, Lu Hou.
[paper] [code]

(CVPR2024_Chat-UniVi) Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding.
Peng Jin, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, Li Yuan.
[paper] [code]

(arXiv2023_VideoChat2) MVBench: A Comprehensive Multi-modal Video Understanding Benchmark.
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, Yu Qiao.
[paper] [code]

(arXiv2023_LLaMA-VID) LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models.
Yanwei Li, Chengyao Wang, Jiaya Jia.
[paper] [code]

(arXiv2024_Video-LaVIT) Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization.
Yang Jin, Zhicheng Sun, Kun Xu, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, Kun Gai, Yadong Mu.
[paper] [code]

(arXiv2024_LSTP) LSTP: Language-guided Spatial-Temporal Prompt Learning for Long-form Video-Text Understanding.
Yuxuan Wang, Yueqian Wang, Pengfei Wu, Jianxin Liang, Dongyan Zhao, Zilong Zheng.
[paper] [code]

(arXiv2024_ShareGPT4Video) ShareGPT4Video: Improving Video Understanding and Generation with Better Captions.
Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, Jiaqi Wang.
[paper] [code]

*Large Model Distillation*

(EMNLP2016_Seq-KD) Sequence-Level Knowledge Distillation.
Yoon Kim, Alexander M. Rush.
[paper] [code]

(arXiv2020_ImitKD) Autoregressive Knowledge Distillation through Imitation Learning.
Alexander Lin, Jeremy Wohlwend, Howard Chen, Tao Lei.
[paper] [code]

(ICLR2024_MINILLM) MINILLM: Knowledge Distillation of Large Language Models.
Yuxian Gu, Li Dong, Furu Wei, Minlie Huang.
[paper] [code]

(ICLR2024_GKD) On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes.
Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, Olivier Bachem.
[paper] [code]

(ACL2023_f-DISTILL) f-Divergence Minimization for Sequence-Level Knowledge Distillation.
Yuqiao Wen, Zichao Li, Wenyu Du, Lili Mou.
[paper] [code]

(arXiv2023_DistillSpec) DistillSpec: Improving Speculative Decoding via Knowledge Distillation.
Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, Rishabh Agarwal.
[paper]

(arXiv2023_MiniMA) Towards the Law of Capacity Gap in Distilling Language Models.
Chen Zhang, Dawei Song, Zheyu Ye, Yan Gao.
[paper] [code]

(arXiv2024_Self-Rewarding) Self-Rewarding Language Models.
Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, Jason Weston.
[paper]

*Related Survey*

(arXiv2020_Survey) Efficient Transformers: A Survey.
Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler.
[paper]

(arXiv2023_Survey) A Survey of Large Language Models.
Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, Ji-Rong Wen.
[paper]

(arXiv2023_Survey) Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models.
Lingxi Xie, Longhui Wei, Xiaopeng Zhang, Kaifeng Bi, Xiaotao Gu, Jianlong Chang, Qi Tian.
[paper]

(arXiv2023_Survey) A Survey on Multimodal Large Language Models.
Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, Enhong Chen.
[paper] [code]

(arXiv2023_Survey) Multimodal Foundation Models: From Specialists to General-Purpose Assistants.
Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao.
[paper] [code]

(arXiv2023_Survey) A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future.
Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang Yu, Tao He, Haotian Wang, Weihua Peng, Ming Liu, Bing Qin, Ting Liu.
[paper] [code]

(CVPR2023w_Survey) Recent Advances in Vision Foundation Models.
Linjie Li, Zhe Gan, Chunyuan Li, Jianwei Yang, Zhengyuan Yang, Jianfeng Gao, Lijuan Wang.
[paper]

(arXiv2023_Survey) A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise.
Chaoyou Fu, Renrui Zhang, Zihan Wang, Yubo Huang, Zhengye Zhang, Longtian Qiu, Gaoxiang Ye, Yunhang Shen, Mengdan Zhang, Peixian Chen, Sirui Zhao, Shaohui Lin, Deqiang Jiang, Di Yin, Peng Gao, Ke Li, Hongsheng Li, Xing Sun.
[paper] [code]

*Related Benchmark*

(NeurIPS2023_LAMM) LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark.
Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Lu Sheng, Lei Bai, Xiaoshui Huang, Zhiyong Wang, Jing Shao, Wanli Ouyang.
[paper] [code]

(arXiv2023_MME) MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models.
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, Rongrong Ji.
[paper] [code]

(arXiv2023_MMBench) MMBench: Is Your Multi-modal Model an All-around Player?
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, Dahua Lin.
[paper] [code]

(arXiv2023_SEED-Bench) SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension.
Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, Ying Shan.
[paper] [code]

(arXiv2023_MagnifierBench) OtterHD: A High-Resolution Multi-modality Model.
Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, Ziwei Liu.
[paper] [code]

(arXiv2023_Video-Bench) Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models.
Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, Li Yuan.
[paper] [code]

(arXiv2023_MVBench) MVBench: A Comprehensive Multi-modal Video Understanding Benchmark.
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, Yu Qiao.
[paper] [code]

(arXiv2023_SEED-Bench-2) SEED-Bench-2: Benchmarking Multimodal Large Language Models.
Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, Ying Shan.
[paper] [code]

(arXiv2023_VBench) VBench: Comprehensive Benchmark Suite for Video Generative Models.
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, Ziwei Liu.
[paper] [code]

(arXiv2024_VL-ICL) VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning.
Yongshuo Zong, Ondrej Bohdal, Timothy Hospedales.
[paper] [code]

(arXiv2024_SEED-Bench-2-Plus) SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension.
Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, Ying Shan.
[paper] [code]

(arXiv2024_Video-MME) Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis.
Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, Xing Sun.
[paper] [code]