A Multi-modality Fusion Paper Reading List
- Multimodal Learning with Transformers: A Survey. Peng Xu, Xiatian Zhu, David A. Clifton. ArXiv 2022. **Summary:** A survey of Transformer-based methods for multimodal machine learning (MML).
- Multimodal Intelligence: Representation Learning, Information Fusion, and Applications. Chao Zhang, Zichao Yang, Xiaodong He, Li Deng. ArXiv 2019. **Summary:** A survey of deep learning methods for MML.
- Multimodal Machine Learning: A Survey and Taxonomy. Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. ArXiv 2017. **Summary:** A classic survey of MML that identifies its five core challenges: representation, translation, alignment, fusion, and co-learning.
- Multi-modal Alignment using Representation Codebook. Jiali Duan, Liqun Chen, Son Tran, Jinyu Yang, Yi Xu, Belinda Zeng, Trishul Chilimbi. CVPR 2022.
- Vision-Language Pre-Training with Triple Contrastive Learning. Jinyu Yang, Jiali Duan, Son Tran, Yi Xu, Sampath Chanda, Liqun Chen, Belinda Zeng, Trishul Chilimbi, and Junzhou Huang. CVPR 2022. **Summary:** Proposes three contrastive learning objectives for vision-language pre-training (VLP), covering both intra-modal and inter-modal modeling. Contribution: shows that adding intra-modal supervision in turn benefits cross-modal alignment (a minimal sketch of a symmetric image-text contrastive loss appears after this list).
- An Empirical Study of Training End-to-End Vision-and-Language Transformers. Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, Zicheng Liu, Michael Zeng. CVPR 2022. **Summary:** Dissects VLP model design along multiple dimensions (visual encoder, text encoder, model architecture, pre-training objectives) and identifies the best-performing combination. Contribution: presents METER, a VLP framework built from these design choices.
- VL-BEiT: Generative Vision-Language Pretraining. Hangbo Bao, Wenhui Wang, Li Dong, Furu Wei. ArXiv 2022. **Summary:** Presents a VLP model pre-trained with masked vision-language modeling on image-text pairs, masked language modeling on texts, and masked image modeling on images. Contribution: presents VL-BEiT and shows that the masked image modeling objective is effective for VLP.
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. Junnan Li, Ramprasaath R. Selvaraju, Akhilesh D. Gotmare, Shafiq Joty, Caiming Xiong, Steven C.H. Hoi. NeurIPS 2021.
- ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. Wonjae Kim, Bokyung Son, Ildoo Kim. ICML 2021.
- ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee. NeurIPS 2019.
- VisualBERT: A Simple and Performant Baseline for Vision and Language. Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang. ArXiv 2019.
- UNITER: UNiversal Image-TExt Representation Learning. Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu. ECCV 2020.
- Attention Bottlenecks for Multimodal Fusion. Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun. NeurIPS 2021. **Summary:** Fuses modalities by letting each stream exchange information only through a small set of shared bottleneck tokens (a minimal sketch appears after this list).
- ViViT: A Video Vision Transformer. Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, Cordelia Schmid. ICCV 2021.
- Multiview Transformers for Video Recognition. Shen Yan, Xuehan Xiong, Anurag Arnab, Zhichao Lu, Mi Zhang, Chen Sun, Cordelia Schmid. CVPR 2022.
- Is Space-Time Attention All You Need for Video Understanding? Gedas Bertasius, Heng Wang, Lorenzo Torresani. ICML 2021. **Summary:** Proposes divided space-time attention for video Transformers. Contribution: the first convolution-free approach to video classification (a minimal sketch of divided space-time attention appears after this list).
- VideoBERT: A Joint Model for Video and Language Representation Learning. Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. ICCV 2019.
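
The image-text contrastive objective behind the ALBEF- and TCL-style entries above can be summarized in a few lines. Below is a minimal sketch of a symmetric InfoNCE loss over paired image and text embeddings; the function name, tensor shapes, and temperature value are illustrative assumptions, not the papers' exact implementations.

```python
# Minimal sketch (PyTorch): symmetric image-text contrastive (InfoNCE) loss.
# Assumed setup: image_emb and text_emb are (batch, dim) projections of
# paired images and texts from separate encoders.
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; contrast in both directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example: 8 image-text pairs with 256-dim embeddings.
loss = image_text_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```

TCL's intra-modal objectives apply the same form of loss within a single modality (e.g., between an image and an augmented view of it), which is what "intra-modal supervision" in the summary above refers to.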
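For the attention-bottleneck entry, the core idea is that two modality streams interact only through a handful of shared latent tokens rather than through full pairwise cross-attention. The sketch below is a simplification under assumptions: it reuses nn.TransformerEncoderLayer per modality and averages the two bottleneck updates, standing in for one fusion layer rather than reproducing the authors' exact implementation.

```python
# Minimal sketch (PyTorch): one fusion layer in the spirit of attention
# bottlenecks (MBT). Layer sizes and token counts are assumptions.
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    def __init__(self, dim=256, num_heads=4, num_bottlenecks=4):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(1, num_bottlenecks, dim))
        self.audio_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.video_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)

    def forward(self, audio_tokens, video_tokens):
        b = audio_tokens.size(0)
        z = self.bottleneck.expand(b, -1, -1)
        n = z.size(1)
        # Each stream attends over its own tokens plus the shared bottlenecks,
        # so cross-modal information flows only through those few tokens.
        a = self.audio_layer(torch.cat([audio_tokens, z], dim=1))
        v = self.video_layer(torch.cat([video_tokens, z], dim=1))
        audio_tokens, z_audio = a[:, :-n], a[:, -n:]
        video_tokens, z_video = v[:, :-n], v[:, -n:]
        # Merge the two bottleneck updates into one shared state for the next layer.
        return audio_tokens, video_tokens, (z_audio + z_video) / 2

# Example: 2 clips with 50 audio tokens and 196 video patch tokens.
layer = BottleneckFusionLayer()
audio, video, fused_z = layer(torch.randn(2, 50, 256), torch.randn(2, 196, 256))
```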
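Divided space-time attention from the TimeSformer entry factorizes video self-attention into a temporal step (each patch attends across frames) followed by a spatial step (patches attend within their frame), keeping cost well below joint space-time attention. The sketch below uses an assumed simplified layout (plain residual connections, no layer norms or MLP block), not the paper's full block.

```python
# Minimal sketch (PyTorch): divided space-time attention over video patch tokens.
# Shapes, module layout, and the omission of norms/MLP are simplifying assumptions.
import torch
import torch.nn as nn

class DividedSpaceTimeAttention(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, frames, patches, dim) video patch tokens.
        b, t, p, d = x.shape
        # Temporal attention: each spatial patch attends across frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * p, t, d)
        xt = xt + self.temporal_attn(xt, xt, xt)[0]
        x = xt.reshape(b, p, t, d).permute(0, 2, 1, 3)
        # Spatial attention: patches within each frame attend to one another.
        xs = x.reshape(b * t, p, d)
        xs = xs + self.spatial_attn(xs, xs, xs)[0]
        return xs.reshape(b, t, p, d)

# Example: 2 clips, 8 frames, 14x14 = 196 patches, 256-dim tokens.
out = DividedSpaceTimeAttention()(torch.randn(2, 8, 196, 256))
```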