Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

arXiv Website

This repository is designed to collect and categorize papers related to Multimodal Retrieval-Augmented Generation (RAG) according to our survey paper: Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation. Given the rapid growth in this field, we will continuously update both the paper and this repository to serve as a resource for researchers working on future projects.

📢 News

  • February 17, 2025: We released the first survey on Multimodal Retrieval-Augmented Generation.

  • April 18, 2025: Our website for this topic is now live.

    Feel free to cite, contribute, or open a pull request to add recent related papers!

📑 List of Contents


🔎 General Pipeline

🌿 Taxonomy of Recent Advances and Enhancements

⚙ Taxonomy of Application Domains

๐Ÿ“ Abstract

Large Language Models (LLMs) struggle with hallucinations and outdated knowledge due to their reliance on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external dynamic information enhancing factual and updated grounding. Recent advances in multimodal learning have led to the development of Multimodal RAG, incorporating multiple modalities such as text, images, audio, and video to enhance the generated outputs. However, cross-modal alignment and reasoning introduce unique challenges to Multimodal RAG, distinguishing it from traditional unimodal RAG.

This survey offers a structured and comprehensive analysis of Multimodal RAG systems, covering datasets, metrics, benchmarks, evaluation, methodologies, and innovations in retrieval, fusion, augmentation, and generation. We precisely review training strategies, robustness enhancements, and loss functions, while also exploring the diverse Multimodal RAG scenarios. Furthermore, we discuss open challenges and future research directions to support advancements in this evolving field. This survey lays the foundation for developing more capable and reliable AI systems that effectively leverage multimodal dynamic external knowledge bases.
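
To make the overall flow concrete, here is a minimal, illustrative sketch of a multimodal RAG loop as described above. All class names, function names, and the `embed`/`generate` backends are hypothetical placeholders, not the interface of any specific system surveyed here.

```python
# Minimal, illustrative multimodal RAG loop (all components are hypothetical placeholders).
from dataclasses import dataclass

import numpy as np


@dataclass
class Document:
    content: str           # text, an image caption/path, a transcript, etc.
    modality: str          # "text", "image", "audio", or "video"
    embedding: np.ndarray  # vector in a shared multimodal embedding space


def retrieve(query_emb: np.ndarray, corpus: list[Document], k: int = 3) -> list[Document]:
    """Rank documents by cosine similarity to the query embedding and keep the top-k."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return sorted(corpus, key=lambda d: cosine(query_emb, d.embedding), reverse=True)[:k]


def multimodal_rag(query: str, corpus: list[Document], embed, generate) -> str:
    """embed: maps a query to the shared space; generate: any multimodal LLM call."""
    query_emb = embed(query)
    retrieved = retrieve(query_emb, corpus)
    # Fusion here is simple prompt concatenation; real systems use re-ranking,
    # score fusion, or attention-based integration (see the taxonomy above).
    context = "\n".join(f"[{d.modality}] {d.content}" for d in retrieved)
    prompt = f"Answer using the retrieved evidence.\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```

The corpus here is assumed to already be embedded into a shared space (for example, by a CLIP-style encoder); the retrieval, fusion, augmentation, and generation sections below cover the many ways each of these steps is implemented in practice.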

📊 Overview of Popular Datasets

🖼 Image-Text

| Name | Statistics and Description | Modalities | Link |
|------|----------------------------|------------|------|
| LAION-400M | 200M image–text pairs; used for pre-training multimodal models. | Image, Text | LAION-400M |
| Conceptual Captions (CC) | 15M image–caption pairs; multilingual English–German image descriptions. | Image, Text | Conceptual Captions |
| CIRR | 36,554 triplets from 21,552 images; focuses on natural image relationships. | Image, Text | CIRR |
| MS-COCO | 330K images with captions; used for caption-to-image and image-to-caption generation. | Image, Text | MS-COCO |
| Flickr30K | 31K images annotated with five English captions per image. | Image, Text | Flickr30K |
| Multi30K | 30K German captions from native speakers and human-translated captions. | Image, Text | Multi30K |
| NoCaps | For zero-shot image captioning evaluation; 15K images. | Image, Text | NoCaps |
| LAION-5B | 5B image–text pairs used as external memory for retrieval. | Image, Text | LAION-5B |
| COCO-CN | 20,341 images for cross-lingual tagging and captioning with Chinese sentences. | Image, Text | COCO-CN |
| CIRCO | 1,020 queries with an average of 4.53 ground truths per query; for composed image retrieval. | Image, Text | CIRCO |

🎞 Video-Text

| Name | Statistics and Description | Modalities | Link |
|------|----------------------------|------------|------|
| BDD-X | 77 hours of driving videos with expert textual explanations; for explainable driving behavior. | Video, Text | BDD-X |
| YouCook2 | 2,000 cooking videos with aligned descriptions; focused on video–text tasks. | Video, Text | YouCook2 |
| ActivityNet | 20,000 videos with multiple captions; used for video understanding and captioning. | Video, Text | ActivityNet |
| SoccerNet | Videos and metadata for 550 soccer games; includes transcribed commentary and key event annotations. | Video, Text | SoccerNet |
| MSR-VTT | 10,000 videos with 20 captions each; a large video description dataset. | Video, Text | MSR-VTT |
| MSVD | 1,970 videos with approximately 40 captions per video. | Video, Text | MSVD |
| LSMDC | 118,081 video–text pairs from 202 movies; a movie description dataset. | Video, Text | LSMDC |
| DiDeMo | 10,000 videos with four concatenated captions per video; with temporal localization of events. | Video, Text | DiDeMo |
| Breakfast | 1,712 videos of breakfast preparation; one of the largest fully annotated video datasets. | Video, Text | Breakfast |
| COIN | 11,827 instructional YouTube videos across 180 tasks; for comprehensive instructional video analysis. | Video, Text | COIN |
| MSRVTT-QA | Video question answering benchmark. | Video, Text | MSRVTT-QA |
| MSVD-QA | 1,970 video clips with approximately 50.5K QA pairs; video QA dataset. | Video, Text | MSVD-QA |
| ActivityNet-QA | 58,000 human-annotated QA pairs on 5,800 videos; benchmark for video QA models. | Video, Text | ActivityNet-QA |
| EpicKitchens-100 | 700 videos (100 hours of cooking activities) for online action prediction; egocentric vision dataset. | Video, Text | EPIC-KITCHENS-100 |
| Ego4D | 4.3M video–text pairs for egocentric videos; massive-scale egocentric video dataset. | Video, Text | Ego4D |
| HowTo100M | 136M video clips with captions from 1.2M YouTube videos; for learning text–video embeddings. | Video, Text | HowTo100M |
| CharadesEgo | 68,536 activity instances from ego–exo videos; used for evaluation. | Video, Text | Charades-Ego |
| ActivityNet Captions | 20K videos with 3.7 temporally localized sentences per video; dense-captioning events in videos. | Video, Text | ActivityNet Captions |
| VATEX | 34,991 videos, each with multiple captions; a multilingual video-and-language dataset. | Video, Text | VATEX |
| Charades | 9,848 video clips with textual descriptions; a multimodal research dataset. | Video, Text | Charades |
| WebVid | 10M video–text pairs (refined to WebVid-Refined-1M). | Video, Text | WebVid |
| Youku-mPLUG | Chinese dataset with 10M video–text pairs (refined to Youku-Refined-1M). | Video, Text | Youku-mPLUG |

🔊 Audio-Text

| Name | Statistics and Description | Modalities | Link |
|------|----------------------------|------------|------|
| LibriSpeech | 1,000 hours of read English speech with corresponding text; ASR corpus based on audiobooks. | Audio, Text | LibriSpeech |
| SpeechBrown | 55K paired speech–text samples; 15 categories covering diverse topics from religion to fiction. | Audio, Text | SpeechBrown |
| AudioCaps | 46K audio clips paired with human-written text captions. | Audio, Text | AudioCaps |
| AudioSet | 2M human-labeled sound clips from YouTube across diverse audio event classes (e.g., music or environmental). | Audio | AudioSet |

🩺 Medical

| Name | Statistics and Description | Modalities | Link |
|------|----------------------------|------------|------|
| MIMIC-CXR | 125,417 labeled chest X-rays with reports; widely used for medical imaging research. | Image, Text | MIMIC-CXR |
| CheXpert | 224,316 chest radiographs of 65,240 patients; focused on medical analysis. | Image, Text | CheXpert |
| MIMIC-III | Health-related data from over 40K patients; includes clinical notes and structured data. | Text | MIMIC-III |
| IU-Xray | 7,470 pairs of chest X-rays and corresponding diagnostic reports. | Image, Text | IU-Xray |
| PubLayNet | 100,000 training samples and 2,160 test samples for document layout analysis. | Image, Text | PubLayNet |

👗 Fashion

| Name | Statistics and Description | Modalities | Link |
|------|----------------------------|------------|------|
| Fashion-IQ | 77,684 images across three categories; evaluated with Recall@10 and Recall@50 metrics. | Image, Text | Fashion-IQ |
| FashionGen | 260.5K image–text pairs of fashion images and item descriptions. | Image, Text | FashionGen |
| VITON-HD | 83K images for virtual try-on; high-resolution clothing items dataset. | Image, Text | VITON-HD |
| Fashionpedia | 48,000 fashion images annotated with segmentation masks and fine-grained attributes. | Image, Text | Fashionpedia |
| DeepFashion | Approximately 800K diverse fashion images for pseudo triplet generation. | Image, Text | DeepFashion |

💡 QA

| Name | Statistics and Description | Modalities | Link |
|------|----------------------------|------------|------|
| VQA | 400K QA pairs with images for visual question-answering tasks. | Image, Text | VQA |
| PAQ | 65M text-based QA pairs; a large-scale dataset for open-domain QA tasks. | Text | PAQ |
| ELI5 | 270K complex questions augmented with web pages and images; designed for long-form QA tasks. | Text | ELI5 |
| OK-VQA | 14K questions requiring external knowledge for visual question answering tasks. | Image, Text | OK-VQA |
| WebQA | 46K queries requiring reasoning across text and images; multimodal QA dataset. | Text, Image | WebQA |
| Infoseek | Fine-grained visual knowledge retrieval using a Wikipedia-based knowledge base (~6M passages). | Image, Text | Infoseek |
| ClueWeb22 | 10 billion web pages organized into subsets; a large-scale web corpus for retrieval tasks. | Text | ClueWeb22 |
| MOCHEG | 15,601 claims annotated with truthfulness labels and accompanied by textual and image evidence. | Text, Image | MOCHEG |
| VQA v2 | 1.1M questions (augmented with VG-QA questions) for fine-tuning VQA models. | Image, Text | VQA v2 |
| A-OKVQA | Benchmark for visual question answering using world knowledge; around 25K questions. | Image, Text | A-OKVQA |
| XL-HeadTags | 415K news headline-article pairs spanning 20 languages across six diverse language families. | Text | XL-HeadTags |
| SEED-Bench | 19K multiple-choice questions with accurate human annotations across 12 evaluation dimensions. | Text | SEED-Bench |

🌎 Other

| Name | Statistics and Description | Modalities | Link |
|------|----------------------------|------------|------|
| ImageNet | 14M labeled images across thousands of categories; used as a benchmark in computer vision research. | Image | ImageNet |
| Oxford Flowers102 | Dataset of flowers with 102 categories for fine-grained image classification tasks. | Image | Oxford Flowers102 |
| Stanford Cars | Images of different car models (five examples per model); used for fine-grained categorization tasks. | Image | Stanford Cars |
| GeoDE | 61,940 images from 40 classes across six world regions; emphasizes geographic diversity in object recognition. | Image | GeoDE |

📄 Papers

📚 RAG-related Surveys

👓 Retrieval Strategies Advances

🔍 Efficient-Search and Similarity Retrieval

❓ Maximum Inner Product Search (MIPS)
💫 Multi-Modal Encoders

🎨 Modality-Centric Retrieval

📋 Text-Centric
📸 Vision-Centric
🎥 Video-Centric
📰 Document Retrieval and Layout Understanding

🥇🥈 Re-ranking Strategies

🎯 Optimized Example Selection
🧮 Relevance Score Evaluation
⏳ Filtering Mechanisms

🛠 Fusion Mechanisms

🎰 Score Fusion and Alignment

⚔ Attention-Based Mechanisms

🧩 Unified Frameworks

🚀 Augmentation Techniques

💰 Context-Enrichment

🎡 Adaptive and Iterative Retrieval

🤖 Generation Techniques

🧠 In-Context Learning

👨‍⚖️ Reasoning

🤺 Instruction Tuning

📂 Source Attribution and Evidence Transparency

🔧 Training Strategies and Loss Functions

🛡️ Robustness and Noise Management

🛠 Tasks Addressed by Multimodal RAGs

🩺 Healthcare and Medicine

💻 Software Engineering

🕶️ Fashion and E-Commerce

🤹 Entertainment and Social Computing

🚗 Emerging Applications

📏 Evaluation Metrics

📊 Retrieval Performance

One balanced measure of retrieval performance is the minimum of precision (+P) and sensitivity (Se): a system scores well only if it is both precise and sensitive.
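
As a quick illustration of how such a balanced score can be computed, here is a small Python sketch; the function name and the choice of raw retrieval counts as inputs are ours for illustration, not part of the survey.

```python
def balanced_retrieval_score(tp: int, fp: int, fn: int) -> float:
    """Return min(precision, sensitivity) computed from retrieval counts.

    tp: relevant items that were retrieved
    fp: retrieved items that are not actually relevant
    fn: relevant items that the retriever missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0    # +P
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # Se (recall)
    return min(precision, sensitivity)


# Example: 8 relevant retrieved, 2 false positives, 4 relevant items missed.
# precision = 0.80, sensitivity ~= 0.67, so the balanced score is ~= 0.67
print(round(balanced_retrieval_score(8, 2, 4), 2))  # 0.67
```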

๐Ÿ“ Fluency and Readability

โœ… Relevance and Accuracy

๐Ÿ–ผ๏ธ Image-related Metrics

๐ŸŽต Audio-related Metrics

๐Ÿ”— Text Similarity and Overlap Metrics

๐Ÿ“Š Statistical Metrics

โš™๏ธ Efficiency and Computational Performance

๐Ÿฅ Domain-Specific Metrics


This README is a work in progress and will be completed soon. Stay tuned for more updates!


🔗 Citations

If you find our paper or repository useful, please cite the paper:

@misc{abootorabi2025askmodalitycomprehensivesurvey,
      title={Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation}, 
      author={Mohammad Mahdi Abootorabi and Amirhosein Zobeiri and Mahdi Dehghani and Mohammadali Mohammadkhani and Bardia Mohammadi and Omid Ghahroodi and Mahdieh Soleymani Baghshah and Ehsaneddin Asgari},
      year={2025},
      eprint={2502.08826},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.08826}, 
}

📧 Contact

If you have questions, please send an email to mahdi.abootorabi2@gmail.com.

โญ Star History

Star History Chart
