Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

arXiv Website

This repository is designed to collect and categorize papers related to Multimodal Retrieval-Augmented Generation (RAG) according to our survey paper: Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation. Given the rapid growth in this field, we will continuously update both the paper and this repository to serve as a resource for researchers working on future projects.

📢 News

  • February 17, 2025: We released the first survey on Multimodal Retrieval-Augmented Generation.

  • April 18, 2025: Our website for this topic is now live.

    Feel free to cite, contribute, or open a pull request to add recent related papers!

📑 List of Contents


🔎 General Pipeline

🌿 Taxonomy of Recent Advances and Enhancements

⚙ Taxonomy of Application Domains

๐Ÿ“ Abstract

Large Language Models (LLMs) struggle with hallucinations and outdated knowledge due to their reliance on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external dynamic information enhancing factual and updated grounding. Recent advances in multimodal learning have led to the development of Multimodal RAG, incorporating multiple modalities such as text, images, audio, and video to enhance the generated outputs. However, cross-modal alignment and reasoning introduce unique challenges to Multimodal RAG, distinguishing it from traditional unimodal RAG.

This survey offers a structured and comprehensive analysis of Multimodal RAG systems, covering datasets, metrics, benchmarks, evaluation, methodologies, and innovations in retrieval, fusion, augmentation, and generation. We precisely review training strategies, robustness enhancements, and loss functions, while also exploring the diverse Multimodal RAG scenarios. Furthermore, we discuss open challenges and future research directions to support advancements in this evolving field. This survey lays the foundation for developing more capable and reliable AI systems that effectively leverage multimodal dynamic external knowledge bases.
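
To make the overall flow concrete, here is a minimal, illustrative sketch of a multimodal RAG loop as described above. All class names, function names, and the `embed`/`generate` backends are hypothetical placeholders, not the interface of any specific system surveyed here.

```python
# Minimal, illustrative multimodal RAG loop (all components are hypothetical placeholders).
from dataclasses import dataclass

import numpy as np


@dataclass
class Document:
    content: str           # text, an image caption/path, a transcript, etc.
    modality: str          # "text", "image", "audio", or "video"
    embedding: np.ndarray  # vector in a shared multimodal embedding space


def retrieve(query_emb: np.ndarray, corpus: list[Document], k: int = 3) -> list[Document]:
    """Rank documents by cosine similarity to the query embedding and keep the top-k."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return sorted(corpus, key=lambda d: cosine(query_emb, d.embedding), reverse=True)[:k]


def multimodal_rag(query: str, corpus: list[Document], embed, generate) -> str:
    """embed: maps a query to the shared space; generate: any multimodal LLM call."""
    query_emb = embed(query)
    retrieved = retrieve(query_emb, corpus)
    # Fusion here is simple prompt concatenation; real systems use re-ranking,
    # score fusion, or attention-based integration (see the taxonomy above).
    context = "\n".join(f"[{d.modality}] {d.content}" for d in retrieved)
    prompt = f"Answer using the retrieved evidence.\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```

The corpus here is assumed to already be embedded into a shared space (for example, by a CLIP-style encoder); the retrieval, fusion, augmentation, and generation sections below cover the many ways each of these steps is implemented in practice.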

📊 Overview of Popular Datasets

🖼 Image-Text

| Name | Statistics and Description | Modalities | Link |
|------|----------------------------|------------|------|
| LAION-400M | 200M image–text pairs; used for pre-training multimodal models. | Image, Text | LAION-400M |
| Conceptual Captions (CC) | 15M image–caption pairs; multilingual English–German image descriptions. | Image, Text | Conceptual Captions |
| CIRR | 36,554 triplets from 21,552 images; focuses on natural image relationships. | Image, Text | CIRR |
| MS-COCO | 330K images with captions; used for caption-to-image and image-to-caption generation. | Image, Text | MS-COCO |
| Flickr30K | 31K images annotated with five English captions per image. | Image, Text | Flickr30K |
| Multi30K | 30K German captions from native speakers and human-translated captions. | Image, Text | Multi30K |
| NoCaps | For zero-shot image captioning evaluation; 15K images. | Image, Text | NoCaps |
| LAION-5B | 5B image–text pairs used as external memory for retrieval. | Image, Text | LAION-5B |
| COCO-CN | 20,341 images for cross-lingual tagging and captioning with Chinese sentences. | Image, Text | COCO-CN |
| CIRCO | 1,020 queries with an average of 4.53 ground truths per query; for composed image retrieval. | Image, Text | CIRCO |

🎞 Video-Text

| Name | Statistics and Description | Modalities | Link |
|------|----------------------------|------------|------|
| BDD-X | 77 hours of driving videos with expert textual explanations; for explainable driving behavior. | Video, Text | BDD-X |
| YouCook2 | 2,000 cooking videos with aligned descriptions; focused on video–text tasks. | Video, Text | YouCook2 |
| ActivityNet | 20,000 videos with multiple captions; used for video understanding and captioning. | Video, Text | ActivityNet |
| SoccerNet | Videos and metadata for 550 soccer games; includes transcribed commentary and key event annotations. | Video, Text | SoccerNet |
| MSR-VTT | 10,000 videos with 20 captions each; a large video description dataset. | Video, Text | MSR-VTT |
| MSVD | 1,970 videos with approximately 40 captions per video. | Video, Text | MSVD |
| LSMDC | 118,081 video–text pairs from 202 movies; a movie description dataset. | Video, Text | LSMDC |
| DiDeMo | 10,000 videos with four concatenated captions per video; with temporal localization of events. | Video, Text | DiDeMo |
| Breakfast | 1,712 videos of breakfast preparation; one of the largest fully annotated video datasets. | Video, Text | Breakfast |
| COIN | 11,827 instructional YouTube videos across 180 tasks; for comprehensive instructional video analysis. | Video, Text | COIN |
| MSRVTT-QA | Video question answering benchmark. | Video, Text | MSRVTT-QA |
| MSVD-QA | 1,970 video clips with approximately 50.5K QA pairs; video QA dataset. | Video, Text | MSVD-QA |
| ActivityNet-QA | 58,000 human-annotated QA pairs on 5,800 videos; benchmark for video QA models. | Video, Text | ActivityNet-QA |
| EpicKitchens-100 | 700 videos (100 hours of cooking activities) for online action prediction; egocentric vision dataset. | Video, Text | EPIC-KITCHENS-100 |
| Ego4D | 4.3M video–text pairs for egocentric videos; massive-scale egocentric video dataset. | Video, Text | Ego4D |
| HowTo100M | 136M video clips with captions from 1.2M YouTube videos; for learning text–video embeddings. | Video, Text | HowTo100M |
| CharadesEgo | 68,536 activity instances from ego–exo videos; used for evaluation. | Video, Text | Charades-Ego |
| ActivityNet Captions | 20K videos with 3.7 temporally localized sentences per video; dense-captioning events in videos. | Video, Text | ActivityNet Captions |
| VATEX | 34,991 videos, each with multiple captions; a multilingual video-and-language dataset. | Video, Text | VATEX |
| Charades | 9,848 video clips with textual descriptions; a multimodal research dataset. | Video, Text | Charades |
| WebVid | 10M video–text pairs (refined to WebVid-Refined-1M). | Video, Text | WebVid |
| Youku-mPLUG | Chinese dataset with 10M video–text pairs (refined to Youku-Refined-1M). | Video, Text | Youku-mPLUG |

🔊 Audio-Text

| Name | Statistics and Description | Modalities | Link |
|------|----------------------------|------------|------|
| LibriSpeech | 1,000 hours of read English speech with corresponding text; ASR corpus based on audiobooks. | Audio, Text | LibriSpeech |
| SpeechBrown | 55K paired speech–text samples; 15 categories covering diverse topics from religion to fiction. | Audio, Text | SpeechBrown |
| AudioCaps | 46K audio clips paired with human-written text captions. | Audio, Text | AudioCaps |
| AudioSet | 2M human-labeled sound clips from YouTube across diverse audio event classes (e.g., music or environmental). | Audio | AudioSet |

🩺 Medical

| Name | Statistics and Description | Modalities | Link |
|------|----------------------------|------------|------|
| MIMIC-CXR | 125,417 labeled chest X-rays with reports; widely used for medical imaging research. | Image, Text | MIMIC-CXR |
| CheXpert | 224,316 chest radiographs of 65,240 patients; focused on medical analysis. | Image, Text | CheXpert |
| MIMIC-III | Health-related data from over 40K patients; includes clinical notes and structured data. | Text | MIMIC-III |
| IU-Xray | 7,470 pairs of chest X-rays and corresponding diagnostic reports. | Image, Text | IU-Xray |
| PubLayNet | 100,000 training samples and 2,160 test samples for document layout analysis. | Image, Text | PubLayNet |

👗 Fashion

| Name | Statistics and Description | Modalities | Link |
|------|----------------------------|------------|------|
| Fashion-IQ | 77,684 images across three categories; evaluated with Recall@10 and Recall@50 metrics. | Image, Text | Fashion-IQ |
| FashionGen | 260.5K image–text pairs of fashion images and item descriptions. | Image, Text | FashionGen |
| VITON-HD | 83K images for virtual try-on; high-resolution clothing items dataset. | Image, Text | VITON-HD |
| Fashionpedia | 48,000 fashion images annotated with segmentation masks and fine-grained attributes. | Image, Text | Fashionpedia |
| DeepFashion | Approximately 800K diverse fashion images for pseudo triplet generation. | Image, Text | DeepFashion |

💡 QA

| Name | Statistics and Description | Modalities | Link |
|------|----------------------------|------------|------|
| VQA | 400K QA pairs with images for visual question-answering tasks. | Image, Text | VQA |
| PAQ | 65M text-based QA pairs; a large-scale dataset for open-domain QA tasks. | Text | PAQ |
| ELI5 | 270K complex questions augmented with web pages and images; designed for long-form QA tasks. | Text | ELI5 |
| OK-VQA | 14K questions requiring external knowledge for visual question answering tasks. | Image, Text | OK-VQA |
| WebQA | 46K queries requiring reasoning across text and images; multimodal QA dataset. | Text, Image | WebQA |
| Infoseek | Fine-grained visual knowledge retrieval using a Wikipedia-based knowledge base (~6M passages). | Image, Text | Infoseek |
| ClueWeb22 | 10 billion web pages organized into subsets; a large-scale web corpus for retrieval tasks. | Text | ClueWeb22 |
| MOCHEG | 15,601 claims annotated with truthfulness labels and accompanied by textual and image evidence. | Text, Image | MOCHEG |
| VQA v2 | 1.1M questions (augmented with VG-QA questions) for fine-tuning VQA models. | Image, Text | VQA v2 |
| A-OKVQA | Benchmark for visual question answering using world knowledge; around 25K questions. | Image, Text | A-OKVQA |
| XL-HeadTags | 415K news headline-article pairs spanning 20 languages across six diverse language families. | Text | XL-HeadTags |
| SEED-Bench | 19K multiple-choice questions with accurate human annotations across 12 evaluation dimensions. | Text | SEED-Bench |

🌎 Other

| Name | Statistics and Description | Modalities | Link |
|------|----------------------------|------------|------|
| ImageNet | 14M labeled images across thousands of categories; used as a benchmark in computer vision research. | Image | ImageNet |
| Oxford Flowers102 | Dataset of flowers with 102 categories for fine-grained image classification tasks. | Image | Oxford Flowers102 |
| Stanford Cars | Images of different car models (five examples per model); used for fine-grained categorization tasks. | Image | Stanford Cars |
| GeoDE | 61,940 images from 40 classes across six world regions; emphasizes geographic diversity in object recognition. | Image | GeoDE |

📄 Papers

📚 RAG-related Surveys

👓 Retrieval Strategies Advances

🔍 Efficient-Search and Similarity Retrieval

❓ Maximum Inner Product Search (MIPS)
💫 Multi-Modal Encoders

🎨 Modality-Centric Retrieval

📋 Text-Centric
📸 Vision-Centric
🎥 Video-Centric
📰 Document Retrieval and Layout Understanding

🥇🥈 Re-ranking Strategies

🎯 Optimized Example Selection
🧮 Relevance Score Evaluation
⏳ Filtering Mechanisms

🛠 Fusion Mechanisms

🎰 Score Fusion and Alignment

⚔ Attention-Based Mechanisms

🧩 Unified Frameworks

🚀 Augmentation Techniques

💰 Context-Enrichment

🎡 Adaptive and Iterative Retrieval

🤖 Generation Techniques

🧠 In-Context Learning

👨‍⚖️ Reasoning

🤺 Instruction Tuning

📂 Source Attribution and Evidence Transparency

🔧 Training Strategies and Loss Functions

🛡️ Robustness and Noise Management

🛠 Tasks Addressed by Multimodal RAGs

🩺 Healthcare and Medicine

💻 Software Engineering

🕶️ Fashion and E-Commerce

🤹 Entertainment and Social Computing

🚗 Emerging Applications

📏 Evaluation Metrics

📊 Retrieval Performance

One balanced measure of retrieval performance is the minimum of precision (+P) and sensitivity (Se): a system scores well only if it is both precise and sensitive.
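
As a quick illustration of how such a balanced score can be computed, here is a small Python sketch; the function name and the choice of raw retrieval counts as inputs are ours for illustration, not part of the survey.

```python
def balanced_retrieval_score(tp: int, fp: int, fn: int) -> float:
    """Return min(precision, sensitivity) computed from retrieval counts.

    tp: relevant items that were retrieved
    fp: retrieved items that are not actually relevant
    fn: relevant items that the retriever missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0    # +P
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # Se (recall)
    return min(precision, sensitivity)


# Example: 8 relevant retrieved, 2 false positives, 4 relevant items missed.
# precision = 0.80, sensitivity ~= 0.67, so the balanced score is ~= 0.67
print(round(balanced_retrieval_score(8, 2, 4), 2))  # 0.67
```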

๐Ÿ“ Fluency and Readability

โœ… Relevance and Accuracy

๐Ÿ–ผ๏ธ Image-related Metrics

๐ŸŽต Audio-related Metrics

๐Ÿ”— Text Similarity and Overlap Metrics

๐Ÿ“Š Statistical Metrics

โš™๏ธ Efficiency and Computational Performance

๐Ÿฅ Domain-Specific Metrics


This README is a work in progress and will be completed soon. Stay tuned for more updates!


🔗 Citations

If you find our paper or repository useful, please cite the paper:

@misc{abootorabi2025askmodalitycomprehensivesurvey,
      title={Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation}, 
      author={Mohammad Mahdi Abootorabi and Amirhosein Zobeiri and Mahdi Dehghani and Mohammadali Mohammadkhani and Bardia Mohammadi and Omid Ghahroodi and Mahdieh Soleymani Baghshah and Ehsaneddin Asgari},
      year={2025},
      eprint={2502.08826},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.08826}, 
}

📧 Contact

If you have questions, please send an email to mahdi.abootorabi2@gmail.com.

โญ Star History

Star History Chart
