This repository is designed to collect and categorize papers related to Multimodal Retrieval-Augmented Generation (RAG) according to our survey paper: Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation. Given the rapid growth in this field, we will continuously update both the paper and this repository to serve as a resource for researchers working on future projects.
- February 17, 2025: We released the first survey on Multimodal Retrieval-Augmented Generation.
- April 18, 2025: Our website for this topic is now live.
Feel free to cite, contribute, or open a pull request to add recent related papers!
- General Pipeline
- Taxonomy of Recent Advances and Enhancements
- Taxonomy of Application Domains
- Abstract
- Overview of Popular Datasets
- Papers
- RAG-related Surveys
- Retrieval Strategies Advances
- Fusion Mechanisms
- Augmentation Techniques
- Generation Techniques
- Training Strategies and Loss Function
- Robustness and Noise Management
- Tasks Addressed by Multimodal RAGs
- Evaluation Metrics
- Citations
- Contact
Large Language Models (LLMs) struggle with hallucinations and outdated knowledge because they rely on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external, dynamic information, improving factual grounding and keeping outputs up to date. Recent advances in multimodal learning have led to Multimodal RAG, which incorporates multiple modalities such as text, images, audio, and video to enrich the generated outputs. However, cross-modal alignment and reasoning introduce challenges unique to Multimodal RAG, distinguishing it from traditional unimodal RAG.
This survey offers a structured and comprehensive analysis of Multimodal RAG systems, covering datasets, metrics, benchmarks, evaluation, methodologies, and innovations in retrieval, fusion, augmentation, and generation. We review training strategies, robustness enhancements, and loss functions in detail, and explore the diverse Multimodal RAG scenarios. Furthermore, we discuss open challenges and future research directions to support advancements in this evolving field. This survey lays the foundation for developing more capable and reliable AI systems that effectively leverage multimodal dynamic external knowledge bases.
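To make the retrieve-then-generate flow concrete, here is a minimal, hypothetical sketch of a multimodal RAG loop (not taken from any surveyed system): embed the query, retrieve the top-k most similar multimodal documents by cosine similarity, and prepend the retrieved evidence to the generator prompt. The `embed` and `generate` functions below are placeholders standing in for a real multimodal encoder (e.g., a CLIP-style model) and a real LLM/MLLM.

```python
import numpy as np

def embed(item: str) -> np.ndarray:
    """Placeholder encoder: hash-seeded pseudo-embedding (stand-in for CLIP/BLIP)."""
    rng = np.random.default_rng(abs(hash(item)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def retrieve(query: str, corpus: list[dict], k: int = 2) -> list[dict]:
    """Rank corpus entries by cosine similarity to the query embedding."""
    q = embed(query)
    scored = sorted(corpus, key=lambda d: float(q @ d["embedding"]), reverse=True)
    return scored[:k]

def generate(prompt: str) -> str:
    """Placeholder generator: a real system would call an LLM/MLLM here."""
    return f"[model answer conditioned on]\n{prompt}"

# Toy multimodal knowledge base: each entry carries a caption plus a pointer to its modality.
corpus = [
    {"caption": "X-ray of a fractured wrist", "modality": "image", "uri": "xray_001.png"},
    {"caption": "Transcript: cooking pasta step by step", "modality": "video", "uri": "clip_17.mp4"},
]
for doc in corpus:
    doc["embedding"] = embed(doc["caption"])

query = "What does a wrist fracture look like on an X-ray?"
evidence = retrieve(query, corpus)
context = "\n".join(f"[{d['modality']}] {d['caption']} ({d['uri']})" for d in evidence)
print(generate(f"Context:\n{context}\n\nQuestion: {query}"))
```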
Name | Statistics and Description | Modalities | Link |
---|---|---|---|
LAION-400M | 400M image-text pairs; used for pre-training multimodal models. | Image, Text | LAION-400M |
Conceptual-Captions (CC) | 15M image-caption pairs; multilingual English-German image descriptions. | Image, Text | Conceptual Captions |
CIRR | 36,554 triplets from 21,552 images; focuses on natural image relationships. | Image, Text | CIRR |
MS-COCO | 330K images with captions; used for caption-to-image and image-to-caption generation. | Image, Text | MS-COCO |
Flickr30K | 31K images annotated with five English captions per image. | Image, Text | Flickr30K |
Multi30K | 30K German captions from native speakers and human-translated captions. | Image, Text | Multi30K |
NoCaps | For zero-shot image captioning evaluation; 15K images. | Image, Text | NoCaps |
LAION-5B | 5B image-text pairs used as external memory for retrieval. | Image, Text | LAION-5B |
COCO-CN | 20,341 images for cross-lingual tagging and captioning with Chinese sentences. | Image, Text | COCO-CN |
CIRCO | 1,020 queries with an average of 4.53 ground truths per query; for composed image retrieval. | Image, Text | CIRCO |
Name | Statistics and Description | Modalities | Link |
---|---|---|---|
BDD-X | 77 hours of driving videos with expert textual explanations; for explainable driving behavior. | Video, Text | BDD-X |
YouCook2 | 2,000 cooking videos with aligned descriptions; focused on video-text tasks. | Video, Text | YouCook2 |
ActivityNet | 20,000 videos with multiple captions; used for video understanding and captioning. | Video, Text | ActivityNet |
SoccerNet | Videos and metadata for 550 soccer games; includes transcribed commentary and key event annotations. | Video, Text | SoccerNet |
MSR-VTT | 10,000 videos with 20 captions each; a large video description dataset. | Video, Text | MSR-VTT |
MSVD | 1,970 videos with approximately 40 captions per video. | Video, Text | MSVD |
LSMDC | 118,081 video-text pairs from 202 movies; a movie description dataset. | Video, Text | LSMDC |
DiDemo | 10,000 videos with four concatenated captions per video; with temporal localization of events. | Video, Text | DiDemo |
Breakfast | 1,712 videos of breakfast preparation; one of the largest fully annotated video datasets. | Video, Text | Breakfast |
COIN | 11,827 instructional YouTube videos across 180 tasks; for comprehensive instructional video analysis. | Video, Text | COIN |
MSRVTT-QA | 10K videos with approximately 243K QA pairs; video question answering benchmark. | Video, Text | MSRVTT-QA |
MSVD-QA | 1,970 video clips with approximately 50.5K QA pairs; video QA dataset. | Video, Text | MSVD-QA |
ActivityNet-QA | 58,000 human-annotated QA pairs on 5,800 videos; benchmark for video QA models. | Video, Text | ActivityNet-QA |
EpicKitchens-100 | 700 videos (100 hours of cooking activities) for online action prediction; egocentric vision dataset. | Video, Text | EPIC-KITCHENS-100 |
Ego4D | 4.3M video-text pairs for egocentric videos; massive-scale egocentric video dataset. | Video, Text | Ego4D |
HowTo100M | 136M video clips with captions from 1.2M YouTube videos; for learning text-video embeddings. | Video, Text | HowTo100M |
CharadesEgo | 68,536 activity instances from ego-exo videos; used for evaluation. | Video, Text | Charades-Ego |
ActivityNet Captions | 20K videos with 3.7 temporally localized sentences per video; dense-captioning events in videos. | Video, Text | ActivityNet Captions |
VATEX | 34,991 videos, each with multiple captions; a multilingual video-and-language dataset. | Video, Text | VATEX |
Charades | 9,848 video clips with textual descriptions; a multimodal research dataset. | Video, Text | Charades |
WebVid | 10M video-text pairs (refined to WebVid-Refined-1M). | Video, Text | WebVid |
Youku-mPLUG | Chinese dataset with 10M video-text pairs (refined to Youku-Refined-1M). | Video, Text | Youku-mPLUG |
Name | Statistics and Description | Modalities | Link |
---|---|---|---|
LibriSpeech | 1,000 hours of read English speech with corresponding text; ASR corpus based on audiobooks. | Audio, Text | LibriSpeech |
SpeechBrown | 55K paired speech-text samples; 15 categories covering diverse topics from religion to fiction. | Audio, Text | SpeechBrown |
AudioCaps | 46K audio clips paired with human-written text captions. | Audio, Text | AudioCaps |
AudioSet | 2M human-labeled sound clips from YouTube across diverse audio event classes (e.g., music or environmental). | Audio | AudioSet |
Name | Statistics and Description | Modalities | Link |
---|---|---|---|
MIMIC-CXR | 125,417 labeled chest X-rays with reports; widely used for medical imaging research. | Image, Text | MIMIC-CXR |
CheXpert | 224,316 chest radiographs of 65,240 patients; focused on medical analysis. | Image, Text | CheXpert |
MIMIC-III | Health-related data from over 40K patients; includes clinical notes and structured data. | Text | MIMIC-III |
IU-Xray | 7,470 pairs of chest X-rays and corresponding diagnostic reports. | Image, Text | IU-Xray |
PubLayNet | Document layout analysis dataset; a subset of 100,000 training and 2,160 test samples is commonly used. | Image, Text | PubLayNet |
Name | Statistics and Description | Modalities | Link |
---|---|---|---|
Fashion-IQ | 77,684 images across three categories; evaluated with Recall@10 and Recall@50 metrics. | Image, Text | Fashion-IQ |
FashionGen | 260.5K image-text pairs of fashion images and item descriptions. | Image, Text | FashionGen |
VITON-HD | 83K images for virtual try-on; high-resolution clothing items dataset. | Image, Text | VITON-HD |
Fashionpedia | 48,000 fashion images annotated with segmentation masks and fine-grained attributes. | Image, Text | Fashionpedia |
DeepFashion | Approximately 800K diverse fashion images for pseudo triplet generation. | Image, Text | DeepFashion |
Name | Statistics and Description | Modalities | Link |
---|---|---|---|
VQA | 400K QA pairs with images for visual question-answering tasks. | Image, Text | VQA |
PAQ | 65M text-based QA pairs; a large-scale dataset for open-domain QA tasks. | Text | PAQ |
ELI5 | 270K complex questions augmented with web pages and images; designed for long-form QA tasks. | Text | ELI5 |
OK-VQA | 14K questions requiring external knowledge for visual question answering tasks. | Image, Text | OK-VQA |
WebQA | 46K queries requiring reasoning across text and images; multimodal QA dataset. | Text, Image | WebQA |
Infoseek | Fine-grained visual knowledge retrieval using a Wikipedia-based knowledge base (~6M passages). | Image, Text | Infoseek |
ClueWeb22 | 10 billion web pages organized into subsets; a large-scale web corpus for retrieval tasks. | Text | ClueWeb22 |
MOCHEG | 15,601 claims annotated with truthfulness labels and accompanied by textual and image evidence. | Text, Image | MOCHEG |
VQA v2 | 1.1M questions (augmented with VG-QA questions) for fine-tuning VQA models. | Image, Text | VQA v2 |
A-OKVQA | Benchmark for visual question answering using world knowledge; around 25K questions. | Image, Text | A-OKVQA |
XL-HeadTags | 415K news headline-article pairs spanning 20 languages across six diverse language families. | Text | XL-HeadTags |
SEED-Bench | 19K multiple-choice questions with accurate human annotations across 12 evaluation dimensions. | Text | SEED-Bench |
Name | Statistics and Description | Modalities | Link |
---|---|---|---|
ImageNet | 14M labeled images across thousands of categories; used as a benchmark in computer vision research. | Image | ImageNet |
Oxford Flowers102 | Dataset of flowers with 102 categories for fine-grained image classification tasks. | Image | Oxford Flowers102 |
Stanford Cars | Images of different car models (five examples per model); used for fine-grained categorization tasks. | Image | Stanford Cars |
GeoDE | 61,940 images from 40 classes across six world regions; emphasizes geographic diversity in object recognition. | Image | GeoDE |
- Retrieval-Augmented Generation for Large Language Models: A Survey
- Benchmarking Large Language Models in Retrieval-Augmented Generation
- Old IR Methods Meet RAG
- A Survey on Retrieval-Augmented Text Generation
- Graph Retrieval-Augmented Generation: A Survey
- A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models
- RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing
- Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make Your LLMs Use External Data More Wisely
- Searching for Best Practices in Retrieval-Augmented Generation
- Retrieval-Augmented Generation for Natural Language Processing: A Survey
- A Survey on Retrieval-Augmented Text Generation for Large Language Models
- Graph Retrieval-Augmented Generation for Large Language Models: A Survey
- Trustworthiness in Retrieval-Augmented Generation Systems: A Survey
- Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG
- ADQ: Adaptive Dataset Quantization
- Query-Aware Quantization for Maximum Inner Product Search
- TPU-KNN: K Nearest Neighbor Search at Peak FLOP/s
- ScaNN: Accelerating large-scale inference with anisotropic vector quantization
- BanditMIPS: Faster Maximum Inner Product Search in High Dimensions
- MUST: An Effective and Scalable Framework for Multimodal Search of Target Modality
- FARGO: Fast Maximum Inner Product Search via Global Multi-Probing
- MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text
- RA-CM3: Retrieval-Augmented Multimodal Language Modeling
- Efficient and Effective Retrieval of Dense-Sparse Hybrid Vectors using Graph-based Approximate Nearest Neighbor Search
- Revisiting Neural Retrieval on Accelerators
- DeeperImpact: Optimizing Sparse Learned Index Structures
- RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
- Fact-Aware Multimodal Retrieval Augmentation for Accurate Medical Radiology Report Generation
- Mi-RAG: Multi-Level Information Retrieval Augmented Generation for Knowledge-based Visual Question Answering
- Ovis: Structural Embedding Alignment for Multimodal Large Language Model
- GME: Improving Universal Multimodal Retrieval by Multimodal LLMs
- MARVEL: Unlocking the Multi-Modal Capability of Dense Retrieval via Visual Module Plugin
- VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning
- UniIR: Training and Benchmarking Universal Multimodal Information Retrievers
- UniVL-DR: Universal Vision-Language Dense Retrieval: Learning A Unified Representation Space for Multi-Modal Retrieval
- FLAVA: A Foundational Language And Vision Alignment Model
- ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
- CLIP: Learning Transferable Visual Models From Natural Language Supervision
- M2RAG: Multi-modal Retrieval Augmented Multi-modal Generation: Datasets, Evaluation Metrics and Strong Baselines
- CRAG: Corrective Retrieval Augmented Generation
- RAFT: Adapting Language Model to Domain Specific RAG
- PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers
- CRAG: Corrective Retrieval Augmented Generation
- OMG-QA: Building Open-Domain Multi-Modal Generative Question Answering Systems
- XL-HeadTags: Leveraging Multimodal Retrieval Augmentation for the Multilingual Generation of News Headlines and Tags
- BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
- Re-Imagen: Retrieval-Augmented Text-to-Image Generator
- GTE: Towards General Text Embeddings with Multi-stage Contrastive Learning
- Contriever: Unsupervised Dense Information Retrieval with Contrastive Learning
- VISA: Retrieval Augmented Generation with Visual Source Attribution
- Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering
- EchoSight: Advancing Visual-Language Models with Wiki Knowledge
- XL-HeadTags: Leveraging Multimodal Retrieval Augmentation for the Multilingual Generation of News Headlines and Tags
- RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training
- eCLIP: Improving Medical Multi-modal Contrastive Learning with Expert Annotations
- Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval
- Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval
- UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation
- VQA4CIR: Boosting Composed Image Retrieval with Visual Question Answering
- VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos
- VideoRAG: Retrieval-Augmented Generation over Video Corpus
- iRAG: Advancing RAG for Videos with an Incremental Approach
- Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
- Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval
- MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval
- OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer
- Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
- CTCH: Contrastive Transformer Cross-Modal Hashing for Video-Text Retrieval
- Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval
- DrVideo: Document Retrieval Based Long Video Understanding
- ColPali: Efficient Document Retrieval with Vision Language Models
- ColQwen2: Enhancing Vision-Language Model's Perception of the World at Any Resolution
- M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding
- ViTLP: Visually Guided Generative Text-Layout Pre-training for Document Intelligence
- DocLLM: A Layout-Aware Generative Language Model for Multimodal Document Understanding
- CREAM: Coarse-to-Fine Retrieval and Multi-modal Efficient Tuning for Document VQA
- mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
- mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding
- VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation
- DSE: Unifying Multimodal Retrieval via Document Screenshot Embedding
- Robust Multi Model RAG Pipeline For Documents Containing Text, Table & Images
- SV-RAG: LoRA-Contextualizing Adaptation of MLLMs for Long Document Understanding
- MSIER: How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning?
- Hybrid RAG-empowered Multi-modal LLM for Secure Data Management in Internet of Medical Things: A Diffusion-based Contract Approach
- RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models
- RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training
- M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
- RAG-Check: Evaluating Multimodal Retrieval Augmented Generation Performance
- Re-ranking the Context for Multimodal Retrieval Augmented Generation
- UniRaG: Unification, Retrieval, and Generation for Multimodal Question Answering With Pre-Trained Language Models
- mR2AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA
- LDRE: LLM-based Divergent Reasoning and Ensemble for Zero-Shot Composed Image Retrieval
- RAGTrans: Retrieval-Augmented Hypergraph for Multimodal Social Media Popularity Prediction
- OMG-QA: Building Open-Domain Multi-Modal Generative Question Answering Systems
- EgoInstructor: Retrieval-Augmented Egocentric Video Captioning
- MAIN-RAG: Multi-Agent Filtering Retrieval-Augmented Generation
- MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs
- GME: Improving Universal Multimodal Retrieval by Multimodal LLMs
- MuRAR: A Simple and Effective Multimodal Retrieval and Answer Refinement Framework for Multimodal Question Answering
- RAFT: Adapting Language Model to Domain Specific RAG
- UniRaG: Unification, Retrieval, and Generation for Multimodal Question Answering With Pre-Trained Language Models
- UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models
- Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications
- Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs
- C3Net: Compound Conditioned ControlNet for Multimodal Content Generation
- MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
- VISA: Retrieval Augmented Generation with Visual Source Attribution
- REVEAL: Retrieval-Augmented Visual-Language Pre-Training With Multi-Source Multimodal Knowledge Memory
- MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
- Large Language Models Know What is Key Visual Entity: An LLM-assisted Multimodal Retrieval for VQA
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training
- Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering
- Re-Imagen: Retrieval-Augmented Text-to-Image Generator
- VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
- RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model
- EMERGE: Integrating RAG for Improved Multimodal EHR Predictive Modeling
- MORE: Multi-mOdal REtrieval Augmented Generative Commonsense Reasoning
- AlzheimerRAG: Multimodal Retrieval Augmented Generation for PubMed articles
- RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training
- Retrieval-Augmented Hypergraph for Multimodal Social Media Popularity Prediction
- MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval
- M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
- Retrieval-Augmented Egocentric Video Captioning
- MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text
- Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
- Hybrid RAG-Empowered Multi-Modal LLM for Secure Data Management in Internet of Medical Things: A Diffusion-Based Contract Approach
- Iterative Retrieval Augmentation for Multi-Modal Knowledge Integration and Generation
- M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding
- PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering
- Simple but Effective Raw-Data Level Multimodal Fusion for Composed Image Retrieval
- Self-adaptive Multimodal Retrieval-Augmented Generation
- UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity - CVPR 2024
- Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control
- EMERGE: Enhancing Multimodal Electronic Health Records Predictive Modeling with Retrieval-Augmented Generation
- Multi-Level Information Retrieval Augmented Generation for Knowledge-based Visual Question Answering
- Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs
- Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
- Img2Loc: Revisiting Image Geolocalization Using Multi-Modality Foundation Models and Image-Based Retrieval-Augmented Generation
- Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering
- Enhancing Multi-modal Multi-hop Question Answering via Structured Knowledge and Unified Retrieval-Generation
- Iterative Retrieval Augmentation for Multi-Modal Knowledge Integration and Generation
- OMG-QA: Building Open-Domain Multi-Modal Generative Question Answering Systems
- Self-adaptive Multimodal Retrieval-Augmented Generation
- MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models
- Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent
- mR2AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA
- RAGAR, Your Falsehood Radar: RAG-Augmented Reasoning for Political Fact-Checking using Multimodal Large Language Models
- Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning
- UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models
- Retrieval-Augmented Multimodal Language Modeling (RA-CM3)
- RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model
- How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning? (MSIER)
- RAVEN: Multitask Retrieval Augmented Vision-Language Learning
- RAGAR, Your Falsehood Radar: RAG-Augmented Reasoning for Political Fact-Checking using Multimodal Large Language Models
- VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation
- Self-adaptive Multimodal Retrieval-Augmented Generation
- LDRE: LLM-based Divergent Reasoning and Ensemble for Zero-Shot Composed Image Retrieval
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training
- InstructBLIP: towards general-purpose vision-language models with instruction tuning
- Retrieval-Augmented Dynamic Prompt Tuning for Incomplete Multimodal Learning
- mR2AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA
- MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training (RagVL)
- Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval
- MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models
- MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
- SURf: Teaching Large Vision-Language Models to Selectively Utilize Retrieved Information
- RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models
- MuRAR: A Simple and Effective Multimodal Retrieval and Answer Refinement Framework for Multimodal Question Answering
- VISA: Retrieval Augmented Generation with Visual Source Attribution
- OMG-QA: Building Open-Domain Multi-Modal Generative Question Answering Systems
- REVEAL: Retrieval-Augmented Visual-Language Pre-Training With Multi-Source Multimodal Knowledge Memory (April 2023)
- Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control (February 2024)
- HACL: Hallucination Augmented Contrastive Learning for Multimodal Large Language Model (February 2024)
- UniRaG: Unification, Retrieval, and Generation for Multimodal Question Answering With Pre-Trained Language Models (May 2024)
- Improving Medical Multi-modal Contrastive Learning with Expert Annotations (November 2024)
- EchoSight: Advancing Visual-Language Models with Wiki Knowledge (November 2024)
- RA-CM3: Retrieval-Augmented Multimodal Language Modeling (January 2023)
- MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training (RagVL) (July 2024)
- MORE: Multi-mOdal REtrieval Augmented Generative Commonsense Reasoning (August 2024)
- RAGTrans: Retrieval-Augmented Hypergraph for Multimodal Social Media Popularity Prediction (August 2024)
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (October 2024)
- Quantifying the Gaps Between Translation and Native Perception in Training for Multimodal, Multilingual Retrieval (November 2024)
- AlzheimerRAG: Multimodal Retrieval Augmented Generation for PubMed articles (December 2024)
- UniRaG: Unification, Retrieval, and Generation for Multimodal Question Answering With Pre-Trained Language Models
- REVEAL: Retrieval-Augmented Visual-Language Pre-Training With Multi-Source Multimodal Knowledge Memory
- RAVEN: Multitask Retrieval Augmented Vision-Language Learning
- Retrieval-Augmented Multimodal Language Modeling (RA-CM3)
- Re-Imagen: Retrieval-Augmented Text-to-Image Generator
- A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models
- MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training (RagVL)
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training
- RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training
- Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
- RAGAR, Your Falsehood Radar: RAG-Augmented Reasoning for Political Fact-Checking using Multimodal Large Language Models
- Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval
- M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
- Self-adaptive Multimodal Retrieval-Augmented Generation
- MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models
- RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models
- AsthmaBot: Multi-modal, Multi-Lingual Retrieval Augmented Generation For Asthma Patient Support
- REALM: RAG-Driven Enhancement of Multimodal Electronic Health Records Analysis via Large Language Models
- Hybrid RAG-Empowered Multi-Modal LLM for Secure Data Management in Internet of Medical Things: A Diffusion-Based Contract Approach
- Fact-Aware Multimodal Retrieval Augmentation for Accurate Medical Radiology Report Generation
- DocPrompting: Generating Code by Retrieving the Docs
- RACE: Retrieval-Augmented Commit Message Generation
- Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning
- Retrieval Augmented Code Generation and Summarization
- UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation
- Multi-modal Retrieval Augmented Generation for Product Query
- LLM4DESIGN: An Automated Multi-Modal System for Architectural and Environmental Design
- SoccerRAG: Multimodal Soccer Information Retrieval via Natural Queries
- Predicting Micro-video Popularity via Multi-modal Retrieval Augmentation
- RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model
- ENWAR: A RAG-empowered Multi-Modal LLM Framework for Wireless Environment Perception
- Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications
- Img2Loc: Revisiting Image Geolocalization Using Multi-Modality Foundation Models and Image-Based Retrieval-Augmented Generation
- Recall@K, Precision@K, F1 Score, and MRR (see the computation sketch after this list):
- VQA4CIR: Boosting Composed Image Retrieval with Visual Question Answering
- UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation
- MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training (RagVL)
- Large Language Models Know What is Key Visual Entity: An LLM-assisted Multimodal Retrieval for VQA
- Self-adaptive Multimodal Retrieval-Augmented Generation
- MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval
- Retrieval-Augmented Egocentric Video Captioning
- Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval
- Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval
- M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
- Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control
- Simple but Effective Raw-Data Level Multimodal Fusion for Composed Image Retrieval
- Multi-Level Information Retrieval Augmented Generation for Knowledge-based Visual Question Answering
- RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models
- Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
- MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text
- REALM: RAG-Driven Enhancement of Multimodal Electronic Health Records Analysis via Large Language Models
- EMERGE: Integrating RAG for Improved Multimodal EHR Predictive Modeling
- UniRaG: Unification, Retrieval, and Generation for Multimodal Question Answering With Pre-Trained Language Models
- RAGAR, Your Falsehood Radar: RAG-Augmented Reasoning for Political Fact-Checking using Multimodal Large Language Models
- Iterative Retrieval Augmentation for Multi-Modal Knowledge Integration and Generation
- OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
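A minimal sketch of how these retrieval metrics are typically computed from a ranked result list. The definitions follow standard IR conventions rather than any single paper's variant, and the example data below are invented.

```python
def precision_recall_at_k(ranked: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    """Precision@K and Recall@K for one query given a ranked list and a relevance set."""
    top_k = ranked[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

def mrr(queries: list[tuple[list[str], set[str]]]) -> float:
    """Mean Reciprocal Rank: average of 1 / rank of the first relevant result per query."""
    total = 0.0
    for ranked, relevant in queries:
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)

ranked = ["d3", "d1", "d7", "d2"]   # system ranking for one query
relevant = {"d1", "d2"}             # ground-truth relevant documents
p, r = precision_recall_at_k(ranked, relevant, k=2)
print(p, r, f1(p, r), mrr([(ranked, relevant)]))  # 0.5 0.5 0.5 0.5
```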
- min(+P, Se): the minimum of precision (+P) and sensitivity (Se), providing a balanced measure of model performance (see the sketch after this list).
- REALM: RAG-Driven Enhancement of Multimodal Electronic Health Records Analysis via Large Language Models
- EMERGE: Integrating RAG for Improved Multimodal EHR Predictive Modeling
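For reference, a small sketch of min(+P, Se) computed from raw confusion-matrix counts; the numbers are an invented example, not drawn from either paper above.

```python
def min_p_se(tp: int, fp: int, fn: int) -> float:
    """min(+P, Se): +P is precision (positive predictive value), Se is sensitivity (recall)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return min(precision, sensitivity)

print(min_p_se(tp=80, fp=20, fn=40))  # precision 0.80, sensitivity ~0.67 -> 0.67
```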
- Fluency (FL):
- Accuracy:
- MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text
- How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning?
- RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models
- Advanced Embedding Techniques in Multimodal Retrieval Augmented Generation: A Comprehensive Study on Cross Modal AI Applications
- UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models
- MRAG-BENCH: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models
- Enhancing Textbook Question Answering Task with Large Language Models and Retrieval Augmented Generation
- MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-Augmented Generation via Knowledge-Enhanced Reranking and Noise-Injected Training (RagVL)
- RAVEN: Multitask Retrieval Augmented Vision-Language Learning
- mR2AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA
- Iterative Retrieval Augmentation for Multi-Modal Knowledge Integration and Generation
- Retrieval Meets Reasoning: Even High-School Textbook Knowledge Benefits Multimodal Reasoning
- Fréchet Inception Distance (FID), CLIP Score, Kernel Inception Distance (KID), and Inception Score (IS) (a CLIP Score sketch follows this list):
- Retrieval-Augmented Multimodal Language Modeling (RA-CM3)
- UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation
- UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models
- C3Net: Compound Conditioned ControlNet for Multimodal Content Generation
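A hedged sketch of a CLIP-Score-style image-text alignment measure, assuming the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint are available; the 2.5 * max(cosine, 0) scaling is one common convention, not something prescribed by the papers above. FID, KID, and IS require separate Inception-based tooling that is not shown here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings, rescaled per a common convention."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    cosine = float((img * txt).sum())
    return 2.5 * max(cosine, 0.0)

# Example usage (assumes a local image file exists):
# print(clip_score(Image.open("generated.png"), "a red bicycle"))
```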
- Consensus-Based Image Description Evaluation (CIDEr):
- Retrieval-Augmented Multimodal Language Modeling (RA-CM3)
- UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation
- MSIER: How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning?
- RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model
- UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models
- REVEAL: Retrieval-Augmented Visual-Language Pre-Training With Multi-Source Multimodal Knowledge Memory
- RAVEN: Multitask Retrieval Augmented Vision-Language Learning
- Retrieval-Augmented Egocentric Video Captioning
- Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
- C3Net: Compound Conditioned ControlNet for Multimodal Content Generation
- SPICE:
- SPIDEr:
- Fréchet Audio Distance (FAD), Overall Quality (OVL), and Text Relevance (REL):
- BLEU, METEOR, and ROUGE-L (see the sketch after this list):
- UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation
- RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models
- RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model
- Advanced Embedding Techniques in Multimodal Retrieval Augmented Generation: A Comprehensive Study on Cross Modal AI Applications
- UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models
- AsthmaBot: Multi-modal, Multi-Lingual Retrieval Augmented Generation For Asthma Patient Support
- RAVEN: Multitask Retrieval Augmented Vision-Language Learning
- Iterative Retrieval Augmentation for Multi-Modal Knowledge Integration and Generation
- Retrieval-Augmented Egocentric Video Captioning
- Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
- XL-HeadTags: Leveraging Multimodal Retrieval Augmentation for the Multilingual Generation of News Headlines and Tags
- Fact-Aware Multimodal Retrieval Augmentation for Accurate Medical Radiology Report Generation
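A hedged sketch of sentence-level BLEU and ROUGE-L using the nltk and rouge-score packages (both assumed installed); METEOR is available via nltk.translate.meteor_score but additionally requires the WordNet data download, so it is omitted. The example strings are invented.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "a radiograph showing a hairline fracture of the left wrist"
candidate = "an x-ray showing a hairline fracture in the left wrist"

# Sentence-level BLEU with smoothing (smoothing matters for short hypotheses).
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L F1 via the rouge-score package.
rouge_l = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True) \
    .score(reference, candidate)["rougeL"].fmeasure

print(f"BLEU: {bleu:.3f}  ROUGE-L F1: {rouge_l:.3f}")
```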
- Exact Match (EM) (see the sketch after this list):
- Multi-Level Information Retrieval Augmented Generation for Knowledge-based Visual Question Answering
- MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text
- MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training (RagVL)
- Large Language Models Know What is Key Visual Entity: An LLM-assisted Multimodal Retrieval for VQA
- UniRaG: Unification, Retrieval, and Generation for Multimodal Question Answering With Pre-Trained Language Models (May 2024)
- Self-adaptive Multimodal Retrieval-Augmented Generation
- Iterative Retrieval Augmentation for Multi-Modal Knowledge Integration and Generation
- OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
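A small illustrative Exact Match (EM) check with SQuAD-style answer normalization (lowercasing, removing articles and punctuation, collapsing whitespace); individual papers may apply slightly different normalization rules.

```python
import re
import string

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, strip punctuation/articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, answers: list[str]) -> bool:
    """EM is 1 if the normalized prediction equals any normalized gold answer."""
    return any(normalize(prediction) == normalize(a) for a in answers)

print(exact_match("The Eiffel Tower.", ["eiffel tower"]))  # True
```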
- BERTScore:
- Spearman's Rank Correlation (SRC):
- Average Retrieval Time per Query:
- FLOPs (Floating Point Operations):
- Response Time:
- Execution Time:
- Average Retrieval Number (ARN):
- Clinical Relevance (CR):
- Geodesic Distance:
This README is a work in progress and will be completed soon. Stay tuned for more updates!
If you find our paper or repository useful, please cite the paper:
@misc{abootorabi2025askmodalitycomprehensivesurvey,
title={Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation},
author={Mohammad Mahdi Abootorabi and Amirhosein Zobeiri and Mahdi Dehghani and Mohammadali Mohammadkhani and Bardia Mohammadi and Omid Ghahroodi and Mahdieh Soleymani Baghshah and Ehsaneddin Asgari},
year={2025},
eprint={2502.08826},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.08826},
}
If you have questions, please send an email to mahdi.abootorabi2@gmail.com.