Surveying the MLLM Landscape: A Meta-Review of Current Surveys

by Ming Li, Keyu Chen, and Ziqian Bi (https://arxiv.org/abs/2409.18991)


Abstract

Multimodal Large Language Models (MLLMs)

Rise of MLLMs:

  • Transformative force in AI field
  • Enabling machines to process/generate content across modalities (text, images, audio, video)
  • Significant advancement over unimodal systems

Applications of MLLMs:

  • Autonomous agents
  • Medical diagnostics
  • Enable a more holistic understanding of information, mimicking human perception

Importance of Performance Evaluation:

  • As capabilities of MLLMs expand
  • Comprehensive and accurate evaluation becomes critical

Survey Objectives:

  • Provide systematic review of benchmark tests/evaluation methods for MLLMs
  • Cover key topics:
    • Foundational concepts
    • Applications
    • Evaluation methodologies
    • Ethical concerns
    • Security
    • Efficiency
    • Domain-specific applications

Summary of Existing Literature:

  • Classify and analyze existing surveys on MLLMs
  • Conduct comparative analysis
  • Examine impact within the academic community

Emerging Trends and Underexplored Areas:

  • Identify emerging trends in MLLM research
  • Propose potential directions for future studies

Survey Intended Usage:

  • Offer researchers/practitioners a comprehensive understanding of current state of MLLM evaluation
  • Facilitate further progress in the rapidly evolving field

1 Introduction

Introduction:

  • Purpose of Survey: synthesize key insights from existing MLLM surveys, organize them into 11 core areas
  • Goals: provide structured overview of critical advancements, identify trends and challenges, suggest future research directions

Background:

  • Multimodal Large Language Models (MLLMs) allow processing diverse modalities, enabling comprehensive applications
  • Rapid development, leading to a wealth of surveys on specific aspects
  • Complexity and volume of literature make it difficult for researchers and practitioners to grasp the field's state

Approach:

  • Synthesize findings from various surveys
  • Identify major themes, trends, and challenges in MLLM evaluation
  • Examine current methodologies for application and assessment
  • Highlight future research directions and underexplored areas within the MLLM landscape.

2 Classical Natural Language Processing Methods

Natural Language Processing (NLP) Development Stages

  • Statistical and rule-based methods: Early focus of NLP research
    • Markov process used for sequence data handling
    • Hidden Markov Models (HMMs) introduced for continuous speech recognition
    • N-gram models established simple yet effective foundation for language models
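
As a concrete reminder of how these count-based models work, here is a minimal bigram language model with add-one smoothing; the two-sentence corpus is hypothetical and not taken from the survey.

```python
# Minimal bigram language model with add-one (Laplace) smoothing.
# Illustrative only; real n-gram LMs use large corpora and better smoothing
# schemes (e.g., Kneser-Ney).
from collections import Counter

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
tokens = [["<s>"] + s.split() + ["</s>"] for s in corpus]

unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1))
vocab_size = len(unigrams)

def bigram_prob(prev, word):
    """P(word | prev) with add-one smoothing."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

print(bigram_prob("the", "cat"))   # seen bigram: 2/13 ≈ 0.154
print(bigram_prob("cat", "dog"))   # unseen bigram: 1/10 = 0.1
```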

Neural Networks and Deep Learning

  • Long Short-Term Memory (LSTM) networks: Proposed in 1997 to handle long sequential data
  • Convolutional Neural Networks (CNNs): Applied to document recognition in 1998 (LeCun et al.'s LeNet)

Word Embeddings

  • Word2Vec model: Introduced in 2013, capturing semantic relationships between words
    • Skip-gram and CBOW models improved quality of word vectors

Deep Learning Advancements

  • ImageNet launched for object recognition and image classification tasks (2009)
    • Abundant data resources enabled training deep neural networks on large datasets
  • AlexNet: Successful in ImageNet competition, improving image classification performance (2012)

Sequence-to-Sequence Learning (Seq2Seq) Model

  • Proposed for machine translation using an encoder-decoder structure (2014)

Attention Mechanism

  • Introduced by Bahdanau et al. in 2015 to dynamically align source and target sequences
  • Improved machine translation performance

Byte Pair Encoding (BPE) Algorithm

  • Proposed to handle rare and out-of-vocabulary words through subword units (2016)
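
A toy sketch of the BPE merge loop described above, in the style of Sennrich et al.'s algorithm: repeatedly merge the most frequent adjacent symbol pair. The word frequencies below are made up for illustration.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

# Words as space-separated symbols with an end-of-word marker; counts are hypothetical.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for step in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(step, best)   # ('e', 's') first, then ('es', 't'), ('est', '</w>'), ...
```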

ELMo Model

  • Introduced in 2018, capturing contextual information through bidirectional LSTM networks
  • Improved the quality of word representations by capturing lexical semantics and syntactic information.

3 Large Language Models (LLM) and Multimodal Large Language Models (MLLM)

3.1 Transformer, BERT, and GPT

Transformer, BERT, and GPT:

  • Foundation models of large language models
  • Play a crucial role in the field of natural language processing (NLP)
  • Drive development of numerous applications and research advancements

3.2 Transformer

Transformer Model

Components:

  • Encoder: self-attention mechanism, feed-forward neural network
  • Decoder: self-attention mechanism, encoder-decoder attention mechanism, feed-forward neural network

Self-Attention Mechanism:

  • Operates on query matrix Q, key matrix K, value matrix V
  • Computation: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Multi-Head Attention Mechanism:

  • Compute multiple self-attention heads in parallel and concatenate results
  • Each head computed as: head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)

Feed-Forward Neural Network:

  • Consists of two linear transformations and a ReLU activation function
  • Helps transform input data into more abstract representation before passing to next layer
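
A minimal NumPy sketch of the three pieces above: scaled dot-product attention, multi-head attention, and the position-wise feed-forward network. The shapes, random weights, and omission of residual connections and layer normalization are simplifications, not a full Transformer block.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, per head."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)     # (heads, seq, seq)
    return softmax(scores) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Split the model dimension into heads, attend in parallel, concatenate."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    def split(W):
        return (X @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(W_q), split(W_k), split(W_v)
    heads = attention(Q, K, V)                            # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

def feed_forward(X, W1, b1, W2, b2):
    """Two linear transformations with a ReLU in between."""
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, n_heads, d_ff = 4, 8, 2, 16
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
out = feed_forward(out, rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
                   rng.normal(size=(d_ff, d_model)), np.zeros(d_model))
print(out.shape)   # (4, 8)
```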

Transformer Variants:

  • BERT (Bidirectional Encoder Representations from Transformers)
    • Generates contextual representations of words using bidirectional encoder
    • Unsupervised pre-training on large corpora, followed by fine-tuning for specific tasks
    • Pre-training tasks: Masked Language Model (MLM), Next Sentence Prediction (NSP)
  • GPT (Generative Pretrained Transformer)
    • Mainly uses unidirectional decoder to generate text
    • Unsupervised pre-training on large corpora, followed by fine-tuning for specific tasks
    • Training consists of two stages: pre-training and fine-tuning
    • Pre-training uses autoregressive method to predict next word, learning sequence information of words.
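
To make the two pre-training objectives concrete, here is a small sketch contrasting GPT-style next-token targets with BERT-style masked-token targets. The token IDs, the mask ID, and the -100 ignore index are illustrative conventions, not details from the survey.

```python
import numpy as np

tokens = np.array([12, 7, 93, 4, 55, 8])    # one toy sequence of token IDs
MASK_ID = 103                                # assumed mask-token id

# GPT-style autoregressive pre-training: predict token t+1 from tokens <= t.
gpt_inputs  = tokens[:-1]                    # [12, 7, 93, 4, 55]
gpt_targets = tokens[1:]                     # [ 7, 93, 4, 55, 8]

# BERT-style masked language modeling: mask ~15% of positions, predict originals.
rng = np.random.default_rng(0)
mask = rng.random(tokens.shape) < 0.15
bert_inputs = np.where(mask, MASK_ID, tokens)
bert_targets = np.where(mask, tokens, -100)  # -100: common "ignore" convention in the loss

print(gpt_inputs, gpt_targets)
print(bert_inputs, bert_targets)
```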

3.3 Multimodal Large Language Models (MLLM)

Multimodal Large Language Models (MLLM)

Overview:

  • Significant advancement in AI field
  • Integrates multiple modalities: text, images, audio
  • Enhances versatility and applicability to various tasks
  • Core challenge: effective modality alignment

Modality Alignment:

  • Achieved through mapping different types of data into a common representation space
  • Improves performance on image captioning, visual question answering, multimodal translation
  • Enables more natural and comprehensive human-computer interactions

Early Methods:

  • Manually labeled data: costly and inefficient
  • Examples: DeViSE, VSE, CCA
  • Performance limited due to restricted dataset size

Contrastive Learning-based Multimodal Alignment:

  • Bring similar sample pairs closer; push dissimilar pairs apart in shared representation space
  • Improved performance through large unlabeled image datasets and contrastive loss functions

CLIP (Contrastive Language-Image Pre-Training):

  • Major breakthrough in multimodal data alignment
  • Leverages large-scale dataset of image-text pairs
  • Employs a contrastive loss function for robust text-image alignment
  • Encodes images and texts into common representation space: high similarity for matched pairs, low similarity for mismatched pairs
  • Pretraining on hundreds of millions of image-text pairs enables rich cross-modal representations
  • Strong zero-shot learning capabilities, can be directly used for feature extraction and applications without fine-tuning
  • Text and image encoders can be used independently by downstream models such as METER, SimVLM, BLIP-2, and the Stable Diffusion series.
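
A minimal sketch of a CLIP-style symmetric contrastive loss over a batch of image-text pairs. The random features stand in for the actual image and text encoders, and the temperature value is illustrative.

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

rng = np.random.default_rng(0)
batch, dim = 8, 32
image_feats = rng.normal(size=(batch, dim))   # stand-in for the image encoder
text_feats = rng.normal(size=(batch, dim))    # stand-in for the text encoder

# L2-normalize and compute the temperature-scaled similarity matrix.
image_feats /= np.linalg.norm(image_feats, axis=1, keepdims=True)
text_feats /= np.linalg.norm(text_feats, axis=1, keepdims=True)
temperature = 0.07
logits = image_feats @ text_feats.T / temperature   # (batch, batch)

# Matched pairs sit on the diagonal; cross-entropy is applied in both directions.
labels = np.arange(batch)
loss_i2t = -log_softmax(logits, axis=1)[labels, labels].mean()
loss_t2i = -log_softmax(logits, axis=0)[labels, labels].mean()
loss = (loss_i2t + loss_t2i) / 2
print(loss)
```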

Model-Based Techniques for Multimodal Alignment

ALIGN (A Large-scale ImaGe and Noisy-text embedding):

  • Employs a dual-encoder architecture that aligns images and text in a shared embedding space using contrastive learning
  • Significantly improves zero-shot image classification and retrieval

Florence:

  • Incorporates a vision transformer model with a text encoder to align image and text modalities
  • Demonstrates superior performance in visual understanding tasks

Model-Based Approaches:

  • Focus on pretraining using vast amounts of weakly labeled image-text data
  • Contribute to better generalization across a range of vision-language benchmarks

UNITER (UNiversal Image-TExt Representation):

  • Jointly learns image-text alignments using a transformer-based architecture
  • Introduces a unique masked language modeling task to improve contextual understanding between modalities

ViLT (Vision-and-Language Transformer):

  • Reduces the complexity of multimodal fusion by directly feeding visual patch embeddings into a transformer
  • Bypasses the need for separate image encoders while maintaining competitive performance in tasks like image-text retrieval and visual question answering

Model-Based Techniques:

  • Highlight the versatility and scalability of transformer-based approaches in multimodal alignment tasks
  • Address both high-level semantic matching and fine-grained object-level alignment

4 Taxonomy and Categorization

4.1 Methodology

Survey Synthesis Paper on the Latest Multimodal Large Language Model (MLLM) Research

Methodology

  • This paper consolidates findings from 58 of the most recent and advanced surveys in the MLLM domain
  • Surveys are categorized into key themes: model architectures, evaluation, applications, security, bias, future directions.
  • Selection criteria: recency and broad coverage of MLLM domain

Analysis of Each Survey

  • Technical focus: architectures, models, datasets
  • Applications: computer vision, healthcare, robotics, etc.
  • Security and biases: model safety, fairness, robustness.
  • Emerging trends: future directions, new paradigms.

4.2 Applications and Agents

Applications and Agents

Legal Domain:

  • LLMs used for tasks like legal advice, document processing, and judicial assistance
  • Specialized LLMs, such as LawGPT and LexiLaw, address legal language and reasoning nuances
  • Challenges: Maintaining judicial independence, ethical implications of biased data in legal decision-making

Autonomous Driving:

  • MLLMs used to improve perception, decision-making, human-vehicle interaction
  • Key trend: Integration of multimodal data (LiDAR, maps, images) enhances vehicle's ability to process complex driving environments
  • Challenges: Real-time data processing, safety in diverse driving conditions

Mathematics:

  • LLMs used for mathematical tasks like calculation and reasoning
  • Techniques like Chain-of-Thought (CoT) prompting improve model performance (see the toy prompt sketch below)
  • Scarcity of high-quality datasets and the complexity of mathematical reasoning remain ongoing challenges
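
A toy illustration of CoT prompting for a math word problem: the exemplar spells out intermediate reasoning steps before the answer. The prompt text is made up, and the model call is left abstract.

```python
# Standard prompting vs. Chain-of-Thought (CoT) prompting for a toy word problem.
question = "A farm has 12 cows and buys 7 more, then sells 5. How many cows remain?"

standard_prompt = f"Q: {question}\nA:"

cot_prompt = (
    f"Q: {question}\n"
    "A: Let's think step by step. "
    "Start with 12 cows. Buying 7 more gives 12 + 7 = 19. "
    "Selling 5 leaves 19 - 5 = 14. The answer is 14.\n"
    "Q: A library has 30 books, lends out 9, and receives 4 donations. "
    "How many books are on the shelves?\n"
    "A: Let's think step by step."
)

# answer = call_llm(cot_prompt)   # hypothetical model call, substitute any LLM API
print(cot_prompt)
```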

Healthcare:

  • Multimodal learning used for image fusion, report generation, cross-modal retrieval
  • Rise of foundation models like GPT-4 and CLIP in processing medical data
  • Advancements not yet achieved universal intelligence; concerns related to data integration, ethical considerations remain barriers

Robotics:

  • LLMs used to improve perception, decision-making, control in robotic systems
  • Key trend: Potential for LLMs to advance embodied intelligence, where robots understand and interact with the physical world
  • Challenges: Real-time perception, control, integration with existing technologies

Multilingualism:

  • LLMs can process multiple languages, but challenges remain for low-resource languages and security issues in multilingual models
  • Techniques like multilingual Chain-of-Thought reasoning show promise for future development

Gaming:

  • LLMs used to generate dynamic Non-Player Characters (NPCs), enhance player interaction, assist in game design
  • Challenge: Hallucinations, where LLMs generate plausible but incorrect outputs, remain a major limitation
  • Improving context management and memory within gaming systems is a future research priority

Audio Processing:

  • LLMs used for Automatic Speech Recognition (ASR) and music generation
  • Key trend: Integration of multimodal data from speech, music, and environmental sounds into a single model
  • Challenges: Scalability, data diversity

Video Understanding:

  • Vid-LLMs combine video, text, and audio inputs to analyze and understand video content
  • Promising for tasks like video summarization and captioning
  • Challenges: Processing long videos, maintaining contextual coherence

Citation Analysis:

  • LLMs significantly improve citation recommendation, classification, and summarization tasks
  • Citation data enhances LLM performance by enabling multi-hop knowledge reasoning across documents
  • Future research needs to address the expansion of citation networks and integration of non-academic sources.

4.3 Evaluation and Benchmarks

Evaluation and Benchmarks of Multimodal Large Language Models (MLLMs)

Importance: Evaluation and benchmarking are crucial for understanding MLLM performance across a diverse range of tasks and datasets.

Categorizing Evaluations

  • Perception and Understanding: Interpreting multimodal inputs, integrating information across modalities (VQA datasets like VQAv2).
  • Cognition and Reasoning: Logical reasoning, problem-solving, multimodal reasoning (complex visual question answering).
  • Robustness and Safety: Performing under adversarial prompts or out-of-distribution data. Managing hallucinations and biases.
  • Domain-Specific Capabilities: Medical image interpretation, legal text analysis (TallyQA).
  • Task-Specific Benchmarks: Traditional vs. advanced datasets (VQAv2 vs. TDIUC).
  • Multimodal Reasoning and Interaction: Evaluating interaction between modalities (multimodal dialogue, visual question answering).
  • Ethical and Societal Implications: Ensuring fairness and avoiding societal biases. Ensuring trustworthiness and safety (handling uncertainty, avoiding hallucinations).

Micro vs. Macro Performance: Analyze both micro performance (equal weight) and macro performance (averaged) for a clearer understanding of model strengths and weaknesses.
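
A small sketch of the micro/macro distinction on a hypothetical per-question-type breakdown: micro weights every example equally, while macro averages the per-type accuracies, so rare question types count as much as common ones.

```python
# Counts below are hypothetical: (correct, total) per question type.
per_type = {
    "counting": (40, 100),
    "color": (90, 100),
    "spatial reasoning": (5, 10),
}

micro = sum(c for c, _ in per_type.values()) / sum(t for _, t in per_type.values())
macro = sum(c / t for c, t in per_type.values()) / len(per_type)
print(f"micro = {micro:.3f}, macro = {macro:.3f}")   # micro = 0.643, macro = 0.600
```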

Generalization Across Tasks: Evaluate models not only on specific tasks but also on their ability to generalize across different question types.

4.4 Efficiency and Adaptation

MLLM Surveys: Efficiency and Adaptation

Efficiency Approaches for MLLMs:

  • Xu et al. [62] and Jin et al. [63] emphasize the need for accessibility in resource-constrained environments
  • Comprehensive taxonomies cover:
    • Advancements in architectures
    • Vision-language integration
    • Training methods
    • Benchmarks that optimize MLLM efficiency
  • Key methods discussed:
    • Vision token compression
    • Parameter-efficient fine-tuning
    • Transformer-alternative models: Mixture of Experts (MoE), state space models
  • Balancing computational efficiency with task performance is the goal

Adapting MLLMs to Specific Tasks:

  • Liu et al. [64] address adapting MLLMs to specific tasks with limited labeled data
  • Categorizes approaches into:
    • Prompt-based methods
    • Adapter-based methods
    • External knowledge-based methods
  • These approaches help models generalize better in fine-grained domains
  • Few-shot adaptation techniques: visual prompt tuning, adapter fine-tuning are critical for extending the usability of large multimodal models
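
A minimal PyTorch sketch of the adapter-based idea: a small bottleneck module added to a frozen pretrained layer so that only the adapter weights are trained. The module sizes and the stand-in backbone layer are illustrative, not a specific method from the surveys.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, x):
        # Residual connection keeps the frozen backbone's behavior as a baseline.
        return x + self.up(self.act(self.down(x)))

class FrozenBlockWithAdapter(nn.Module):
    def __init__(self, block: nn.Module, d_model: int):
        super().__init__()
        self.block = block
        self.adapter = Adapter(d_model)
        for p in self.block.parameters():      # freeze the pretrained weights
            p.requires_grad = False

    def forward(self, x):
        return self.adapter(self.block(x))

backbone_block = nn.Linear(64, 64)             # stand-in for a pretrained layer
layer = FrozenBlockWithAdapter(backbone_block, d_model=64)
trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
print(trainable)                               # only adapter.* parameters remain trainable
```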

Challenges Ahead:

  • Domain adaptation, model selection, and integration of external knowledge remain ongoing challenges
  • Future: MLLMs become more efficient, flexible, and adaptive in handling diverse real-world tasks.

4.5 Data-centric

Bai et al.'s Survey on Data-Centric Approach for Multimodal Large Language Models (MLLM)

Importance of Quality, Diversity, and Volume of Multimodal Data:

  • Emphasizes the significance of high-quality, diverse, and abundant multimodal data (text, images, audio, video) in training MLLMs effectively.

Challenges in Multimodal Data Collection:

  • Identifies issues: data sparsity and noise
  • Potential solutions: synthetic data generation and active learning to mitigate these challenges.

Advocating for a More Data-Centric Approach:

  • Refinement and curation of data take precedence over model development.
  • Improves model performance and advances MLLM capabilities in complex landscapes.

Yang et al.'s Research on Large Language Models (LLMs) and Code:

Role of Code in LLMs:

  • Enhances LLMs’ capacity to function as intelligent agents through automation, code generation, software development tasks.

Challenges Around Code Integration:

  • Correctness: ensuring the accuracy of generated or manipulated code.
  • Efficiency: optimizing resources for large-scale code processing.
  • Security: protecting sensitive data during model training and deployment.

Potential of LLMs to Become Autonomous Agents:

  • Managing complex tasks across various domains.
  • Broadening the scope and impact of LLMs in AI-driven innovations.

4.6 Continual Learning

Large Language Models (LLMs) Integration with External Knowledge

Integration Methods:

  • Knowledge editing: Modifying input or model to update outdated/incorrect information
  • Retrieval augmentation: Fetching external information during inference without altering core parameters

Approaches and Taxonomy:

  • Categorizes integration methods into knowledge editing and retrieval augmentation
  • Benchmarks for evaluation, applications: LangChain, ChatDoctor
  • Handling knowledge conflicts, future research directions

Continual Learning (CL) of LLMs:

  • Shi et al. survey: Extensive overview of CL challenges and methodologies in the field
  • Two primary directions: vertical continual learning (adapting from general to domain-specific knowledge), horizontal continual learning (adapting over time across evolving domains)
  • Problems: "catastrophic forgetting" (models lose previously learned knowledge when trained on new tasks), maintaining performance on both old and new tasks

CL Methods:

  • Continual pre-training
  • Domain-adaptive pre-training
  • Continual fine-tuning

Mitigating Forgetting:

  • Replay-based methods: Reusing samples from previous tasks (see the sketch after this list)
  • Regularization-based methods: Adding constraints to prevent forgetting
  • Architecture-based methods: Designing model structures that improve knowledge retention
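
A minimal sketch of the replay-based strategy: a small buffer of past-task examples is mixed into each new-task batch. The toy model, data, and buffer policy are assumptions for illustration only.

```python
import random
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                       # stand-in for a much larger model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

replay_buffer = []                             # (x, y) pairs from past tasks
BUFFER_SIZE = 100

def train_step(x_new, y_new):
    # Mix current-task data with replayed samples from earlier tasks.
    if replay_buffer:
        x_old, y_old = zip(*random.sample(replay_buffer, min(8, len(replay_buffer))))
        x = torch.cat([x_new, torch.stack(x_old)])
        y = torch.cat([y_new, torch.stack(y_old)])
    else:
        x, y = x_new, y_new
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Simple random-replacement update of the buffer with current-task examples.
    for xi, yi in zip(x_new, y_new):
        if len(replay_buffer) < BUFFER_SIZE:
            replay_buffer.append((xi, yi))
        else:
            replay_buffer[random.randrange(BUFFER_SIZE)] = (xi, yi)

x_batch = torch.randn(16, 10)
y_batch = torch.randint(0, 2, (16,))
train_step(x_batch, y_batch)
print(len(replay_buffer))   # 16
```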

Call for Research:

  • Evaluation benchmarks and methodologies to counter forgetting
  • Supporting knowledge transfer in LLMs

4.7 Evaluation Benchmarks

Multimodal Large Language Models (MLLMs) Evaluation Benchmarks

Challenges of Evaluating MLLMs:

  • Lu et al. [55] discuss the complexities of evaluating MLLMs, focusing on tasks like cross-modal retrieval, caption generation, and Visual Question Answering (VQA)
  • Current evaluation frameworks fail to capture the nuances of multimodal interactions

Need for Tailored Evaluation Metrics:

  • Li and Lu [69] argue the absence of standardized protocols across different modalities hinders fair comparison of models
  • They call for establishment of consistent evaluation frameworks that ensure reliable and equitable performance assessment

Evaluation Methodologies for Large Language Models (LLMs):

  • Chang et al. [57] examine evaluation methodologies for LLMs, emphasizing the importance of moving beyond task-specific performance metrics
  • They point out existing benchmarks overlook critical issues like hallucinations, fairness, and societal implications
  • The authors advocate for more holistic evaluation practices that consider both technical capabilities and trustworthiness, robustness, and ethical alignment in real-world applications.

4.8 Agents and Autonomous Systems

Autonomous Agents Leveraging Large Language Models (LLMs)

Components of Autonomous Agents:

  • Perception: Agents perceive their environment
  • Memory: Agents recall previous interactions
  • Planning: Agents plan actions in real-time
  • Action: Agents execute actions
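
A schematic sketch of this perceive-remember-plan-act loop. Every component below is a stub; a real agent would back perception and planning with an MLLM and execute actions through tools or actuators.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    memory: list = field(default_factory=list)

    def perceive(self, observation):
        """Turn a raw (possibly multimodal) observation into a summary."""
        return f"observed: {observation}"

    def plan(self, percept):
        """Choose the next action from the percept and remembered context."""
        context = " | ".join(self.memory[-3:])        # short-term memory window
        return f"action for [{percept}] given [{context}]"

    def act(self, action):
        """Execute the action and return feedback from the environment."""
        return f"result of {action}"

    def step(self, observation):
        percept = self.perceive(observation)
        action = self.plan(percept)
        result = self.act(action)
        self.memory.append(f"{percept} -> {action} -> {result}")
        return result

agent = Agent()
for obs in ["image of a red light", "pedestrian crossing"]:
    print(agent.step(obs))
```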

Benefits of this Architecture:

  • Highly adaptable across various domains (e.g., digital assistants, autonomous vehicles)
  • Incorporating multimodal inputs expands the perception-action loop

Challenges:

  • Knowledge boundaries: Agents are constrained in specialized or underexplored domains
  • Prompt robustness: Even minor prompt changes can lead to unpredictable behavior, including hallucinations
  • Catastrophic forgetting: Agents lose previously acquired knowledge after being updated with new information

Applications of LLM-based Agents:

  • Single agents, multi-agent systems, and human-agent interactions
  • Social sciences, natural sciences, and engineering applications
  • Robotics and embodied AI, where agents make real-time decisions based on multimodal inputs
  • Scientific research automation (e.g., experiment design, planning, execution)

Future Research:

  • Develop more robust multimodal perception systems
  • Improve LLM inference efficiency
  • Establish ethical frameworks to guide agent decision-making
  • Enhance memory integration and refine prompt design

4.9 MLLMs in Graph Learning

Multimodal Large Language Models (MLLMs) in Graph Learning:

Applications:

  • Increasingly used for graph learning tasks
  • Outperform traditional Graph Neural Networks (GNNs)
  • Enhance representational power of GNNs
    • Improved performance: classification, prediction, reasoning tasks

Integration with Graphs:

  1. Enhancers: Jin et al.'s category for MLLMs that enrich graph data, e.g., by encoding textual node attributes, to strengthen graph representations [74] (see the sketch below)
  2. Predictors: MLLMs used to predict graph node properties or relationships between nodes [75, 76]
  3. Alignment Components: MLLMs integrated with graphs to align different data modalities [74]
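
A minimal sketch of the enhancer pattern: node text is embedded (here by a stand-in encoder rather than an actual LLM) and passed through one normalized graph convolution. The graph, node texts, and sizes are hypothetical.

```python
import numpy as np

node_text = ["paper on contrastive learning",
             "paper on graph neural networks",
             "survey of multimodal models"]

def text_embed(texts, dim=16):
    """Stand-in for an LLM/text-encoder embedding of each node's attributes."""
    rng = np.random.default_rng(0)
    vocab = {}
    out = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        for w in t.split():
            if w not in vocab:
                vocab[w] = rng.normal(size=dim)
            out[i] += vocab[w]
    return out / np.linalg.norm(out, axis=1, keepdims=True)

A = np.array([[0, 1, 1],          # citation/co-citation edges (hypothetical)
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
A_hat = A + np.eye(3)                         # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt      # symmetric normalization

H = text_embed(node_text)                     # text-derived node features
W = np.random.default_rng(1).normal(size=(16, 8))
H_next = np.maximum(0, A_norm @ H @ W)        # one GCN layer: ReLU(Â H W)
print(H_next.shape)                           # (3, 8)
```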

Advantages:

  • Utilize textual attributes and other modalities
  • Improve performance in various graph tasks

4.10 Retrieval-Augmented Generation (RAG) in MLLM

Retrieval-Augmented Generation (RAG) in Multimodal Large Language Models (MLLM)

Background:

  • MLLMs: remarkable capabilities across modalities like text, images, and audio
  • Limitations: reliance on static training data hinders accurate responses in rapidly changing contexts

Solution: Retrieval-Augmented Generation (RAG)

  • Dynamically retrieves relevant external information before generation process
  • Incorporates real-time and contextually accurate information
  • Enhances factuality and robustness of MLLM outputs
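
A minimal retrieve-then-generate sketch of this idea: embed the query, select the top-k most similar documents, and prepend them to the prompt. The hash-based embedding and the call_mllm function are stand-ins, not a real retriever or model API.

```python
import numpy as np

documents = [
    "The 2024 tournament final was postponed due to rain.",
    "CLIP aligns images and text in a shared embedding space.",
    "The city opened a new metro line last month.",
]

def embed(text, dim=32):
    """Stand-in embedding: hash words into a fixed-size vector."""
    vec = np.zeros(dim)
    for w in text.lower().split():
        vec[hash(w) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

doc_vecs = np.stack([embed(d) for d in documents])

def retrieve(query, k=2):
    """Return the k documents most similar to the query."""
    sims = doc_vecs @ embed(query)
    return [documents[i] for i in np.argsort(-sims)[:k]]

query = "Why was the tournament final delayed?"
context = "\n".join(retrieve(query))
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
# answer = call_mllm(prompt)   # hypothetical generation call
print(prompt)
```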

Benefits:

  • Improves knowledge richness in natural language processing tasks
  • Mitigates long-tail knowledge gaps and hallucination issues
  • Boosts diversity and robustness by leveraging multimodal data sources like images and videos
  • Generates more accurate and contextually grounded outputs, especially for cross-modal reasoning tasks such as visual question answering and complex dialogue generation.

Applications:

  • Surveying RAG's application in natural language processing (NLP) [77]
    • Effectively mitigates long-tail knowledge gaps and hallucination issues in NLP tasks
  • Multimodal information retrieval augments generation [79]
    • Improves diversity and robustness by leveraging multimodal data sources
    • Powers MLLMs to generate more accurate outputs.

5 Major Themes Emerging from Surveys

Key Themes in Multimodal Large Language Models (MLLMs)


  • Architectures of MLLMs: Transformer-based models dominate, with innovations like CLIP, DALL-E, and Flamingo exemplifying progress in aligning text and visual data. Comparison between early fusion (integrating modalities early) and late fusion strategies.
  • Datasets and Training: Massive, multimodal datasets such as MS-COCO, Visual Genome, and custom-curated sets like LAION. Pretraining on large-scale datasets remains a core strategy; surveys offer comprehensive taxonomy of pretraining methodologies.
  • Evaluation and Metrics: Evaluation challenges due to limitations of traditional language or vision metrics. Cross-modal retrieval, image captioning, and visual question answering (VQA) are popular benchmarks. Surveys discuss new methods for evaluating multimodal coherence and reasoning.

5.3 Security: Adversarial Attacks

  • Susceptibility to attacks: MLLMs are susceptible to adversarial attacks that exploit weaknesses in visual inputs, leading to incorrect or harmful text generation.
  • Types of attacks: Jailbreaking attacks that bypass safety alignment mechanisms, prompt injection, and white-box attacks utilizing gradient information to craft adversarial examples.
  • Cross-modal perturbations: Attacks exploit the model's difficulty in handling cross-modal perturbations, making malicious inputs hard to detect.
  • Defense mechanisms: Improving cross-modal alignment algorithms and developing new defense mechanisms to enhance robustness across modalities. Addressing vulnerabilities in the cross-modal alignment process.

5.3.1 Security: Adversarial Attacks - Hallucinations and Data Bias

  • Hallucination phenomenon: Deviation of generated outputs from visual input, leading to fabrication or misrepresentation of objects or relationships. Often a byproduct of data bias in training sets and limitations of vision encoders.
  • Causes of hallucinations: Memorization of biases present in training datasets, reliance on noisy or incomplete data, inherent limitations of cross-modal alignment mechanisms.
  • Impact of hallucinations: Misleads users and exacerbates existing biases in models' outputs, leading to incorrect assumptions about objects' existence, attributes, or relationships.
  • Addressing hallucinations: Improve modality alignment mechanisms, enhance vision encoder capabilities, employ specialized evaluation benchmarks and techniques like counterfactual data augmentation and post-processing corrections.

5.3.2 Security: Adversarial Attacks - Data Bias (cont'd)

  • Impact on fairness: Hallucinations can reinforce existing biases in models' outputs, leading to incorrect assumptions about objects' existence, attributes, or relationships.
  • Addressing bias: Enhancing the capabilities of vision encoders, ensuring language models respect visual context constraints, and employing post-processing corrections, counterfactual data augmentation, and diverse visual instructions to generalize better to unseen scenarios.

5.3.3 Fairness: Adversarial Training and Human Feedback

  • Defense strategies: Employ adversarial training, data augmentation, safety steering vectors, and multimodal input validation to strengthen robustness against adversarial inputs (a minimal adversarial-training sketch follows this list). Leverage Reinforcement Learning from Human Feedback (RLHF) to adjust outputs based on user feedback and maintain fairness by incorporating diverse viewpoints and mitigating biases present in training data.
  • Addressing jailbreaking attacks: Use prompt tuning and gradient-based defenses to reduce the success rate of jailbreaking attacks, ensuring that the model remains secure and aligned with ethical guidelines.
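
A minimal PyTorch sketch of adversarial training with an FGSM-style perturbation of the visual input, as one concrete instance of the defenses listed above. The toy model, data, and perturbation budget are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2))  # toy "vision" model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
epsilon = 2.0 / 255                                              # perturbation budget

def adversarial_training_step(images, labels):
    # 1) Craft adversarial examples with one FGSM step on the input.
    images_adv = images.clone().detach().requires_grad_(True)
    loss = loss_fn(model(images_adv), labels)
    loss.backward()
    with torch.no_grad():
        images_adv = (images_adv + epsilon * images_adv.grad.sign()).clamp(0, 1)
    # 2) Train the model on the perturbed examples.
    optimizer.zero_grad()
    adv_loss = loss_fn(model(images_adv), labels)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()

images = torch.rand(8, 3, 32, 32)
labels = torch.randint(0, 2, (8,))
print(adversarial_training_step(images, labels))
```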

6 Emerging Trends and Gaps

Key Trends in Multimodal Large Language Models (MLLM)

  1. Increased Integration of Multimodality:
    • Combination of various data inputs (text, images, audio) for a holistic understanding
    • Transformative leap from unimodal to multimodal systems
    • Focus on achieving complex, multimodal understanding in applications like autonomous agents and medical diagnostics
  2. Expansion into Diverse Domains:
    • MLLMs making significant impacts across industries
    • Growing trend towards domain-specific adaptations of these models
  3. Evaluation Metrics and Benchmarking:
    • Development of new benchmarks for performance assessment
    • Critical need for robust evaluation metrics and standardized assessments
  4. Ethical and Security Considerations:
    • Growing concerns over ethical implications and security risks associated with MLLMs
    • Importance of responsible development and deployment of these models
  5. Efficiency and Optimization:
    • Focus on making MLLMs more accessible and scalable through efficiency improvements

Research Gaps

  1. Integration of Lesser-Known Modalities: haptic feedback, olfactory data, advanced sensory inputs
  2. Longitudinal Studies on Model Performance: understanding the long-term viability and scalability of MLLMs
  3. Cross-Domain Applications and Transfer Learning: effectiveness of transfer learning in different domains; generalization capabilities of MLLMs
  4. Ethical Implications in Non-Textual Modalities: addressing ethical challenges unique to images, video, etc.
  5. Impact of Multilingualism on Multimodality: intersection of multilingualism and multimodality, leading to more inclusive MLLMs.

7 Conclusion

Key Insights on Multimodal Large Language Models (MLLMs)

  • Multimodal capabilities enhance performance: MLLMs can process data from multiple sources, such as text, images, and videos, improving intelligence in perception and reasoning.
  • Increased size and training data lead to improved intelligence: As models grow larger and training datasets expand, MLLMs become more intelligent but require greater computational resources.
  • Challenges remain in robustness, interpretability, and fairness: Practical applications of MLLMs pose challenges related to robustness against various forms of input manipulation, ensuring explainable behavior, and addressing biases.
  • MLLMs expand potential applications across industries: Rapid technological advancements drive the growth of multimodal AI solutions in numerous sectors.

Future Directions for Surveys on MLLMs

  • Emerging areas: Explore new trends arising from integration with generative AI and self-supervised learning technologies.
  • Data diversity and challenges: Deepen discussions on managing complex multimodal data through construction, annotation, and management of large-scale datasets.
  • Model evaluation and standardization: Systematically analyze evaluation standards, performance metrics, robustness, and fairness of MLLMs in various tasks and domains.
  • Real-world applications and ethics: Examine real-world applications of MLLMs, addressing issues related to privacy, security, ethical considerations, and balancing innovation with risks.
  • Optimization of computational resources: Further explore efficient use of resources and model compression techniques for MLLM applications in resource-constrained environments.