by Ming Li, Keyu Chen, Ziqian Bi https://arxiv.org/abs/2409.18991
- Abstract
- 1 Introduction
- 2 Classical Natural Language Processing Methods
- 3 Large Language Models (LLM) and Multimodal Large Language Models (MLLM)
- 4 Taxonomy and Categorization
- 5 Major Themes Emerging from Surveys
- 6 Emerging Trends and Gaps
- 7 Conclusion
Multimodal Large Language Models (MLLMs)
Rise of MLLMs:
- Transformative force in AI field
- Enabling machines to process/generate content across modalities (text, images, audio, video)
- Significant advancement over unimodal systems
Applications of MLLMs:
- Autonomous agents
- Medical diagnostics
- Achieve more holistic understanding of information, mimicking human perception
Importance of Performance Evaluation:
- As capabilities of MLLMs expand
- Comprehensive and accurate evaluation becomes critical
Survey Objectives:
- Provide systematic review of benchmark tests/evaluation methods for MLLMs
- Cover key topics:
- Foundational concepts
- Applications
- Evaluation methodologies
- Ethical concerns
- Security
- Efficiency
- Domain-specific applications
Summary of Existing Literature:
- Classify and analyze existing surveys on MLLMs
- Conduct comparative analysis
- Examine impact within the academic community
Emerging Trends and Underexplored Areas:
- Identify emerging trends in MLLM research
- Propose potential directions for future studies
Survey Intended Usage:
- Offer researchers/practitioners a comprehensive understanding of current state of MLLM evaluation
- Facilitate further progress in the rapidly evolving field
Introduction:
- Purpose of Survey: synthesize key insights from existing MLLM surveys, organize them into 11 core areas
- Goals: provide structured overview of critical advancements, identify trends and challenges, suggest future research directions
Background:
- Multimodal Large Language Models (MLLMs) allow processing diverse modalities, enabling comprehensive applications
- Rapid development, leading to a wealth of surveys on specific aspects
- Complexity and volume of literature make it difficult for researchers and practitioners to grasp the field's state
Approach:
- Synthesize findings from various surveys
- Identify major themes, trends, and challenges in MLLM evaluation
- Examine current methodologies for application and assessment
- Highlight future research directions and underexplored areas within the MLLM landscape.
Natural Language Processing (NLP) Development Stages
- Statistical and rule-based methods: Early focus of NLP research
- Markov process used for sequence data handling
- Hidden Markov Models (HMMs) introduced for continuous speech recognition
- N-gram models established simple yet effective foundation for language models
Neural Networks and Deep Learning
- Long Short-Term Memory (LSTM) networks: Proposed in 1997 to handle long sequential data
- Recurrent Neural Networks (RNNs) further developed for document recognition in 1998
Word Embeddings
- Word2Vec model: Introduced in 2003, capturing semantic relationships between words
- Skip-gram and CBOW models improved quality of word vectors
Deep Learning Advancements
- ImageNet launched for object recognition and image classification tasks (2009)
- Abundant data resources enabled training deep neural networks on large datasets
- AlexNet: Successful in ImageNet competition, improving image classification performance (2012)
Sequence-to-Sequence Learning (Seq2Seq) Model
- Proposed for machine translation using an encoder-decoder structure (2014)
Attention Mechanism
- Introduced by Bahdanau et al. in 2015 to dynamically align source and target sequences
- Improved machine translation performance
Byte Pair Encoding (BPE) Algorithm
- Proposed to handle rare and out-of-vocabulary words through subword units (2016)
ELMo Model
- Introduced in 2018, capturing contextual information through bidirectional LSTM networks
- Improved the quality of word representations by capturing lexical semantics and syntactic information.
Transformer, BERT, and GPT:
- Foundation models of large language models
- Play a crucial role in the field of natural language processing (NLP)
- Drive development of numerous applications and research advancements
Transformer Model
Components:
- Encoder: self-attention mechanism, feed-forward neural network
- Decoder: self-attention mechanism, encoder-decoder attention mechanism, feed-forward neural network
Self-Attention Mechanism:
- Operates on query matrix Q, key matrix K, value matrix V
- Computation: Attention(Q,K,V) = softmax(QKTdk)V
Multi-Head Attention Mechanism:
- Compute multiple self-attention heads in parallel and concatenate results
- Each head computed as: headi = Attention(QWiQ, KWiK, VWiV)
Feed-Forward Neural Network:
- Consists of two linear transformations and a ReLU activation function
- Helps transform input data into more abstract representation before passing to next layer
Transformer Variants:
- BERT (Bidirectional Encoder Representations from Transformers)
- Generates contextual representations of words using bidirectional encoder
- Unsupervised pre-training on large corpora, followed by fine-tuning for specific tasks
- Pre-training tasks: Masked Language Model (MLM), Next Sentence Prediction (NSP)
- GPT (Generative Pretrained Transformer)
- Mainly uses unidirectional decoder to generate text
- Unsupervised pre-training on large corpora, followed by fine-tuning for specific tasks
- Training consists of two stages: pre-training and fine-tuning
- Pre-training uses autoregressive method to predict next word, learning sequence information of words.
Multimodal Large Language Models (MLLM)
Overview:
- Significant advancement in AI field
- Integrates multiple modalities: text, images, audio
- Enhances versatility and applicability to various tasks
- Core challenge: effective modality alignment
Modality Alignment:
- Achieved through mapping different types of data into a common representation space
- Improves performance on image captioning, visual question answering, multimodal translation
- Enables more natural and comprehensive human-computer interactions
Early Methods:
- Manually labeled data: costly and inefficient
- Examples: DeViSE, VSE, CCA
- Performance limited due to restricted dataset size
Contrastive Learning-based Multimodal Alignment:
- Bring similar sample pairs closer; push dissimilar pairs apart in shared representation space
- Improved performance through large unlabeled image datasets and contrastive loss functions
CLIP (Contrastive Language-Image Pre-Training):
- Major breakthrough in multimodal data alignment
- Leverages large-scale dataset of image-text pairs
- Employs a contrastive loss function for robust text-image alignment
- Encodes images and texts into common representation space: high similarity for matched pairs, low similarity for mismatched pairs
- Pretraining on hundreds of millions of image-text pairs enables rich cross-modal representations
- Strong zero-shot learning capabilities, can be directly used for feature extraction and applications without fine-tuning
- Text and image encoders can be used independently in various downstream tasks like METER, SimVLM, BLIP-2, Stable Diffusion series.
Model-Based Techniques for Multimodal Alignment
ALIGN (Large-scale Image-Language Pretraining):
- Employs a dual-encoder architecture that aligns images and text in a shared embedding space using contrastive learning
- Significantly improves zero-shot image classification and retrieval
Florence:
- Incorporates a vision transformer model with a text encoder to align image and text modalities
- Demonstrates superior performance in visual understanding tasks
Model-Based Approaches:
- Focus on pretraining using vast amounts of weakly labeled image-text data
- Contribute to better generalization across a range of vision-language benchmarks
UNITER (UNiversal Image-TExt Representation):
- Jointly learns image-text alignments using a transformer-based architecture
- Introduces a unique masked language modeling task to improve contextual understanding between modalities
ViLT (Vision-and-Language Transformer):
- Reduces the complexity of multimodal fusion by directly feeding visual patch embeddings into a transformer
- Bypasses the need for separate image encoders while maintaining competitive performance in tasks like image-text retrieval and visual question answering
Model-Based Techniques:
- Highlight the versatility and scalability of transformer-based approaches in multimodal alignment tasks
- Address both high-level semantic matching and fine-grained object-level alignment
Survey Synthesis Paper on Latest Machine Learning and Language Models (MLLM)
Methodology
- This paper consolidates findings from 58 most recent and advanced surveys in MLLM domain
- Surveys are categorized into key themes: model architectures, evaluation, applications, security, bias, future directions.
- Selection criteria: recency and broad coverage of MLLM domain
Analysis of Each Survey
- Technical focus: architectures, models, datasets
- Applications: computer vision, healthcare, robotics etc.
- Security and biases: model safety, fairness, robustness.
- Emerging trends: future directions, new paradigms.
Applications and Agents
Legal Domain:
- LLMs used for tasks like legal advice, document processing, and judicial assistance
- Specialized LLMs, such as LawGPT and LexiLaw, address legal language and reasoning nuances
- Challenges: Maintaining judicial independence, ethical implications of biased data in legal decision-making
Autonomous Driving:
- MLLMs used to improve perception, decision-making, human-vehicle interaction
- Key trend: Integration of multimodal data (LiDAR, maps, images) enhances vehicle's ability to process complex driving environments
- Challenges: Real-time data processing, safety in diverse driving conditions
Mathematics:
- LLMs used for mathematical tasks like calculation and reasoning
- Technique like Chain-of-Thought (CoT) prompting improves model performance
- Scarcity of high-quality datasets, complexity of mathematical reasoning are ongoing challenges
Healthcare:
- Multimodal learning used for image fusion, report generation, cross-modal retrieval
- Rise of foundation models like GPT-4 and CLIP in processing medical data
- Advancements not yet achieved universal intelligence; concerns related to data integration, ethical considerations remain barriers
Robotics:
- LLMs used to improve perception, decision-making, control in robotic systems
- Key trend: Potential for LLMs to advance embodied intelligence, where robots understand and interact with the physical world
- Challenges: Real-time perception, control, integration with existing technologies
Multilingualism:
- LLMs can process multiple languages, but challenges remain for low-resource languages and security issues in multilingual models
- Techniques like multilingual Chain-of-Thought reasoning show promise for future development
Gaming:
- LLMs used to generate dynamic Non-Player Characters (NPCs), enhance player interaction, assist in game design
- Challenge: Hallucinations where LLMs generate plausible but incorrect outputs are a major limitation
- Improving context management and memory within gaming systems is a future research priority
Audio Processing:
- LLMs used for Automatic Speech Recognition (ASR) and music generation
- Key trend: Integration of multimodal data from speech, music, and environmental sounds into a single model
- Challenges: Scalability, data diversity
Video Understanding:
- Vid-LLMs combine video, text, and audio inputs to analyze and understand video content
- Promising for tasks like video summarization and captioning
- Challenges: Processing long videos, maintaining contextual coherence
Citation Analysis:
- LLMs significantly improve citation recommendation, classification, and summarization tasks
- Citation data enhances LLM performance by enabling multi-hop knowledge across documents
- Future research needs to address the expansion of citation networks and integration of non-academic sources.
Evaluation and Benchmarks of Multimodal Large Language Models (MLLMs)
Importance: Evaluation and benchmarking are crucial for understanding MLLM performance across a diverse range of tasks and datasets.
Categorizing Evaluations
- Perception and Understanding: Interpreting multimodal inputs, integrating information across modalities (VQA datasets like VQAv2).
- Cognition and Reasoning: Logical reasoning, problem-solving, multimodal reasoning (complex visual question answering).
- Robustness and Safety: Performing under adversarial prompts or out-of-distribution data. Managing hallucinations and biases.
- Domain-Specific Capabilities: Medical image interpretation, legal text analysis (TallyQA).
- Task-Specific Benchmarks: Traditional vs. advanced datasets (VQAv2 vs. TDIUC).
- Multimodal Reasoning and Interaction: Evaluating interaction between modalities (multimodal dialogue, visual question answering).
- Ethical and Societal Implications: Ensuring fairness and avoiding societal biases. Ensuring trustworthiness and safety (handling uncertainty, avoiding hallucinations).
Micro vs. Macro Performance: Analyze both micro performance (equal weight) and macro performance (averaged) for a clearer understanding of model strengths and weaknesses.
Generalization Across Tasks: Evaluate models not only on specific tasks but also on their ability to generalize across different question types.
MLLM Surveys: Efficiency and Adaptation
Efficiency Approaches for MLLMs:
- Xu et al. [62] and Jin et al. [63] emphasize the need for accessibility in resource-constrained environments
- Comprehensive taxonomies cover:
- Advancements in architectures
- Vision-language integration
- Training methods
- Benchmarks that optimize MLLM efficiency
- Key methods discussed:
- Vision token compression
- Parameter-efficient fine-tuning
- Transformer-alternative models: Mixture of Experts (MoE), state space models
- Balancing computational efficiency with task performance is the goal
Adapting MLLMs to Specific Tasks:
- Liu et al. [64] address adapting MLLMs to specific tasks with limited labeled data
- Categorizes approaches into:
- Prompt-based methods
- Adapter-based methods
- External knowledge-based methods
- These approaches help models generalize better in fine-grained domains
- Few-shot adaptation techniques: visual prompt tuning, adapter fine-tuning are critical for extending the usability of large multimodal models
Challenges Ahead:
- Domain adaptation, model selection, and integration of external knowledge remain ongoing challenges
- Future: MLLMs become more efficient, flexible, and adaptive in handling diverse real-world tasks.
Bai et al.'s Survey on Data-Centric Approach for Multimodal Large Language Models (MLLM)
Importance of Quality, Diversity, and Volume of Multimodal Data:
- Emphasizes the significance of high-quality, diverse, and abundant multimodal data (text, images, audio, video) in training MLLMs effectively.
Challenges in Multimodal Data Collection:
- Identifies issues: data sparsity and noise
- Potential solutions: synthetic data generation and active learning to mitigate these challenges.
Advocating for a More Data-Centric Approach:
- Refinement and curation of data take precedence over model development.
- Improves model performance and advances MLLM capabilities in complex landscapes.
Yang et al.'s Research on Large Language Models (LLMs) and Code:
Role of Code in LLMs:
- Enhances LLMs’ capacity to function as intelligent agents through automation, code generation, software development tasks.
Challenges Around Code Integration:
- Correctness: ensuring the accuracy of generated or manipulated code.
- Efficiency: optimizing resources for large-scale code processing.
- Security: protecting sensitive data during model training and deployment.
Potential of LLMs to Become Autonomous Agents:
- Managing complex tasks across various domains.
- Broadening the scope and impact of LLMs in AI-driven innovations.
Large Language Models (LLMs) Integration with External Knowledge
Integration Methods:
- Knowledge editing: Modifying input or model to update outdated/incorrect information
- Retrieval augmentation: Fetching external information during inference without altering core parameters
Approaches and Taxonomy:
- Categorizes integration methods into knowledge editing and retrieval augmentation
- Benchmarks for evaluation, applications: LangChain, ChatDoctor
- Handling knowledge conflicts, future research directions
Continual Learning (CL) of LLMs:
- Shi et al. survey: Extensive overview of CL challenges and methodologies in the field
- Two primary directions: vertical continual learning (general to specific domains), horizontal continual learning (across time across various domains)
- Problems: "catastrophic forgetting" (models lose knowledge), maintaining performance on old and new tasks
CL Methods:
- Continual pre-training
- Domain-adaptive pre-training
- Continual fine-tuning
Mitigating Forgetting:
- Replay-based methods: Reusing samples from previous tasks
- Regularization-based methods: Adding constraints to prevent forgetting
- Architecture-based methods: Designing model structures that improve knowledge retention
Call for Research:
- Evaluation benchmarks and methodologies to counter forgetting
- Supporting knowledge transfer in LLMs
Multimodal Large Language Models (MLLMs) Evaluation Benchmarks
Challenges of Evaluating MLLMs:
- Lu et al. [55] discuss the complexities of evaluating MLLMs, focusing on tasks like cross-modal retrieval, caption generation, and Visual Question Answering (VQA)
- Current evaluation frameworks fail to capture the nuances of multimodal interactions
Need for Tailored Evaluation Metrics:
- Li and Lu [69] argue the absence of standardized protocols across different modalities hinders fair comparison of models
- They call for establishment of consistent evaluation frameworks that ensure reliable and equitable performance assessment
Evaluation Methodologies for Large Language Models (LLMs):
- Chang et al. [57] examine evaluation methodologies for LLMs, emphasizing the importance of moving beyond task-specific performance metrics
- They point out existing benchmarks overlook critical issues like hallucinations, fairness, and societal implications
- The authors advocate for more holistic evaluation practices that consider both technical capabilities and trustworthiness, robustness, and ethical alignment in real-world applications.
Autonomous Agents Leveraging Large Language Models (LLMs)
Components of Autonomous Agents:
- Perception: Agents perceive their environment
- Memory: Agents recall previous interactions
- Planning: Agents plan actions in real-time
- Action: Agents execute actions
Benefits of this Architecture:
- Highly adaptable across various domains (e.g., digital assistants, autonomous vehicles)
- Incorporating multimodal inputs expands the perception-action loop
Challenges:
- Knowledge boundaries: Agents are constrained in specialized or underexplored domains
- Prompt robustness: Even minor changes can lead to unpredictable behavior, including hallucinations
- Catastrophic forgetting: Agents fail to retain knowledge after being updated with new information
Applications of LLM-based Agents:
- Single agents, multi-agent systems, and human-agent interactions
- Social sciences, natural sciences, and engineering applications
- Robotics and embodied AI, where agents make real-time decisions based on multimodal inputs
- Scientific research automation (e.g., experiment design, planning, execution)
Future Research:
- Develop more robust multimodal perception systems
- Improve LLM inference efficiency
- Establish ethical frameworks to guide agent decision-making
- Enhance memory integration and refine prompt design
Multimodal Large Language Models (MLLMs) in Graph Learning:
Applications:
- Increasingly used for graph learning tasks
- Outperform traditional Graph Neural Networks (GNNs)
- Enhance representational power of GNNs
- Improved performance: classification, prediction, reasoning tasks
Integration with Graphs:
- Enhancers: Jin et al. propose this category for MLLMs that enhance the capabilities of graphs [74]
- Predictors: MLLMs used to predict graph node properties or relationships between nodes [75, 76]
- Alignment Components: Integration of MLLMs with graphs for aligning different data modalities [74]
Advantages:
- Utilize textual attributes and other modalities
- Improve performance in various graph tasks
Retrieval-Augmented Generation (RAG) in Multimodal Large Language Models (MLLM)
Background:
- MLLMs: remarkable capabilities across modalities like text, images, and audio
- Limitations: reliance on static training data hinders accurate responses in rapidly changing contexts
Solution: Retrieval-Augmented Generation (RAG)
- Dynamically retrieves relevant external information before generation process
- Incorporates real-time and contextually accurate information
- Enhances factuality and robustness of MLLM outputs
Benefits:
- Improves knowledge richness in natural language processing tasks
- Mitigates long-tail knowledge gaps and hallucination issues
- Boosts diversity and robustness by leveraging multimodal data sources like images and videos
- Generates more accurate and contextually grounded outputs, especially for cross-modal reasoning tasks such as visual question answering and complex dialogue generation.
Applications:
- Surveying RAG's application in natural language processing (NLP) [77]
- Effectively mitigates long-tail knowledge gaps and hallucination issues in NLP tasks
- Multimodal information retrieval augments generation [79]
- Improves diversity and robustness by leveraging multimodal data sources
- Powers MLLMs to generate more accurate outputs.
Key Themes in Multimodal Large Language Models (MLLMs)
5 Major Themes Emerging from Surveys
- Architectures of MLLMs: Transformer-based models dominate, with innovations like CLIP, DALL-E, and Flamingo exemplifying progress in aligning text and visual data. Comparison between early fusion (integrating modalities early) and late fusion strategies.
- Datasets and Training: Massive, multimodal datasets such as MS-COCO, Visual Genome, and custom-curated sets like LAION. Pretraining on large-scale datasets remains a core strategy; surveys offer comprehensive taxonomy of pretraining methodologies.
- Evaluation and Metrics: Evaluation challenges due to limitations of traditional language or vision metrics. Cross-modal retrieval, image captioning, and visual question answering (VQA) are popular benchmarks. Surveys discuss new methods for evaluating multimodal coherence and reasoning.
5.3 Security: Adversarial Attacks
- Susceptibility to attacks: MLLMs are susceptible to adversarial attacks that exploit weaknesses in visual inputs, leading to incorrect or harmful text generation.
- Types of attacks: Jailbreaking attacks that bypass safety alignment mechanisms, prompt injection, and white-box attacks utilizing gradient information to craft adversarial examples.
- Challenges in handling cross-modal perturbations: Exploiting the model's difficulty in handling cross-modal perturbations, making it challenging to detect malicious inputs.
- Defense mechanisms: Improving cross-modal alignment algorithms and developing new defense mechanisms to enhance robustness across modalities. Addressing vulnerabilities in the cross-modal alignment process.
5.3.1 Security: Adversarial Attacks - Hallucinations and Data Bias
- Hallucination phenomenon: Deviation of generated outputs from visual input, leading to fabrication or misrepresentation of objects or relationships. Often a byproduct of data bias in training sets and limitations of vision encoders.
- Causes of hallucinations: Memorization of biases present in training datasets, reliance on noisy or incomplete data, inherent limitations of cross-modal alignment mechanisms.
- Impact of hallucinations: Misleads users and exacerbates existing biases in models' outputs, leading to incorrect assumptions about objects' existence, attributes, or relationships.
- Addressing hallucinations: Improve modality alignment mechanisms, enhance vision encoder capabilities, employ specialized evaluation benchmarks and techniques like counterfactual data augmentation and post-processing corrections.
5.3.2 Security: Adversarial Attacks - Bias: Hallucinations and Data Bias (cont'd)
- Impact on fairness: Hallucinations can reinforce existing biases in models' outputs, leading to incorrect assumptions about objects' existence, attributes, or relationships.
- Addressing bias: Enhancing the capabilities of vision encoders, ensuring language models respect visual context constraints, and employing post-processing corrections, counterfactual data augmentation, and diverse visual instructions to generalize better to unseen scenarios.
5.3.3 Fairness: Adversarial Training and Human Feedback
- Defense strategies: Employ adversarial training, data augmentation, safety steering vectors, multimodal input validation to strengthen the model's robustness against adversarial inputs. Leverage Reinforcement Learning from Human Feedback (RLHF) to adjust outputs based on user feedback and maintain fairness by incorporating diverse viewpoints and mitigating biases present in training data.
- Addressing jailbreaking attacks: Use prompt tuning and gradient-based defenses to reduce the success rate of jailbreaking attacks, ensuring that the model remains secure and aligned with ethical guidelines.
Key Trends in Multimodal Large Language Models (MLLM)
- Increased Integration of Multimodality:
- Combination of various data inputs (text, images, audio) for a holistic understanding
- Transformative leap from unimodal to multimodal systems
- Focus on achieving complex, multimodal understanding in applications like autonomous agents and medical diagnostics
- Expansion into Diverse Domains:
- MLLMs making significant impacts across industries
- Growing trend towards domain-specific adaptations of these models
- Evaluation Metrics and Benchmarking:
- Development of new benchmarks for performance assessment
- Critical need for robust evaluation metrics and standardized assessments
- Ethical and Security Considerations:
- Growing concerns over ethical implications and security risks associated with MLLMs
- Importance of responsible development and deployment of these models
- Efficiency and Optimization:
- Focus on making MLLMs more accessible and scalable through efficiency improvements
Research Gaps
- Integration of Lesser-Known Modalities: haptic feedback, olfactory data, advanced sensory inputs
- Longitudinal Studies on Model Performance: understanding the long-term viability and scalability of MLLMs
- Cross-Domain Applications and Transfer Learning: effectiveness of transfer learning in different domains; generalization capabilities of MLLMs
- Ethical Implications in Non-Textual Modalities: addressing ethical challenges unique to images, video, etc.
- Impact of Multilingualism on Multimodality: intersection of multilingualism and multimodality, leading to more inclusive MLLMs.
Key Insights on Multimodal Large Language Models (MLLMs)
- Multimodal capabilities enhance performance: MLLMs can process data from multiple sources, such as text, images, and videos, improving intelligence in perception and reasoning.
- Increased size and training data lead to improved intelligence: As models grow larger and training datasets expand, MLLMs become more intelligent but require greater computational resources.
- Challenges remain in robustness, interpretability, and fairness: Practical applications of MLLMs pose challenges related to robustness against various forms of input manipulation, ensuring explainable behavior, and addressing biases.
- MLLMs expand potential applications across industries: Rapid technological advancements drive the growth of multimodal AI solutions in numerous sectors.
Future Directions for Surveys on MLLMs
- Emerging areas: Explore new trends arising from integration with generative AI and self-supervised learning technologies.
- Data diversity and challenges: Deepen discussions on managing complex multimodal data through construction, annotation, and management of large-scale datasets.
- Model evaluation and standardization: Systematically analyze evaluation standards, performance metrics, robustness, and fairness of MLLMs in various tasks and domains.
- Real-world applications and ethics: Examine real-world applications of MLLMs, addressing issues related to privacy, security, ethical considerations, and balancing innovation with risks.
- Optimization of computational resources: Further explore efficient use of resources and model compression techniques for MLLM applications in resource-constrained environments.