Advances in Multimodal Adaptation and Generalization: From Traditional Approaches to Foundation Models
• arXiv 2025 •
This survey covers five multimodal adaptation and generalization scenarios: (a) multimodal domain adaptation, (b) multimodal test-time adaptation, and (c) multimodal domain generalization, which represent traditional multimodal settings with varying access to source- and target-domain data. In addition, we examine two scenarios centered on foundation models: (d) unimodal domain adaptation and generalization assisted by multimodal foundation models, and (e) the adaptation of multimodal foundation models to downstream tasks.
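At a glance, these scenarios differ mainly in which data and pre-trained models are assumed to be available during training and at test time. The minimal Python sketch below (illustrative only, not code from the survey or the listed papers) summarizes the typical assumptions behind each setting:

```python
# Illustrative summary of the typical data/model access assumptions behind the
# five scenarios above (standard definitions; the wording here is ours).
SCENARIOS = {
    "(a) multimodal domain adaptation":
        "labeled multimodal source data plus unlabeled target data are available during training",
    "(b) multimodal test-time adaptation":
        "a source-trained model is adapted online using only the unlabeled test samples it receives",
    "(c) multimodal domain generalization":
        "training uses source domains only; the target domain stays unseen until deployment",
    "(d) unimodal DA/DG assisted by multimodal foundation models":
        "a pre-trained multimodal foundation model (e.g., CLIP) supports adaptation or generalization of a unimodal task",
    "(e) adaptation of multimodal foundation models":
        "the foundation model itself is adapted to downstream tasks, e.g., via prompt learning, adapters, or fine-tuning",
}

if __name__ == "__main__":
    for name, assumption in SCENARIOS.items():
        print(f"{name}: {assumption}")
```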
🤗 Contributions of new resources and articles are always welcome!
Citation

If you find our work useful in your research, please consider citing our paper:
@article{dong2025mmdasurvey,
  author  = {Dong, Hao and Liu, Moru and Zhou, Kaiyang and Chatzi, Eleni and Kannala, Juho and Stachniss, Cyrill and Fink, Olga},
  title   = {Advances in Multimodal Adaptation and Generalization: From Traditional Approaches to Foundation Models},
  journal = {arXiv preprint arXiv:2501.18592},
  year    = {2025},
}
Table of Contents

- Citation
- Table of Contents
- Multimodal Domain Adaptation
- Multimodal Test-time Adaptation
- Multimodal Domain Generalization
- Domain Adaptation and Generalization with the Help of Multimodal Foundation Models
- Adaptation of Multimodal Foundation Models
Multimodal Domain Adaptation

(ACM MM 2018) A Unified Framework for Multimodal Domain Adaptation by Qi et al.
(CVPR 2020) Multi-Modal Domain Adaptation for Fine-Grained Action Recognition [Code] by Munro et al.
(CVPR 2021) Spatio-temporal Contrastive Domain Adaptation for Action Recognition by Song et al.
(ICCV 2021) Learning Cross-modal Contrastive Features for Video Domain Adaptation by Kim et al.
(TIP 2021) Progressive Modality Cooperation for Multi-Modality Domain Adaptation by Zhang et al.
(ACM MM 2021) Differentiated Learning for Multi-Modal Domain Adaptation by Lv et al.
(CVPR 2022) Audio-Adaptive Activity Recognition Across Video Domains [Code] by Zhang et al.
(CVPR 2022) Interact before Align: Leveraging Cross-Modal Knowledge for Domain Adaptive Action Recognition by Yang et al.
(ACM MM 2022) Relative Alignment Network for Source-Free Multimodal Video Domain Adaptation by Huang et al.
(ACM MM 2022) Mix-DANN and Dynamic-Modal-Distillation for Video Domain Adaptation by Yin et al.
(ECCV 2024) Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-supervision [Code] by Dong et al.
(CVPR 2020) xMUDA: Cross-Modal Unsupervised Domain Adaptation for 3D Semantic Segmentation [Code] by Jaritz et al.
(ICCV 2021) Sparse-to-dense Feature Matching: Intra and Inter domain Cross-modal Learning in Domain Adaptation for 3D Semantic Segmentation [Code] by Peng et al.
(ISPRS 2021) Adversarial unsupervised domain adaptation for 3D semantic segmentation with multi-modal learning by Liu et al.
(ECCV 2022) Drive&Segment: Unsupervised Semantic Segmentation of Urban Scenes via Cross-modal Distillation [Code] by Vobecky et al.
(TPAMI 2022) Cross-modal Learning for Domain Adaptation in 3D Semantic Segmentation [Code] by Jaritz et al.
(ACM MM 2022) Cross-Domain and Cross-Modal Knowledge Distillation in Domain Adaptation for 3D Semantic Segmentation by Li et al.
(ACM MM 2022) Self-supervised Exclusive Learning for 3D Segmentation with Cross-modal Unsupervised Domain Adaptation by Zhang et al.
(ICCV 2023) CrossMatch: Source-Free Domain Adaptive Semantic Segmentation via Cross-Modal Consistency Training by Yin et al.
(ICCV 2023) SUMMIT: Source-Free Adaptation of Uni-Modal Models to Multi-Modal Targets [Code] by Simons et al.
(AAAI 2023) Cross-Modal Contrastive Learning for Domain Adaptation in 3D Semantic Segmentation by Xing et al.
(RAS 2023) Real-time multi-modal semantic fusion on unmanned aerial vehicles with label propagation for cross-domain adaptation by Bultmann et al.
(CVPRW 2023) Exploiting the Complementarity of 2D and 3D Networks to Address Domain-Shift in 3D Semantic Segmentation by Cardace et al.
(IROS 2023) DualCross: Cross-Modality Cross-Domain Adaptation for Monocular BEV Perception [Code] by Man et al.
(ACM MM 2023) Cross-modal & Cross-domain Learning for Unsupervised LiDAR Semantic Segmentation by Chen et al.
(ACM MM 2023) Cross-modal Unsupervised Domain Adaptation for 3D Semantic Segmentation via Bidirectional Fusion-then-Distillation by Wu et al.
(AAAI 2023) Mx2M: Masked Cross-Modality Modeling in Domain Adaptation for 3D Semantic Segmentation by Zhang et al.
(ICRA 2024) MoPA: Multi-Modal Prior Aided Domain Adaptation for 3D Semantic Segmentation [Code] by Cao et al.
(ECCV 2024) MICDrop: Masking Image and Depth Features via Complementary Dropout for Domain-Adaptive Semantic Segmentation by Yang et al.
(NeurIPS 2024) UniDSeg: Unified Cross-Domain 3D Semantic Segmentation via Visual Foundation Models Prior by Wu et al.
(ACM MM 2024) CLIP2UDA: Making Frozen CLIP Reward Unsupervised Domain Adaptation in 3D Semantic Segmentation by Wu et al.
(ICRA 2025) SAM-guided Pseudo Label Enhancement for Multi-modal 3D Semantic Segmentation by Yang et al.
(TMM 2019) Deep Multi-Modality Adversarial Networks for Unsupervised Domain Adaptation by Ma et al.
(JBHI 2022) A Novel 3D Unsupervised Domain Adaptation Framework for Cross-Modality Medical Image Segmentation by Yao et al.
(CVPR 2023) OSAN: A One-Stage Alignment Network to Unify Multimodal Alignment and Unsupervised Domain Adaptation by Liu et al.
(ACL 2024) Amanda: Adaptively Modality-Balanced Domain Adaptation for Multimodal Emotion Recognition by Zhang et al.
(WACVW 2024) Source-Free Domain Adaptation for RGB-D Semantic Segmentation with Vision Transformers by Rizzoli et al.
Multimodal Test-time Adaptation

(ICLR 2024) Test-time Adaptation against Multi-modal Reliability Bias [Code] by Yang et al.
(CVPR 2024) Modality-Collaborative Test-Time Adaptation for Action Recognition by Xiong et al.
(ICMLW 2024) Two-Level Test-Time Adaptation in Multimodal Learning by Lei et al.
(ICLR 2025) Towards Robust Multimodal Open-set Test-time Adaptation via Adaptive Entropy-aware Optimization [Code] by Dong et al.
(ICLR 2025) Test-Time Adaptation for Combating Missing Modalities in Egocentric Videos by Ramazanova et al.
(ICLR 2025) Smoothing the Shift: Towards Stable Test-time Adaptation under Complex Multimodal Noises by Guo et al.
(CVPR 2022) MM-TTA: Multi-Modal Test-Time Adaptation for 3D Semantic Segmentation by Shin et al.
(CVPR 2023) Multi-Modal Continual Test-Time Adaptation for 3D Semantic Segmentation by Cao et al.
(ECCV 2024) Reliable Spatial-Temporal Voxels For Multi-Modal Test-Time Adaptation [Code] by Cao et al.
(AAAI 2024) Heterogeneous Test-Time Training for Multi-Modal Person Re-identification by Wang et al.
(CVPR 2024) Test-Time Adaptation for Depth Completion [Code] by Park et al.
Multimodal Domain Generalization

(WACV 2022) Domain Generalization through Audio-Visual Relative Norm Alignment in First Person Action Recognition by Planamente et al.
(NeurIPS 2023) SimMMDG: A Simple and Effective Framework for Multi-modal Domain Generalization [Code] by Dong et al.
(ECCV 2024) Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-supervision [Code] by Dong et al.
(IJCV 2024) Relative Norm Alignment for Tackling Domain Shift in Deep Multi-modal Classification by Planamente et al.
(NeurIPS 2024) Cross-modal Representation Flattening for Multi-modal Domain Generalization [Code] by Fan et al.
(ICCV 2023) BEV-DG: Cross-Modal Learning under Bird’s-Eye View for Domain Generalization of 3D Semantic Segmentation by Li et al.
Domain Adaptation and Generalization with the Help of Multimodal Foundation Models

(ICLR 2023) Using Language to Extend to Unseen Domains [Code] by Dunlap et al.
(CVPR 2023) CLIP the Gap: A Single Domain Generalization Approach for Object Detection by Vidit et al.
(ICCV 2023) PØDA: Prompt-driven Zero-shot Domain Adaptation [Code] by Fahes et al.
(ICCV 2023) PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization by Cho et al.
(ICMLW 2024) Leveraging Generative Foundation Models for Domain Generalization by Hemati et al.
(CVPR 2024) Collaborating Foundation Models for Domain Generalized Semantic Segmentation [Code] by Benigmim et al.
(CVPR 2024) Unknown Prompt, the only Lacuna: Unveiling CLIP's Potential for Open Domain Generalization [Code] by Singha et al.
(CVPR 2024) Unified Language-driven Zero-shot Domain Adaptation [Code] by Yang et al.
(ECCV 2024) DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control [Code] by Jia et al.
(ICCV 2023) A Sentence Speaks a Thousand Images: Domain Generalization through Distilling CLIP with Language Guidance by Huang et al.
(ICCV 2023) The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation [Code] by Zara et al.
(ICCV 2023) Distilling Large Vision-Language Model with Out-of-Distribution Generalizability [Code] by Li et al.
(CVPR 2024) PracticalDG: Perturbation Distillation on Vision-Language Models for Hybrid Domain Generalization by Chen et al.
(CVPR 2024) Source-Free Domain Adaptation with Frozen Multimodal Foundation Model [Code] by Tang et al.
(CVPR 2024) Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification by Addepalli et al.
(ECCV 2024) Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation [Code] by Mistretta et al.
(arXiv 2021) Domain Prompt Learning for Efficiently Adapting CLIP to Unseen Domains [Code] by Zhang et al.
(arXiv 2022) Prompt Vision Transformer for Domain Generalization [Code] by Zheng et al.
(ECCV 2022) Domain Generalization by Mutual-Information Regularization with Pre-trained Models [Code] by Cha et al.
(ICLR 2022) Optimal Representations for Covariate Shift by Ruan et al.
(TNNLS 2023) Domain Adaptation via Prompt Learning [Code] by Ge et al.
(ICML 2023) CLIPood: Generalizing CLIP to Out-of-Distributions [Code] by Shu et al.
(CVPR 2023) Back to the Source: Diffusion-Driven Adaptation to Test-Time Corruption [Code] by Gao et al.
(CVPR 2023) AutoLabel: CLIP-based framework for Open-set Video Domain Adaptation [Code] by Zara et al.
(ICCV 2023) PADCLIP: Pseudo-labeling with Adaptive Debiasing in CLIP for Unsupervised Domain Adaptation by Lai et al.
(NeurIPS 2023) Multi-Prompt Alignment for Multi-Source Unsupervised Domain Adaptation by Chen et al.
(NeurIPS 2023) Diffusion-TTA: Test-time Adaptation of Discriminative Models via Generative Feedback [Code] by Prabhudesai et al.
(NeurIPS 2023) Diffusion-Based Probabilistic Uncertainty Estimation for Active Domain Adaptation by Du et al.
(CVPR 2024) Disentangled Prompt Representation for Domain Generalization by Cheng et al.
(CVPR 2024) Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation [Code] by Wei et al.
(CVPR 2024) Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation [Code] by Li et al.
(CVPR 2024) Prompt-Driven Dynamic Object-Centric Learning for Single Domain Generalization by Li et al.
(CVPR 2024) Any-Shift Prompting for Generalization over Distributions by Xiao et al.
(CVPRW 2024) Exploring the Benefits of Vision Foundation Models for Unsupervised Domain Adaptation [Code] by Englert et al.
(ECCV 2024) Learning to Adapt SAM for Segmenting Cross-domain Point Clouds by Peng et al.
(ECCV 2024) Learning Representations from Foundation Models for Domain Generalized Stereo Matching by Zhang et al.
(ECCV 2024) Cross-Domain Semantic Segmentation on Inconsistent Taxonomy using VLMs by Lim et al.
(ECCV 2024) CloudFixer: Test-Time Adaptation for 3D Point Clouds via Diffusion-Guided Geometric Transformation [Code] by Shim et al.
(ECCV 2024) Deep Diffusion Image Prior for Efficient OOD Adaptation in 3D Inverse Problems [Code] by Chung et al.
(ECCV 2024) Efficient Diffusion-Driven Corruption Editor for Test-Time Adaptation [Code] by Oh et al.
(ECCV 2024) HVCLIP: High-dimensional Vector in CLIP for Unsupervised Domain Adaptation by Vesdapunt et al.
(ECCV 2024) Soft Prompt Generation for Domain Generalization [Code] by Bai et al.
(WACV 2024) Empowering Unsupervised Domain Adaptation with Large-scale Pre-trained Vision-Language Models by Lai et al.
(WACV 2024) ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation by Hu et al.
(IJCV 2024) Source-Free Domain Adaptation Guided by Vision and Vision-Language Pre-Training [Code] by Zhang et al.
(NeurIPS 2024) Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation [Code] by Xia et al.
(NeurIPS 2024) UniDSeg: Unified Cross-Domain 3D Semantic Segmentation via Visual Foundation Models Prior by Wu et al.
(NeurIPS 2024) CLIPCEIL: Boosting Domain Generalization for CLIP by Channel rEfinement and Image-text aLignment by Yu et al.
(ACM MM 2024) CLIP2UDA: Making Frozen CLIP Reward Unsupervised Domain Adaptation in 3D Semantic Segmentation by Wu et al.
(arXiv 2024) Visual Foundation Models Boost Cross-Modal Unsupervised Domain Adaptation for 3D Semantic Segmentation [Code] by Xu et al.
(arXiv 2024) Open-Set Domain Adaptation with Visual-Language Foundation Models by Yu et al.
(arXiv 2024) CLIP the Divergence: Language-guided Unsupervised Domain Adaptation by Zhu et al.
(arXiv 2024) Transitive Vision-Language Prompt Learning for Domain Generalization by Chen et al.
Adaptation of Multimodal Foundation Models

(IJCV 2022) Learning to Prompt for Vision-Language Models [Code] by Zhou et al.
(CVPR 2022) Conditional Prompt Learning for Vision-Language Models [Code] by Zhou et al.
(CVPR 2022) Prompt Distribution Learning by Lu et al.
(CVPR 2022) DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [Code] by Rao et al.
(NeurIPS 2022) DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations [Code] by Sun et al.
(NeurIPS 2022) Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models by Shu et al.
(EMNLP 2022) CPL: Counterfactual Prompt Learning for Vision and Language Models [Code] by He et al.
(arXiv 2022) Prompt Tuning with Soft Context Sharing for Vision-Language Models [Code] by Ding et al.
(arXiv 2022) Unsupervised Prompt Learning for Vision-Language Models [Code] by Huang et al.
(arXiv 2022) Unified Vision and Language Prompt Learning by Zang et al.
(arXiv 2022) Exploring Visual Prompts for Adapting Large-Scale Models [Code] by Bahng et al.
(ICLR 2023) PLOT: Prompt Learning with Optimal Transport for Vision-Language Models [Code] by Chen et al.
(CVPR 2023) LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models by Bulat et al.
(CVPR 2023) Texts as Images in Prompt Tuning for Multi-Label Image Recognition [Code] by Guo et al.
(CVPR 2023) Visual-Language Prompt Tuning with Knowledge-guided Context Optimization [Code] by Yao et al.
(CVPR 2023) MaPLe: Multi-modal Prompt Learning [Code] by Khattak et al.
(ICCV 2023) Prompt-aligned Gradient for Prompt Tuning [Code] by Zhu et al.
(ICCV 2023) Self-regulating Prompts: Foundational Model Adaptation without Forgetting [Code] by Khattak et al.
(ICCV 2023) Bayesian Prompt Learning for Image-Language Model Generalization [Code] by Derakhshani et al.
(ICCV 2023) Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning [Code] by Feng et al.
(NeurIPS 2023) Benchmarking robustness of adaptation methods on pre-trained vision-language models [Code] by Chen et al.
(NeurIPS 2023) SwapPrompt: Test-Time Prompt Adaptation for Vision-Language Models by Ma et al.
(NeurIPS 2023) Align Your Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization [Code] by Hassan et al.
(TCSVT 2023) Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models [Code] by Ma et al.
(TMM 2023) Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model [Code] by Xing et al.
(ICLRW 2023) Variational Prompt Tuning Improves Generalization of Vision-Language Models by Derakhshani et al.
(arXiv 2023) Retrieval-Enhanced Visual Prompt Learning for Few-shot Classification by Rong et al.
(WACV 2024) Multitask Vision-Language Prompt Tuning [Code] by Shen et al.
(CVPR 2024) ProTeCt: Prompt Tuning for Taxonomic Open Set Classification by Wu et al.
(ECCV 2024) Cascade Prompt Learning for Vision-Language Model Adaptation by Wu et al.
(ECCV 2024) Quantized Prompt for Efficient Generalization of Vision-Language Models [Code] by Hao et al.
(NeurIPS 2024) Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP by Huang et al.
(TMLR 2024) Unleashing the Power of Visual Prompting At the Pixel Level [Code] by Wu et al.
(arXiv 2021) CLIP-Adapter: Better Vision-Language Models with Feature Adapters [Code] by Gao et al.
(ECCV 2022) Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling [Code] by Zhang et al.
(BMVC 2022) SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained Models [Code] by Pantazis et al.
(arXiv 2022) Improving Zero-Shot Models with Label Distribution Priors by Kahana et al.
(ICCV 2023) SuS-X: Training-Free Name-Only Transfer of Vision-Language Models [Code] by Udandarao et al.
(ICCVW 2023) SAM-Adapter: Adapting Segment Anything in Underperformed Scenes [Code] by Chen et al.
(TMM 2023) SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification by Peng et al.
(CVPR 2024) Efficient Test-Time Adaptation of Vision-Language Models by Karmanov et al.
(ECCV 2024) Improving Zero-Shot Generalization for CLIP with Variational Adapter by Lu et al.
(ECCV 2024) CAT-SAM: Conditional Tuning for Few-Shot Adaptation of Segment Anything Model by Xiao et al.
(MIA 2024) MA-SAM: Modality-agnostic SAM Adaptation for 3D Medical Image Segmentation [Code] by Chen et al.
(arXiv 2021) VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts by Qiu et al.
(CVPR 2022) Robust fine-tuning of zero-shot models [Code] by Wortsman et al.
(ECCV 2022) Extract Free Dense Labels from CLIP [Code] by Zhou et al.
(AAAI 2023) CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention [Code] by Guo et al.
(CVPR 2023) Task Residual for Tuning Vision-Language Models [Code] by Yu et al.
(CVPR 2023) Improving Zero-shot Generalization and Robustness of Multi-modal Models [Code] by Ge et al.
(CVPR 2023) Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models by Lin et al.
(ICLR 2023) Visual Classification via Description from Large Language Models [Code] by Menon et al.
(ICLR 2023) Masked Unsupervised Self-training for Label-free Image Classification [Code] by Li et al.
(ICCV 2023) What does a platypus look like? Generating customized prompts for zero-shot image classification [Code] by Pratt et al.
(ICCV 2023) Black Box Few-Shot Adaptation for Vision-Language models by Ouali et al.
(ICLR 2024) A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation [Code] by Wang et al.
(ICLR 2024) Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models [Code] by Zhao et al.
(ICLR 2024) Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization [Code] by Zang et al.
(ICML 2024) CRoFT: Robust Fine-Tuning with Concurrent Optimization for OOD Generalization and Open-Set OOD Detection [Code] by Zhu et al.
(CVPR 2024) Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation by Zhang et al.
(CVPR 2024) Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models [Code] by Zhang et al.
(CVPR 2024) X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalizations [Code] by Kukleva et al.
(CVPR 2024) A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models [Code] by Silva-Rodriguez et al.
(CVPR 2024) On the test-time zero-shot generalization of vision-language models: Do we really need prompt learning? by Zanella et al.
(CVPR 2024) The Neglected Tails in Vision-Language Models by Parashar et al.
(CVPRW 2024) Low-Rank Few-Shot Adaptation of Vision-Language Models by Zanella et al.
(TIP 2024) Adapting Vision-Language Models via Learning to Inject Knowledge by Xuan et al.
(NeurIPS 2024) Rethinking Misalignment in Vision-Language Model Adaptation from a Causal Perspective by Zhang et al.
(NeurIPS 2024) WATT: Weight Average Test-Time Adaptation of CLIP [Code] by Osowiechi et al.
(NeurIPS 2024) Frustratingly Easy Test-Time Adaptation of Vision-Language Models [Code] by Farina et al.
(NeurIPS 2024) Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models [Code] by Zhang et al.
(NeurIPS 2024) UMFC: Unsupervised Multi-Domain Feature Calibration for Vision-Language Models [Code] by Liang et al.
(NeurIPS 2024) BoostAdapter: Improving Vision-Language Test-Time Adaptation via Regional Bootstrapping [Code] by Zhang et al.
(arXiv 2025) Online Gaussian Test-Time Adaptation of Vision-Language Models [Code] by Fuchs et al.