A curated list of foundation models for vision and language tasks
Awesome Unified Multimodal Models
🔥🔥🔥 A curated list of papers on LLM-based multimodal generation (image, video, 3D, and audio).
A cutting-edge collection and survey of vision-language model papers and model GitHub repositories. Continuously updated.
A curated list of Awesome Personalized Large Multimodal Models resources
Video Search with CLIP
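The core idea behind CLIP-based video search can be sketched in a few lines (this is an assumed approach, not this repository's code): sample frames with OpenCV, embed frames and a text query with Hugging Face's CLIP, and rank frames by cosine similarity. The model id, frame stride, and video path below are placeholder choices.

```python
# Minimal CLIP video-search sketch: sample frames, embed them with CLIP,
# rank by similarity to a text query. Not this repo's actual implementation.
import cv2
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def sample_frames(path, stride=30):
    """Grab every `stride`-th frame as a PIL image."""
    cap, frames, i = cv2.VideoCapture(path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % stride == 0:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        i += 1
    cap.release()
    return frames

@torch.no_grad()
def search(path, query, top_k=3):
    frames = sample_frames(path)
    inputs = processor(text=[query], images=frames, return_tensors="pt", padding=True)
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    sims = torch.nn.functional.cosine_similarity(image_emb, text_emb)
    return sims.topk(min(top_k, len(frames)))  # scores and frame indices

print(search("video.mp4", "a dog catching a frisbee"))  # "video.mp4" is a placeholder
```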
Implementation of the paper "Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning", arXiv, 2025
Multimodal Bi-Transformers (MMBT) in Biomedical Text/Image Classification
Model Mondays is a weekly livestreamed series on Microsoft Reactor that helps you make informed model choices with timely updates and model deep-dives. Watch live for the content; join the Discord for the discussions.
NanoOWL Detection System enables real-time open-vocabulary object detection in ROS 2 using a TensorRT-optimized OWL-ViT model. Describe objects in natural language and detect them instantly in panoramic images. Optimized for NVIDIA GPUs via TensorRT .engine files.
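For context, plain OWL-ViT open-vocabulary detection via transformers looks like the sketch below, without the TensorRT and ROS 2 layers this repository adds on top. The image URL, query strings, and threshold are just examples.

```python
# Plain OWL-ViT open-vocabulary detection via transformers
# (no TensorRT/ROS 2 layer; that is what NanoOWL adds on top).
import requests
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
image = Image.open(requests.get(url, stream=True).raw)
queries = [["a photo of a cat", "a remote control"]]  # free-form text classes

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to boxes in pixel coordinates; the threshold is a tunable guess.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs=outputs, target_sizes=target_sizes, threshold=0.1)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{queries[0][label]}: {score:.2f} at {box.tolist()}")
```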
Leverage VideoLLaMA 3's capabilities using LitServe.
Leverage Gemma 3's capabilities using LitServe.
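Both entries follow the same LitServe pattern: wrap the model in a LitAPI subclass and serve it over HTTP. A minimal sketch is below, using a text-only Gemma 3 checkpoint via a transformers pipeline; the model id, request field, and port are assumptions, and the VideoLLaMA 3 variant would swap in a video-capable model and processor.

```python
# Minimal LitServe sketch: wrap an HF pipeline in a LitAPI and serve it.
# The model id ("google/gemma-3-1b-it"), request field ("prompt"), and port
# are assumptions; the real repos wire up their own models and schemas.
import litserve as ls
from transformers import pipeline

class ChatAPI(ls.LitAPI):
    def setup(self, device):
        # Load the model once per worker, on the device LitServe assigns.
        self.generator = pipeline(
            "text-generation", model="google/gemma-3-1b-it", device=device)

    def decode_request(self, request):
        # Pull the prompt out of the JSON body.
        return request["prompt"]

    def predict(self, prompt):
        out = self.generator(prompt, max_new_tokens=128)
        return out[0]["generated_text"]

    def encode_response(self, output):
        return {"response": output}

if __name__ == "__main__":
    server = ls.LitServer(ChatAPI(), accelerator="auto")
    server.run(port=8000)
```

Once running, POST JSON to LitServe's default route, e.g. `curl -X POST http://localhost:8000/predict -H "Content-Type: application/json" -d '{"prompt": "Hello"}'`.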
Phi-3-Vision model test - running locally
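A local Phi-3-Vision smoke test might look like the following, adapted from the transformers example on the model card; the image path and question are placeholders, and `trust_remote_code=True` is needed for this model's custom code.

```python
# Minimal local Phi-3-Vision test (adapted from the model card's
# transformers example; image path and question are placeholders).
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto", trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg")  # placeholder image
messages = [{"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True)

inputs = processor(prompt, [image], return_tensors="pt").to(model.device)
generate_ids = model.generate(
    **inputs, max_new_tokens=256, eos_token_id=processor.tokenizer.eos_token_id)
# Strip the prompt tokens before decoding the answer.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```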