[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA), built toward GPT-4V-level capabilities and beyond.
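As a taste of what running one of these models looks like, here is a minimal inference sketch using the community `llava-hf` checkpoint on Hugging Face transformers; the checkpoint name, demo image URL, and prompt format are assumptions drawn from the community conversion, not the official repo's own training/serving code:

```python
# Minimal LLaVA-1.5 inference sketch via Hugging Face transformers.
# Assumes the community "llava-hf/llava-1.5-7b-hf" checkpoint; the official
# repo ships its own training and serving scripts instead.
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# Example image URL; any RGB image works.
url = "https://llava-vl.github.io/static/images/view.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```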
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
The official repo of Qwen-VL (通义千问-VL), the chat & pretrained large vision-language model proposed by Alibaba Cloud.
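A hedged sketch of the Qwen-VL-Chat quickstart follows; the `chat()` and `from_list_format()` helpers live in the repo's `trust_remote_code` path, so treat the exact signatures as assumptions and defer to the official README:

```python
# Qwen-VL-Chat inference sketch. The chat() and from_list_format() helpers
# are defined by the repo's remote code, not by transformers itself.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Build a mixed image+text query in the format the repo defines.
query = tokenizer.from_list_format([
    {"image": "path/to/image.jpg"},  # local path or URL (illustrative)
    {"text": "What is in this picture?"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```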
DeepSeek-VL: Towards Real-World Vision-Language Understanding
🚀 Train a 26M-parameter vision-language model (VLM) from scratch in just 1 hour! 🌏
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
The official repo of MiniMax-Text-01 and MiniMax-VL-01, a large language model and a vision-language model built on linear attention.
The Cradle framework is a first attempt at General Computer Control (GCC). Cradle enables agents to master any computer task through strong reasoning, self-improvement, and skill curation in a standardized, general environment with minimal requirements.
The code used to train and run inference with the ColVision models, e.g. ColPali, ColQwen2, and ColSmol.
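For the ColVision line, a minimal retrieval-scoring sketch using the `colpali-engine` package is shown below; the checkpoint name, the `process_images`/`process_queries` helpers, and the `score_multi_vector` scorer follow the package's documented quickstart, but the exact signatures should be checked against the repo:

```python
# ColPali late-interaction retrieval sketch using colpali-engine.
# Checkpoint name and helper signatures are assumptions from the quickstart.
import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.2"
model = ColPali.from_pretrained(model_name, torch_dtype=torch.bfloat16).eval()
processor = ColPaliProcessor.from_pretrained(model_name)

# Placeholder page image and query; real usage embeds document page renders.
images = [Image.new("RGB", (448, 448), "white")]
queries = ["What is the revenue table on this page?"]

batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Late-interaction (MaxSim) scores between each query and each page.
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
print(scores)
```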
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
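A short MLX-VLM usage sketch follows; it requires Apple Silicon, the model name is just one example from the mlx-community Hub org, and the `load()`/`generate()` signatures are assumptions based on the package's quickstart (the API has changed across versions, so check the README):

```python
# Minimal mlx-vlm inference sketch (Apple Silicon + mlx-vlm required).
# Model name and load()/generate() signatures are assumptions from the
# package quickstart; consult the repo README for the current API.
from mlx_vlm import load, generate

model, processor = load("mlx-community/Qwen2-VL-2B-Instruct-4bit")
output = generate(
    model,
    processor,
    prompt="Describe this image.",
    image="path/to/image.jpg",  # illustrative local path
)
print(output)
```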
[CVPR 2025] Open-source, End-to-end, Vision-Language-Action model for GUI Agent & Computer Use.
The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".
Get clean data from tricky documents, powered by vision-language models ⚡
Implementation for Describe Anything: Detailed Localized Image and Video Captioning
A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
[CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
An open-source implementation for fine-tuning the Qwen2-VL and Qwen2.5-VL series by Alibaba Cloud.
Parsing-free RAG supported by VLMs