A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
[CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness
The official code of the paper "Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate".
[CVPR 2024] Situational Awareness Matters in 3D Vision Language Reasoning
Code for ACL 2023 Oral Paper: ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning
The official implementation for the ICCV 2023 paper "Grounded Image Text Matching with Mismatched Relation Reasoning".
Cross-aware Early Fusion with Stage-divided Vision and Language Transformer Encoders for Referring Image Segmentation (Published in IEEE TMM 2023)
Code for ECIR 2023 paper "Dialogue-to-Video Retrieval"
Explore the rich flavors of Indian desserts with TunedLlavaDelights. Utilizing LLaVA fine-tuning, our project unveils detailed nutritional profiles, taste notes, and optimal consumption times for beloved sweets. Dive into a fusion of AI innovation and culinary tradition.
Socratic models for multimodal reasoning & image captioning