[NeurIPS 2024] An official implementation of "ShareGPT4Video: Improving Video Understanding and Generation with Better Captions"
Updated Oct 9, 2024 - Python
[ICML2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation
[CVPR'24] HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
[ECCV 2024] ShareGPT4V: Improving Large Multi-modal Models with Better Captions
[NeurIPS 2024] This repo contains evaluation code for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models"
GeoPixel: A Pixel Grounding Large Multimodal Model for Remote Sensing is specifically developed for high-resolution remote sensing image analysis, offering advanced multi-target pixel grounding capabilities.
Talk2BEV: Language-Enhanced Bird's Eye View Maps (ICRA'24)
[ECCV 2024] API: Attention Prompting on Image for Large Vision-Language Models
[ACM Multimedia 2025] This is the official repo for Debiasing Large Visual Language Models, including a Post-Hoc debias method and Visual Debias Decoding strategy.
[CVPR 2025 🔥] EarthDial: Turning Multi-Sensory Earth Observations to Interactive Dialogues.
[ICML 2024] Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models.
A benchmark for evaluating the capabilities of large vision-language models (LVLMs)
This repository is the codebase of TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy
[ICLR 2025] Mitigating Modality Prior-Induced Hallucinations in Multimodal Large Language Models via Deciphering Attention Causality
MedM-VL is a modular, LLaVA-based codebase for medical LVLMs.
An official implementation of "CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning"
PyTorch Implementation of the Paper 'AnyAnomaly': Official Version
🚀 Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models
[NeurIPS 2024] Official Repository of Multi-Object Hallucination in Vision-Language Models
The official implementation of "Learning Compact Vision Tokens for Efficient Large Multimodal Models"