Here are 34 public repositories matching this topic.
Open-source, industrial-grade ASR models supporting Mandarin, Chinese dialects, and English. They achieve a new SOTA on public Mandarin ASR benchmarks and also offer strong singing-lyrics recognition.
Updated Sep 22, 2025 · Python
Official implementation of paper "MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens"
Updated May 8, 2025 · Python
Research Code for Multimodal-Cognition Team in Ant Group
Updated Oct 14, 2025 · Python
[ICCV25 Oral] Token Activation Map to Visually Explain Multimodal LLMs
Updated Aug 8, 2025 · Python
Official repository for InfiGUI-G1. We introduce Adaptive Exploration Policy Optimization (AEPO) to overcome semantic alignment bottlenecks in GUI agents through efficient, guided exploration.
Updated Nov 19, 2025 · Python
[IROS'25 Oral & NeurIPSw'24] Official implementation of "MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control"
Updated Jun 16, 2025 · Python
[ECCV 2024] Official PyTorch Implementation of "How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs"
Updated Nov 28, 2023 · Python
The code repository for "Wings: Learning Multimodal LLMs without Text-only Forgetting" [NeurIPS 2024]
Updated Dec 28, 2024 · Python
Official repository of the paper: Can ChatGPT Detect DeepFakes? A Study of Using Multimodal Large Language Models for Media Forensics
[NAACL 2025 Findings] Mitigating Hallucinations in Large Vision-Language Models via Summary-Guided Decoding
Updated Jun 20, 2025 · Python
[ACL 2024] Dataset and Code of "ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value Extraction"
Updated Jun 10, 2024 · Jupyter Notebook
Official implementation of the paper "Efficient Test-Time Scaling for Small Vision-Language Models": test-time scaling via test-time augmentation.
Updated Nov 17, 2025 · Python
Medical Report Generation and VQA (Adapting XrayGPT to Any Modality)
Updated Jun 28, 2025 · Python
Streamlit app to chat with images using multimodal LLMs; a minimal sketch of this pattern appears after the list.
Updated Mar 17, 2024 · Python
Official implementation of ICML 2025 paper "Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach"
Updated May 27, 2025 · Python
Q-HEART: ECG Question Answering via Knowledge-Informed Multimodal LLMs (ECAI 2025)
Updated Aug 22, 2025 · Python
SpatialFusion-LM is a real-time spatial reasoning framework that combines neural depth, 3D reconstruction, and language-driven scene understanding.
Updated Nov 19, 2025 · Python
LLaVA base model for use with Autodistill.
Updated Jan 24, 2024 · Python
Kani extension for supporting vision-language models (VLMs). Comes with model-agnostic support for GPT-Vision and LLaVA.
Updated Jul 2, 2025 · Python
A minimal, hackable Vision-Language Model built on Karpathy’s nanochat — add image understanding and multimodal chat for under $200 in compute.
Updated Nov 19, 2025 · Python
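For readers new to the pattern behind the Streamlit image-chat entry above, here is a minimal sketch of how such an app is typically wired together. It is illustrative only, not code from that repository: the model name ("gpt-4o"), the OpenAI client setup, and the single-turn flow are assumptions, and any OpenAI-compatible vision model would slot in the same way.

```python
# Minimal sketch of a Streamlit image-chat app (illustrative; not the repo's code).
# Assumption: an OpenAI-compatible vision model and OPENAI_API_KEY in the environment.
import base64

import streamlit as st
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

st.title("Chat with an image")

uploaded = st.file_uploader("Upload an image", type=["png", "jpg", "jpeg"])
question = st.chat_input("Ask something about the image")

if uploaded and question:
    # Encode the uploaded image as a base64 data URL so it can be sent inline.
    b64 = base64.b64encode(uploaded.getvalue()).decode("utf-8")
    data_url = f"data:{uploaded.type};base64,{b64}"

    with st.chat_message("user"):
        st.image(uploaded)
        st.write(question)

    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any vision-capable chat model works here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    )

    with st.chat_message("assistant"):
        st.write(response.choices[0].message.content)
```

A real app would also keep the conversation in st.session_state and replay it on each rerun; this sketch handles a single turn to keep the request/response flow visible.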