Mobile-Agent: The Powerful Mobile Device Operation Assistant Family
StarVector is a foundation model for SVG generation that transforms vectorization into a code generation task. Using a vision-language modeling architecture, StarVector processes both visual and textual inputs to produce high-quality SVG code with remarkable precision.
ModelScope-Agent: An agent framework connecting models in ModelScope with the world
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
A novel Multimodal Large Language Model (MLLM) architecture designed to structurally align visual and textual embeddings.
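A minimal sketch of the visual-textual alignment pattern this entry describes: vision-encoder features are projected into the LLM's embedding space with a small learned connector so the model attends over one aligned sequence. All class names and dimensions below are illustrative assumptions, not this repository's code.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Maps vision-encoder features into the LLM's text embedding space."""
    def __init__(self, vision_dim: int = 1024, text_dim: int = 4096):
        super().__init__()
        # Two-layer MLP connector, a common choice in MLLMs (LLaVA-style).
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_dim)
        return self.proj(vision_features)

# Usage: project patch features, then prepend them to the text embeddings
# so the LLM processes a single aligned token sequence.
vision_features = torch.randn(1, 256, 1024)   # dummy vision-encoder output
text_embeds = torch.randn(1, 32, 4096)        # dummy text token embeddings
aligned = VisualProjector()(vision_features)
fused = torch.cat([aligned, text_embeds], dim=1)  # (1, 288, 4096)
```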
Real-time voice-interactive digital human, supporting both an end-to-end voice pipeline (GLM-4-Voice - THG) and a cascaded pipeline (ASR-LLM-TTS-THG). Appearance and voice are customizable without training; voice cloning is supported, with first-packet latency as low as 3s.
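For the cascaded (ASR-LLM-TTS) path, the overall shape is three models chained with streaming between stages; feeding LLM output into TTS fragment by fragment is what keeps first-packet latency low. The stage functions in this sketch are hypothetical stand-ins, not the project's API:

```python
from typing import Iterator

def asr(audio_chunk: bytes) -> str:
    """Stand-in speech-to-text stage."""
    return "transcribed user utterance"

def llm(prompt: str) -> Iterator[str]:
    """Stand-in language model; yields the reply incrementally so the
    first TTS packet can be emitted before the full reply is ready."""
    yield from "a streamed reply".split()

def tts(text: str) -> bytes:
    """Stand-in text-to-speech stage."""
    return text.encode()

def respond(audio_chunk: bytes) -> Iterator[bytes]:
    # Stream each LLM fragment straight into TTS instead of waiting
    # for the complete reply: this bounds the first-packet delay.
    transcript = asr(audio_chunk)
    for fragment in llm(transcript):
        yield tts(fragment)

for packet in respond(b"\x00\x01"):
    print(packet)
```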
Speech, Language, Audio, Music Processing with Large Language Model
LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills
Large-Scale Visual Representation Model
✨✨Woodpecker: Hallucination Correction for Multimodal Large Language Models
[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
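As a rough intuition for dense-to-sparse memory consolidation, a toy version repeatedly merges the most similar pair of adjacent frame features until the memory fits a fixed budget. This simplified sketch is an illustration of the general idea, not the paper's exact algorithm:

```python
import torch
import torch.nn.functional as F

def consolidate(frames: torch.Tensor, budget: int) -> torch.Tensor:
    """Compress (num_frames, dim) dense features into `budget` sparse slots."""
    feats = list(frames)  # list of per-frame (dim,) feature vectors
    while len(feats) > budget:
        # Similarity of each adjacent pair; the most similar pair is
        # the most redundant and gets merged first.
        sims = torch.stack([
            F.cosine_similarity(feats[i], feats[i + 1], dim=0)
            for i in range(len(feats) - 1)
        ])
        i = int(sims.argmax())
        feats[i : i + 2] = [(feats[i] + feats[i + 1]) / 2]  # merge by averaging
    return torch.stack(feats)

memory = consolidate(torch.randn(64, 768), budget=8)
print(memory.shape)  # torch.Size([8, 768])
```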
[NeurIPS 2024] A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
PyTorch implementation of Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities.
LLaVA-Mini is a unified large multimodal model (LMM) that efficiently supports understanding of images, high-resolution images, and videos.
Official code of "EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model"
Liquid: Language Models are Scalable and Unified Multi-modal Generators