Mobile-Agent: The Powerful Mobile Device Operation Assistant Family
StarVector is a foundation model for SVG generation that transforms vectorization into a code generation task. Using a vision-language modeling architecture, StarVector processes both visual and textual inputs to produce high-quality SVG code with remarkable precision.
ModelScope-Agent: An agent framework connecting models in ModelScope with the world
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
A novel Multimodal Large Language Model (MLLM) architecture designed to structurally align visual and textual embeddings.
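A minimal sketch of the visual-textual alignment pattern this entry describes: vision-encoder features are projected into the LLM's embedding space with a small learned connector so the model attends over one aligned sequence. All class names and dimensions below are illustrative assumptions, not this repository's code.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Maps vision-encoder features into the LLM's text embedding space."""
    def __init__(self, vision_dim: int = 1024, text_dim: int = 4096):
        super().__init__()
        # Two-layer MLP connector, a common choice in MLLMs (LLaVA-style).
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_dim)
        return self.proj(vision_features)

# Usage: project patch features, then prepend them to the text embeddings
# so the LLM processes a single aligned token sequence.
vision_features = torch.randn(1, 256, 1024)   # dummy vision-encoder output
text_embeds = torch.randn(1, 32, 4096)        # dummy text token embeddings
aligned = VisualProjector()(vision_features)
fused = torch.cat([aligned, text_embeds], dim=1)  # (1, 288, 4096)
```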
Real-time voice-interactive digital human, supporting both an end-to-end voice pipeline (GLM-4-Voice - THG) and a cascaded pipeline (ASR-LLM-TTS-THG). Appearance and voice are customizable without training; voice cloning is supported, with first-packet latency as low as 3s.
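For the cascaded (ASR-LLM-TTS) path, the overall shape is three models chained with streaming between stages; feeding LLM output into TTS fragment by fragment is what keeps first-packet latency low. The stage functions in this sketch are hypothetical stand-ins, not the project's API:

```python
from typing import Iterator

def asr(audio_chunk: bytes) -> str:
    """Stand-in speech-to-text stage."""
    return "transcribed user utterance"

def llm(prompt: str) -> Iterator[str]:
    """Stand-in language model; yields the reply incrementally so the
    first TTS packet can be emitted before the full reply is ready."""
    yield from "a streamed reply".split()

def tts(text: str) -> bytes:
    """Stand-in text-to-speech stage."""
    return text.encode()

def respond(audio_chunk: bytes) -> Iterator[bytes]:
    # Stream each LLM fragment straight into TTS instead of waiting
    # for the complete reply: this bounds the first-packet delay.
    transcript = asr(audio_chunk)
    for fragment in llm(transcript):
        yield tts(fragment)

for packet in respond(b"\x00\x01"):
    print(packet)
```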
Speech, Language, Audio, Music Processing with Large Language Model
LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills
Large-Scale Visual Representation Model
✨✨Woodpecker: Hallucination Correction for Multimodal Large Language Models
[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
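As a rough intuition for dense-to-sparse memory consolidation, a toy version repeatedly merges the most similar pair of adjacent frame features until the memory fits a fixed budget. This simplified sketch is an illustration of the general idea, not the paper's exact algorithm:

```python
import torch
import torch.nn.functional as F

def consolidate(frames: torch.Tensor, budget: int) -> torch.Tensor:
    """Compress (num_frames, dim) dense features into `budget` sparse slots."""
    feats = list(frames)  # list of per-frame (dim,) feature vectors
    while len(feats) > budget:
        # Similarity of each adjacent pair; the most similar pair is
        # the most redundant and gets merged first.
        sims = torch.stack([
            F.cosine_similarity(feats[i], feats[i + 1], dim=0)
            for i in range(len(feats) - 1)
        ])
        i = int(sims.argmax())
        feats[i : i + 2] = [(feats[i] + feats[i + 1]) / 2]  # merge by averaging
    return torch.stack(feats)

memory = consolidate(torch.randn(64, 768), budget=8)
print(memory.shape)  # torch.Size([8, 768])
```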
[NeurIPS 2024] A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
PyTorch implementation of Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities.
LLaVA-Mini is a unified large multimodal model (LMM) that efficiently supports understanding of images, high-resolution images, and videos.
Official code of "EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model"
Liquid: Language Models are Scalable and Unified Multi-modal Generators