Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️
Updated Oct 1, 2024 · Python
[ICCV 2023] RLIPv2: Fast Scaling of Relational Language-Image Pre-training
[ICRA 2024] Language-Conditioned Affordance-Pose Detection in 3D Point Clouds
MTA: A Lightweight Multilingual Text Alignment Model for Cross-language Visual Word Sense Disambiguation
Hands-on with some multimodal models