## Description

Multimodal has been removed since #5882.

## Current llama.cpp multimodal roadmap

(updated 9th April 2025)
- `mtmd` (MulTi-MoDal) library (top prio 🔥)
  - Implement `libmtmd`: llava : introduce libmtmd #12849
  - Support more models via `libmtmd` (top prio 🔥): mtmd : merge llava, gemma3 and minicpmv CLI into single `llama-mtmd-cli` #13012
  - Support M-RoPE models via `libmtmd` (Qwen2VL, Qwen2.5VL): mtmd : add qwen2vl and qwen2.5vl #13141
  - Support audio input
  - Use smart pointer in `clip.cpp` to avoid mem leaks: clip : use smart pointer (⚠️ breaking change) #12869
  - Add wrapper for `stb_image` to avoid polluting the project with the big header file --> probably not needed, since we already have `libmtmd` acting as a wrapper for stb_image
  - Unify conversion scripts --> best-case scenario: a single `convert_hf_to_gguf.py` that can output both text + vision GGUF files --> introduced in convert : experimental support for `--mmproj` flag #13023
  - Remove BOI / EOI token embeddings from `clip.cpp` (used by glm-edge): clip : remove boi/eoi embeddings for GLM-edge model (⚠️ breaking change) #13081
  - Refactor documentation (find a way to reduce the number of README files): llava : update documentations #13055
- Implement `libmtmd` in server API and server web UI (top prio 🔥)
  - Publish first proposal: server : vision support via libmtmd #12898
  - User can upload an image from the UI (+ drag-and-drop)
  - Nice-to-have: better KV caching strategy (TBD)
  - Nice-to-have: allow loading remote images (may come with security risks)
- Update the security policy; make it clear that bugs related to 3rd-party libs (like `stb_image`) should be reported upstream, not in llama.cpp
- Unify all vision CLIs (like `minicpmv-cli`, `gemma3-cli`, etc.) into a single CLI
- Add deprecation notices for `llava.h` (we will remove libllava) and `clip.h` (clip is now internal-only)
- Experimental support for audio input: (wip) support ultravox audio input #12745
- (far in the future) Implement a `llama_multimodal` API that supports image, audio, and more!
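For the server API item, a multimodal chat request in the common OpenAI-compatible style could look like the sketch below. The exact schema is whatever the proposal in #12898 settles on, so the field names and model name here are assumptions for illustration only:

```json
{
  "model": "some-vision-model",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "What is in this picture?" },
        {
          "type": "image_url",
          "image_url": { "url": "data:image/jpeg;base64,..." }
        }
      ]
    }
  ]
}
```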