Support for Yi-VL and a templating addon/fix for mobileVLM #5093
Demo of Yi-VL-6B is here: #5092
Sadly, in my tests it hallucinated strongly, yet its image detection is still SOTA for llava-based multimodals.
What does "image detection" mean in this context?
My English writing degrades as the day goes on; I think that was written in the early morning after a 16-hour day ;)
There is also a chance that the current implementation in Is it just the license photo that you are testing with? That seems like a very small sample for judging overall performance.
I agree there may be an implementation problem, though it works pretty well for that and I didn't find anything obvious. I think the PR is OK to merge even if an issue remains: it adds support for layer-norm llava models and the templating. From there we can fix or add further details. Below is the cats reference example. The output looks good to me, but it is slightly different, which could also be attributed to the image algorithm used in llama.cpp compared to the higher-quality one in Python. Reference:
Q5K + fp16:
Q3K + Q6K:
With higher temperature I also had this:
For comparison, this is ShareGPT4V-13B:
It has one hallucination regarding the 3rd cat also eating. (What we really need is CogVLM support, but that needs a custom llama architecture, since it separates the visual token attention.)
I uploaded a couple of GGUF variants for both Yi-VL models on HF: https://huggingface.co/cmp-nct
Thanks for these! Just wondering, are you planning to release 34B with higher quants?
I'm uploading a Q5K one. You'll need CPU offloading, dual GPUs, or a GPU with more than 24 GB of VRAM for the bigger ones.
https://huggingface.co/01-ai/Yi-VL-34B/discussions/10#65b337ea321c51cd17d06135 It seems that a different evaluation dataset may give a quite different evaluation result...
Any info on how well the 34B version performs on descriptions of images (non-licenses)?
I recommend trying it out. One more issue I'm having with it is that it breaks instruction tuning. Overall, I recommend testing it for your use case.
Your picture with CogVLM - 17b - temp 0.1
Thanks so much for the higher quant!
Yes, both of my uploaded mmproj projectors are from the respective correct ViT. I currently use a hacked-together binary, nothing I'd want to PR here. The quantization functions are in clip.cpp. If you need something other than Q6K, let me know and I can upload it.
I noticed that a lot of Llava-based models provide the mmproj only in f16. I wondered why. If it's not too much trouble, I'd love to try the model in Q4_K_M with the mmproj in f16. Thanks!
I'm uploading both; the Q4K will take a while.
Is there any way to stop the model from outputting things like "Human:"? It continues to chat by itself until it reaches end of token. It is hallucinating extremely much, ShareGPT-13B is waaay better.
Thanks @cmp-nct for q4 model and f16 mmproj!
I've not used the server example, I only made it compatible for You definitely need to add this stopword support to the server example if nothing like that is already available (it's probably just one line at the right place to .find('###')).
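As a rough illustration of the stopword idea mentioned above, the following is a minimal sketch that cuts generated text at the first occurrence of a stop string such as "###". The function name `truncate_at_stop` is hypothetical and is not part of the server example or of llama.cpp; where such a check would hook into the server's generation loop is not shown here.

```cpp
#include <string>

// Hypothetical helper (not llama.cpp API): return the generated text up to,
// but not including, the first occurrence of the stop string.
// If the stop string is empty or never appears, the text is returned unchanged.
static std::string truncate_at_stop(const std::string & text, const std::string & stop) {
    if (stop.empty()) {
        return text;
    }
    const auto pos = text.find(stop);
    if (pos == std::string::npos) {
        return text;
    }
    return text.substr(0, pos);
}
```

With the Yi-VL template, calling this with "###" as the stop string would drop a self-generated "### Human:" turn and everything after it. A streaming implementation would additionally have to hold back a partial match at the end of the buffer, which this sketch does not attempt.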
Yeah, I think the server uses the Bot Name as a stopword, but for some reason the model keeps printing out things that come out as empty strings. The main reason I like using the server is that llava-cli -i for interactive mode doesn't work for some reason. It just exits after one completion, so you can't have a multi-turn chat. Also, the server keeps the model loaded, so you can quickly load different images. Lastly, it has a nice API over HTTP. Could you look at it if you have a chance? I'd appreciate it.
* Support for Yi-VL, templating fix for mobileVLM
* ws
* Update examples/llava/clip.cpp
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update llava-cli.cpp
* Update clip.cpp
  bugfix for new conversions
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
mobileVLM support was recently added; the README says the following:
./llava-cli -m MobileVLM-1.7B/ggml-model-q4_k.gguf \
    --mmproj MobileVLM-1.7B/mmproj-model-f16.gguf \
    --image path/to/an/image.jpg \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWho is the author of this book? Answer the question using a single word or phrase. ASSISTANT:"
@XiaotaoChen This can't work: the prompt in llava-cli is used as the user-question prompt, not as the full template.
So if you use it like that on master, you get a Vicuna system prompt, then the image embedding, then your entire template prompt, followed by a doubled "ASSISTANT:ASSISTANT:".
With this PR it should work, though I've not tested mobileVLM yet.
What this does is look for <image> in the prompt. If it is present, it splits the prompt into a system prompt and a user prompt and inserts the image embeddings between them.
For Yi-VL-6B, example:
.\bin\Debug\llava-cli.exe -m Q:\models\llava\Yi-VL-6B\ggml-model-f16.gguf --mmproj Q:\models\llava\Yi-VL-6B\vit\mmproj-model-f16.gguf --image C:\temp\license_demo.jpg -p "This is a chat between an inquisitive human and an AI assistant. Assume the role of the AI assistant. Read all the images carefully, and respond to the human's questions with informative, helpful, detailed and polite answers. 这是一个好奇的人类和一个人工智能助手之间的对话。假设你扮演这个AI助手的角 色。仔细阅读所有的图像,并对人类的问题做出信息丰富、有帮助、详细的和礼貌的回答。 \n\n### Human: <image>\nProvide a complete representation of what is in this image. Respond in JSON-pretty-print syntax for database insert.\n### Assistant:" -ngl 50 --temp 0 -n 500 -c 2048 -e
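The prompt split described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: the helper name `split_prompt_at_image` is hypothetical, and the empty system part returned when no marker is found is only a signal for the caller, which would then presumably fall back to the default template.

```cpp
#include <string>
#include <utility>

// Hypothetical helper (not the actual llava-cli code): split a prompt at the
// first "<image>" marker. Returns {system part, user part}; the marker itself
// is dropped, since the image embeddings take its place.
// If the marker is absent, the system part is empty and the whole prompt is
// treated as the user question.
static std::pair<std::string, std::string> split_prompt_at_image(const std::string & prompt) {
    static const std::string marker = "<image>";
    const auto pos = prompt.find(marker);
    if (pos == std::string::npos) {
        return {"", prompt};
    }
    return {prompt.substr(0, pos), prompt.substr(pos + marker.size())};
}
```

Applied to the command above, everything up to and including "### Human: " would form the system part, and the text after <image>, ending in "### Assistant:", would form the user part that follows the image embeddings.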
"Provide a complete representation of what is in this image. Respond in JSON-pretty-print syntax for database insert." is the Question
Yi-VL support:
Yi-VL uses a layer-norm in addition to the projector, and it takes a larger 448x448 input image with a "huge" ViT (twice the size of llava-1.5).
Sadly, in my tests it hallucinated strongly, yet its image VQA is still SOTA for llava-based multimodals. A strange combination.
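For readers unfamiliar with the layer-norm step mentioned above, here is a generic sketch of layer normalization over a single embedding vector. It is the textbook operation, not the clip.cpp implementation: the function name, the `eps` default, and the assumption that `gamma` and `beta` have the same length as `x` are all illustrative, and where exactly Yi-VL applies the norm relative to the projector is not stated here.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Generic layer normalization (illustrative, not the clip.cpp code):
//   y[i] = (x[i] - mean) / sqrt(var + eps) * gamma[i] + beta[i]
// Assumes gamma and beta have the same length as x.
static std::vector<float> layer_norm(const std::vector<float> & x,
                                     const std::vector<float> & gamma,
                                     const std::vector<float> & beta,
                                     float eps = 1e-5f) {
    const std::size_t n = x.size();
    std::vector<float> y(n);
    if (n == 0) {
        return y;
    }

    // Mean over the embedding dimension.
    float mean = 0.0f;
    for (float v : x) {
        mean += v;
    }
    mean /= static_cast<float>(n);

    // Biased variance over the embedding dimension.
    float var = 0.0f;
    for (float v : x) {
        var += (v - mean) * (v - mean);
    }
    var /= static_cast<float>(n);

    // Normalize, then apply the learned scale (gamma) and shift (beta).
    const float inv_std = 1.0f / std::sqrt(var + eps);
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = (x[i] - mean) * inv_std * gamma[i] + beta[i];
    }
    return y;
}
```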
I added: