Name and Version
.\build\bin\Release\llama-cli.exe --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
version: 4329 (89d604f)
built with MSVC 19.41.34120.0 for x64
Operating systems
Windows
GGML backends
Vulkan
Hardware
Ryzen 5900X + RX 5700 XT
Models
Qwen2-VL-7B-Instruct-IQ4_NL + mmproj-Qwen2-VL-7B-Instruct-f32
Problem description & steps to reproduce
When I run it on the Vulkan build, the description given by the model has nothing to do with the image passed as an argument (no matter the -ngl value; even -ngl 0 is broken). The exact same setup works perfectly fine on the CPU backend.
I know the Vulkan backend doesn't support Qwen2-VL yet, but according to #10361 (comment), this should only cause slowdowns, not invalid outputs.
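For anyone who wants to repeat the comparison, here is a minimal sketch that rebuilds the commands from this report and runs them against both builds. The helper names (`build_command`, `run`) are hypothetical, and it assumes the generated description is written to stdout while the log lines go to stderr, which I have not verified. Only flags that appear in the commands of this report are used.

```python
import subprocess

# Paths, prompt and flags are copied from the commands in this report.
MODEL = r".\models\Qwen2-VL-7B-Instruct-IQ4_NL.gguf"
MMPROJ = r".\models\mmproj-Qwen2-VL-7B-Instruct-f32.gguf"
IMAGE = r".\Pictures\Untitled.png"
PROMPT = "What could be the context of this image."

VULKAN_EXE = r".\build\bin\Release\llama-qwen2vl-cli.exe"
CPU_EXE = r".\buildcpu\bin\Release\llama-qwen2vl-cli.exe"


def build_command(exe, ngl=None):
    """Return the argument list for one run; -ngl is added only when given."""
    cmd = [
        exe,
        "-m", MODEL,
        "--mmproj", MMPROJ,
        "-p", PROMPT,
        "--image", IMAGE,
        "--seed", "0",
        "--temp", "0",
    ]
    if ngl is not None:
        cmd += ["-ngl", str(ngl)]
    return cmd


def run(exe, ngl=None):
    """Run one configuration and return whatever it printed to stdout.

    Assumption: the model's answer is on stdout and the logs are on stderr.
    """
    result = subprocess.run(build_command(exe, ngl), capture_output=True, text=True)
    return result.stdout.strip()
```

Calling `run(CPU_EXE)`, `run(VULKAN_EXE)` and `run(VULKAN_EXE, ngl=99)` and comparing the returned strings should show the same mismatch as the logs in this report, since `--seed 0 --temp 0` is used for every run.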
Relevant log output
Image input:
-ngl 0
> .\build\bin\Release\llama-qwen2vl-cli.exe -m .\models\Qwen2-VL-7B-Instruct-IQ4_NL.gguf --mmproj .\models\mmproj-Qwen2-VL-7B-Instruct-f32.gguf -p 'What could be the context of this image.' --image '.\Pictures\Untitled.png' --seed 0 --temp 0
[...]
encode_image_with_clip: step 1 of 1 encoded in 843.10 ms
encode_image_with_clip: all 1 segments encoded in 843.17 ms
encode_image_with_clip: load_image_size 512 512
encode_image_with_clip: image embedding created: 361 tokens
encode_image_with_clip: image encoded in 845.06 ms by CLIP ( 2.34 ms per image patch)
The image shows a person wearing a black and white striped shirt, a black jacket, and black pants, standing in front of a black background. The person is also holding a black and white striped umbrella. The context of this image could be a fashion or clothing advertisement, showcasing the person's outfit and accessories. The black and white striped shirt, jacket, and umbrella create a monochromatic look, which is often used in fashion photography to emphasize the clothing and accessories. The black background helps to highlight the person and their outfit, making them the focal point of the image.
llama_perf_context_print: load time = 6644.91 ms
llama_perf_context_print: prompt eval time = 2276.84 ms / 391 tokens ( 5.82 ms per token, 171.73 tokens per second)
llama_perf_context_print: eval time = 11500.85 ms / 115 runs ( 100.01 ms per token, 10.00 tokens per second)
llama_perf_context_print: total time = 18275.28 ms / 506 tokens
-ngl 99
> .\build\bin\Release\llama-qwen2vl-cli.exe -m .\models\Qwen2-VL-7B-Instruct-IQ4_NL.gguf --mmproj .\models\mmproj-Qwen2-VL-7B-Instruct-f32.gguf -p 'What could be the context of this image.' --image '.\Pictures\Untitled.png' --seed 0 --temp 0 -ngl 99
[...]
encode_image_with_clip: step 1 of 1 encoded in 3248.68 ms
encode_image_with_clip: all 1 segments encoded in 3248.76 ms
encode_image_with_clip: load_image_size 512 512
encode_image_with_clip: image embedding created: 361 tokens
encode_image_with_clip: image encoded in 3249.79 ms by CLIP ( 9.00 ms per image patch)
The image appears to be a logo or a symbol, but it is not clear what it represents. It could be a brand logo, a company logo, or a symbol for a specific organization or group. Without additional context or information, it is difficult to determine the exact meaning or purpose of the image.
llama_perf_context_print: load time = 9346.17 ms
llama_perf_context_print: prompt eval time = 1009.47 ms / 391 tokens ( 2.58 ms per token, 387.33 tokens per second)
llama_perf_context_print: eval time = 1500.12 ms / 61 runs ( 24.59 ms per token, 40.66 tokens per second)
llama_perf_context_print: total time = 10889.94 ms / 452 tokens
CPU backend for comparison
> .\buildcpu\bin\Release\llama-qwen2vl-cli.exe -m .\models\Qwen2-VL-7B-Instruct-IQ4_NL.gguf --mmproj .\models\mmproj-Qwen2-VL-7B-Instruct-f32.gguf -p 'What could be the context of this image.' --image '.\Pictures\Untitled.png' --seed 0 --temp 0
[...]
encode_image_with_clip: step 1 of 1 encoded in 8483.38 ms
encode_image_with_clip: all 1 segments encoded in 8483.47 ms
encode_image_with_clip: load_image_size 512 512
encode_image_with_clip: image embedding created: 361 tokens
encode_image_with_clip: image encoded in 8484.85 ms by CLIP ( 23.50 ms per image patch)
The image appears to be a simple text-based graphic with the words "READABLE TEXT" written in a bold, black font. The context of this image could be related to demonstrating or emphasizing the importance of clear and legible text, possibly in the context of design, typography, or user interface (UI) design. It might be used to highlight the importance of making text easy to read and understand for users.
llama_perf_context_print: load time = 21741.16 ms
llama_perf_context_print: prompt eval time = 10924.92 ms / 391 tokens ( 27.94 ms per token, 35.79 tokens per second)
llama_perf_context_print: eval time = 8322.39 ms / 83 runs ( 100.27 ms per token, 9.97 tokens per second)
llama_perf_context_print: total time = 30185.33 ms / 474 tokens
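As a sanity check on the three logs, the sketch below recomputes the "ms per image patch" figure from the total CLIP encode time. It assumes that figure is simply the encode time divided by the 361 embedding tokens reported in each log; the function name `per_patch_ms` is made up for this example. All numbers are copied from the logs in this report.

```python
# Figures copied from the encode_image_with_clip lines in the three logs.
EMBEDDING_TOKENS = 361  # "image embedding created: 361 tokens"

# run name -> (total encode time in ms, logged ms per image patch)
RUNS = {
    "Vulkan, -ngl 0": (845.06, 2.34),
    "Vulkan, -ngl 99": (3249.79, 9.00),
    "CPU": (8484.85, 23.50),
}


def per_patch_ms(total_ms, patches=EMBEDDING_TOKENS):
    """Encode time per patch, rounded to two decimals like the log output."""
    return round(total_ms / patches, 2)


for name, (total_ms, logged) in RUNS.items():
    computed = per_patch_ms(total_ms)
    print(f"{name}: computed {computed:.2f} ms, logged {logged:.2f} ms")
```

The recomputed values agree with the logged ones for all three runs, so the image is split into the same number of patches everywhere and only the contents of the embedding (or what is done with it afterwards) can differ between the backends.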