
Eval bug: Qwen2-VL Hallucinates image content on Vulkan backend #10843

Closed
@stduhpf

Description


Name and Version

.\build\bin\Release\llama-cli.exe --version

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
version: 4329 (89d604f)
built with MSVC 19.41.34120.0 for x64

Operating systems

Windows

GGML backends

Vulkan

Hardware

Ryzen 5900X + RX 5700 XT

Models

Qwen2-VL-7B-Instruct-IQ4_NL + mmproj-Qwen2-VL-7B-Instruct-f32

Problem description & steps to reproduce

When I run it with the Vulkan build, the description given by the model has nothing to do with the image passed as an argument (no matter the -ngl value; even -ngl 0 is broken). The exact same setup works perfectly fine with the CPU backend.

I know the Vulkan backend doesn't support Qwen2-VL yet, but according to #10361 (comment), this should only cause slowdowns, not invalid outputs.
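One way to narrow this down would be to compare the raw image embedding produced by the CLIP/mmproj encoder on each backend. This is only a rough sketch: the dump files cpu_embd.bin and vulkan_embd.bin are hypothetical and would require temporarily adding an fwrite of the embedding buffer in clip.cpp for each run; the comparison program below is generic and not part of llama.cpp.

// compare_embd.cpp - report how far apart two raw float32 embedding dumps are
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <vector>

static std::vector<float> load_f32(const char * path) {
    FILE * f = std::fopen(path, "rb");
    if (!f) { std::fprintf(stderr, "cannot open %s\n", path); std::exit(1); }
    std::fseek(f, 0, SEEK_END);
    const long n = std::ftell(f) / (long) sizeof(float);
    std::fseek(f, 0, SEEK_SET);
    std::vector<float> v((size_t) n);
    if (std::fread(v.data(), sizeof(float), v.size(), f) != v.size()) {
        std::fprintf(stderr, "short read on %s\n", path);
        std::exit(1);
    }
    std::fclose(f);
    return v;
}

int main(int argc, char ** argv) {
    if (argc != 3) {
        std::fprintf(stderr, "usage: %s cpu_embd.bin vulkan_embd.bin\n", argv[0]);
        return 1;
    }
    const std::vector<float> a = load_f32(argv[1]);
    const std::vector<float> b = load_f32(argv[2]);
    if (a.size() != b.size()) {
        std::fprintf(stderr, "size mismatch: %zu vs %zu\n", a.size(), b.size());
        return 1;
    }
    double max_diff = 0.0, sum_diff = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        const double d = std::fabs((double) a[i] - (double) b[i]);
        if (d > max_diff) max_diff = d;
        sum_diff += d;
    }
    std::printf("n = %zu, max |diff| = %g, mean |diff| = %g\n", a.size(), max_diff, sum_diff / (double) a.size());
    return 0;
}

If the two dumps differ by far more than fp16 rounding noise, that would point at the image encoder path rather than the language model, which would also be consistent with -ngl 0 being broken.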

Relevant log output

Image input:

[attached image: Untitled.png, 512x512]

-ngl 0

> .\build\bin\Release\llama-qwen2vl-cli.exe -m .\models\Qwen2-VL-7B-Instruct-IQ4_NL.gguf --mmproj .\models\mmproj-Qwen2-VL-7B-Instruct-f32.gguf -p 'What could be the context of this image.' --image '.\Pictures\Untitled.png' --seed 0 --temp 0
[...]
encode_image_with_clip: step 1 of 1 encoded in   843.10 ms
encode_image_with_clip: all 1 segments encoded in   843.17 ms
encode_image_with_clip: load_image_size 512 512
encode_image_with_clip: image embedding created: 361 tokens

encode_image_with_clip: image encoded in   845.06 ms by CLIP (    2.34 ms per image patch)

The image shows a person wearing a black and white striped shirt, a black jacket, and black pants, standing in front of a black background. The person is also holding a black and white striped umbrella. The context of this image could be a fashion or clothing advertisement, showcasing the person's outfit and accessories. The black and white striped shirt, jacket, and umbrella create a monochromatic look, which is often used in fashion photography to emphasize the clothing and accessories. The black background helps to highlight the person and their outfit, making them the focal point of the image.
llama_perf_context_print:        load time =    6644.91 ms
llama_perf_context_print: prompt eval time =    2276.84 ms /   391 tokens (    5.82 ms per token,   171.73 tokens per second)
llama_perf_context_print:        eval time =   11500.85 ms /   115 runs   (  100.01 ms per token,    10.00 tokens per second)
llama_perf_context_print:       total time =   18275.28 ms /   506 tokens

-ngl 99

> .\build\bin\Release\llama-qwen2vl-cli.exe -m .\models\Qwen2-VL-7B-Instruct-IQ4_NL.gguf --mmproj .\models\mmproj-Qwen2-VL-7B-Instruct-f32.gguf -p 'What could be the context of this image.' --image '.\Pictures\Untitled.png' --seed 0 --temp 0 -ngl 99
[...]
encode_image_with_clip: step 1 of 1 encoded in  3248.68 ms
encode_image_with_clip: all 1 segments encoded in  3248.76 ms
encode_image_with_clip: load_image_size 512 512
encode_image_with_clip: image embedding created: 361 tokens

encode_image_with_clip: image encoded in  3249.79 ms by CLIP (    9.00 ms per image patch)

The image appears to be a logo or a symbol, but it is not clear what it represents. It could be a brand logo, a company logo, or a symbol for a specific organization or group. Without additional context or information, it is difficult to determine the exact meaning or purpose of the image.
llama_perf_context_print:        load time =    9346.17 ms
llama_perf_context_print: prompt eval time =    1009.47 ms /   391 tokens (    2.58 ms per token,   387.33 tokens per second)
llama_perf_context_print:        eval time =    1500.12 ms /    61 runs   (   24.59 ms per token,    40.66 tokens per second)
llama_perf_context_print:       total time =   10889.94 ms /   452 tokens

CPU backend for comparison

> .\buildcpu\bin\Release\llama-qwen2vl-cli.exe -m .\models\Qwen2-VL-7B-Instruct-IQ4_NL.gguf --mmproj .\models\mmproj-Qwen2-VL-7B-Instruct-f32.gguf -p 'What could be the context of this image.' --image '.\Pictures\Untitled.png' --seed 0 --temp 0
[...]
encode_image_with_clip: step 1 of 1 encoded in  8483.38 ms
encode_image_with_clip: all 1 segments encoded in  8483.47 ms
encode_image_with_clip: load_image_size 512 512
encode_image_with_clip: image embedding created: 361 tokens

encode_image_with_clip: image encoded in  8484.85 ms by CLIP (   23.50 ms per image patch)

The image appears to be a simple text-based graphic with the words "READABLE TEXT" written in a bold, black font. The context of this image could be related to demonstrating or emphasizing the importance of clear and legible text, possibly in the context of design, typography, or user interface (UI) design. It might be used to highlight the importance of making text easy to read and understand for users.
llama_perf_context_print:        load time =   21741.16 ms
llama_perf_context_print: prompt eval time =   10924.92 ms /   391 tokens (   27.94 ms per token,    35.79 tokens per second)
llama_perf_context_print:        eval time =    8322.39 ms /    83 runs   (  100.27 ms per token,     9.97 tokens per second)
llama_perf_context_print:       total time =   30185.33 ms /   474 tokens
