
Eval bug: Qwen2-VL Hallucinates image content on Vulkan backend #10843

Closed
@stduhpf

Description


Name and Version

.\build\bin\Release\llama-cli.exe --version

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
version: 4329 (89d604f)
built with MSVC 19.41.34120.0 for x64

Operating systems

Windows

GGML backends

Vulkan

Hardware

Ryzen 5900X + RX 5700 XT

Models

Qwen2-VL-7B-Instruct-IQ4_NL + mmproj-Qwen2-VL-7B-Instruct-f32

Problem description & steps to reproduce

When I run it with the Vulkan build, the description given by the model has nothing to do with the image passed as an argument (no matter the -ngl value; even -ngl 0 is broken). The exact same setup works perfectly fine with the CPU backend.

I know the Vulkan backend doesn't support Qwen2-VL yet, but according to #10361 (comment), this should only cause slowdowns, not invalid outputs.
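One way to narrow this down would be to compare the raw image embedding produced by the CLIP/mmproj encoder on each backend. This is only a rough sketch: the dump files cpu_embd.bin and vulkan_embd.bin are hypothetical and would require temporarily adding an fwrite of the embedding buffer in clip.cpp for each run; the comparison program below is generic and not part of llama.cpp.

// compare_embd.cpp - report how far apart two raw float32 embedding dumps are
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <vector>

static std::vector<float> load_f32(const char * path) {
    FILE * f = std::fopen(path, "rb");
    if (!f) { std::fprintf(stderr, "cannot open %s\n", path); std::exit(1); }
    std::fseek(f, 0, SEEK_END);
    const long n = std::ftell(f) / (long) sizeof(float);
    std::fseek(f, 0, SEEK_SET);
    std::vector<float> v((size_t) n);
    if (std::fread(v.data(), sizeof(float), v.size(), f) != v.size()) {
        std::fprintf(stderr, "short read on %s\n", path);
        std::exit(1);
    }
    std::fclose(f);
    return v;
}

int main(int argc, char ** argv) {
    if (argc != 3) {
        std::fprintf(stderr, "usage: %s cpu_embd.bin vulkan_embd.bin\n", argv[0]);
        return 1;
    }
    const std::vector<float> a = load_f32(argv[1]);
    const std::vector<float> b = load_f32(argv[2]);
    if (a.size() != b.size()) {
        std::fprintf(stderr, "size mismatch: %zu vs %zu\n", a.size(), b.size());
        return 1;
    }
    double max_diff = 0.0, sum_diff = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        const double d = std::fabs((double) a[i] - (double) b[i]);
        if (d > max_diff) max_diff = d;
        sum_diff += d;
    }
    std::printf("n = %zu, max |diff| = %g, mean |diff| = %g\n", a.size(), max_diff, sum_diff / (double) a.size());
    return 0;
}

If the two dumps differ by far more than fp16 rounding noise, that would point at the image encoder path rather than the language model, which would also be consistent with -ngl 0 being broken.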

Relevant log output

Image input:

[attached image: Untitled.png, 512x512]

-ngl 0

> .\build\bin\Release\llama-qwen2vl-cli.exe -m .\models\Qwen2-VL-7B-Instruct-IQ4_NL.gguf --mmproj .\models\mmproj-Qwen2-VL-7B-Instruct-f32.gguf -p 'What could be the context of this image.' --image '.\Pictures\Untitled.png' --seed 0 --temp 0
[...]
encode_image_with_clip: step 1 of 1 encoded in   843.10 ms
encode_image_with_clip: all 1 segments encoded in   843.17 ms
encode_image_with_clip: load_image_size 512 512
encode_image_with_clip: image embedding created: 361 tokens

encode_image_with_clip: image encoded in   845.06 ms by CLIP (    2.34 ms per image patch)

The image shows a person wearing a black and white striped shirt, a black jacket, and black pants, standing in front of a black background. The person is also holding a black and white striped umbrella. The context of this image could be a fashion or clothing advertisement, showcasing the person's outfit and accessories. The black and white striped shirt, jacket, and umbrella create a monochromatic look, which is often used in fashion photography to emphasize the clothing and accessories. The black background helps to highlight the person and their outfit, making them the focal point of the image.
llama_perf_context_print:        load time =    6644.91 ms
llama_perf_context_print: prompt eval time =    2276.84 ms /   391 tokens (    5.82 ms per token,   171.73 tokens per second)
llama_perf_context_print:        eval time =   11500.85 ms /   115 runs   (  100.01 ms per token,    10.00 tokens per second)
llama_perf_context_print:       total time =   18275.28 ms /   506 tokens

-ngl 99

> .\build\bin\Release\llama-qwen2vl-cli.exe -m .\models\Qwen2-VL-7B-Instruct-IQ4_NL.gguf --mmproj .\models\mmproj-Qwen2-VL-7B-Instruct-f32.gguf -p 'What could be the context of this image.' --image '.\Pictures\Untitled.png' --seed 0 --temp 0 -ngl 99
[...]
encode_image_with_clip: step 1 of 1 encoded in  3248.68 ms
encode_image_with_clip: all 1 segments encoded in  3248.76 ms
encode_image_with_clip: load_image_size 512 512
encode_image_with_clip: image embedding created: 361 tokens

encode_image_with_clip: image encoded in  3249.79 ms by CLIP (    9.00 ms per image patch)

The image appears to be a logo or a symbol, but it is not clear what it represents. It could be a brand logo, a company logo, or a symbol for a specific organization or group. Without additional context or information, it is difficult to determine the exact meaning or purpose of the image.
llama_perf_context_print:        load time =    9346.17 ms
llama_perf_context_print: prompt eval time =    1009.47 ms /   391 tokens (    2.58 ms per token,   387.33 tokens per second)
llama_perf_context_print:        eval time =    1500.12 ms /    61 runs   (   24.59 ms per token,    40.66 tokens per second)
llama_perf_context_print:       total time =   10889.94 ms /   452 tokens

CPU backend for comparison

> .\buildcpu\bin\Release\llama-qwen2vl-cli.exe -m .\models\Qwen2-VL-7B-Instruct-IQ4_NL.gguf --mmproj .\models\mmproj-Qwen2-VL-7B-Instruct-f32.gguf -p 'What could be the context of this image.' --image '.\Pictures\Untitled.png' --seed 0 --temp 0
[...]
encode_image_with_clip: step 1 of 1 encoded in  8483.38 ms
encode_image_with_clip: all 1 segments encoded in  8483.47 ms
encode_image_with_clip: load_image_size 512 512
encode_image_with_clip: image embedding created: 361 tokens

encode_image_with_clip: image encoded in  8484.85 ms by CLIP (   23.50 ms per image patch)

The image appears to be a simple text-based graphic with the words "READABLE TEXT" written in a bold, black font. The context of this image could be related to demonstrating or emphasizing the importance of clear and legible text, possibly in the context of design, typography, or user interface (UI) design. It might be used to highlight the importance of making text easy to read and understand for users.
llama_perf_context_print:        load time =   21741.16 ms
llama_perf_context_print: prompt eval time =   10924.92 ms /   391 tokens (   27.94 ms per token,    35.79 tokens per second)
llama_perf_context_print:        eval time =    8322.39 ms /    83 runs   (  100.27 ms per token,     9.97 tokens per second)
llama_perf_context_print:       total time =   30185.33 ms /   474 tokens
