Eval bug: Mistral Small Multimodal fails when used with the Vulkan backend #13778

Closed
@ddpasa

Name and Version

Mistral Small 3.1 works fine with the CPU backend. I'm using -ngl 0 because decoding is faster on the CPU, but the iGPU gives a massive speedup in image processing. Both GGUFs were downloaded from the ggml-org Hugging Face repo.

Vulkan inference works fine with moondream2, glm, internvlm, smolvlm and qwen2.5vlm.

The error is:
llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:6524: GGML_ASSERT(ggml_vk_op_supports_incontiguous(op) || ggml_vk_dim01_contiguous(src0)) failed

llama.cpp version:

llama-mtmd-cli --version

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Plus Graphics (ICL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | shared memory: 65536 | int dot: 0 | matrix cores: none
version: 5471 (ffd0eae6)
built with cc (GCC) 15.1.1 20250425 for x86_64-pc-linux-gnu

Full log is:

llama.cpp/build_vulkan/bin/llama-mtmd-cli -ngl 0 -m /vlms/mistral-small-31-24b-text-IQ2_M.gguf --mmproj vlms/mistral-small-31-24b-mmproj-f16.gguf --image /tmp/tmp5po9zi2y.jpg -p 'Describe this image in more than 10 words but less than 50 words.'
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Plus Graphics (ICL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | shared memory: 65536 | int dot: 0 | matrix cores: none
build: 5471 (ffd0eae6) with cc (GCC) 15.1.1 20250425 for x86_64-pc-linux-gnu
llama_model_load_from_file_impl: using device Vulkan0 (Intel(R) Iris(R) Plus Graphics (ICL GT2)) - 7771 MiB free
llama_model_loader: loaded meta data with 39 key-value pairs and 363 tensors from vlms/mistral-small-31-24b-text-IQ2_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Mistral-Small-3.1-24B-Instruct-2503
llama_model_loader: - kv   3:                            general.version str              = 2503
llama_model_loader: - kv   4:                           general.finetune str              = Instruct
llama_model_loader: - kv   5:                           general.basename str              = Mistral-Small-3.1-24B-Instruct-2503
llama_model_loader: - kv   6:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   7:                         general.size_label str              = 24B
llama_model_loader: - kv   8:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   9:                          llama.block_count u32              = 40
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 32768
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 1000000000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  18:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 131072
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = tekken
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,131072]  = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,131072]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,269443]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  28:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 11
llama_model_loader: - kv  30:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  31:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  32:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - kv  34:                          general.file_type u32              = 29
llama_model_loader: - kv  35:                      quantize.imatrix.file str              = Mistral-Small-3.1-24B-Instruct-2503-G...
llama_model_loader: - kv  36:                   quantize.imatrix.dataset str              = unsloth_calibration_Mistral-Small-3.1...
llama_model_loader: - kv  37:             quantize.imatrix.entries_count i32              = 280
llama_model_loader: - kv  38:              quantize.imatrix.chunks_count i32              = 55
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q3_K:    1 tensors
llama_model_loader: - type q4_K:   40 tensors
llama_model_loader: - type q5_K:    1 tensors
llama_model_loader: - type iq2_xs:   22 tensors
llama_model_loader: - type iq3_xxs:   72 tensors
llama_model_loader: - type iq3_s:   58 tensors
llama_model_loader: - type iq2_s:   88 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = IQ2_M - 2.7 bpw
print_info: file size   = 8.15 GiB (2.97 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 1000
load: token to piece cache size = 0.8498 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 5120
print_info: n_layer          = 40
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 32768
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 13B
print_info: model params     = 23.57 B
print_info: general.name     = Mistral-Small-3.1-24B-Instruct-2503
print_info: vocab type       = BPE
print_info: n_vocab          = 131072
print_info: n_merges         = 269443
print_info: BOS token        = 1 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 0 '<unk>'
print_info: PAD token        = 11 '<pad>'
print_info: LF token         = 1010 'Ċ'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 150
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/41 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  8342.83 MiB
..............................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.50 MiB
llama_kv_cache_unified:        CPU KV buffer size =   640.00 MiB
llama_kv_cache_unified: size =  640.00 MiB (  4096 cells,  40 layers,  1 seqs), K (f16):  320.00 MiB, V (f16):  320.00 MiB
llama_context:    Vulkan0 compute buffer size =   706.00 MiB
llama_context: Vulkan_Host compute buffer size =    18.01 MiB
llama_context: graph nodes  = 1446
llama_context: graph splits = 444 (with bs=512), 1 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Failed to infer a tool call example (possible template bug)
Failed to infer a tool call example (possible template bug)
mtmd_cli_context: chat template example:
[SYSTEM_PROMPT]You are a helpful assistant[/SYSTEM_PROMPT][INST]Hello[/INST]Hi there</s>[INST]How are you?[/INST]
clip_ctx: CLIP using Vulkan0 backend
clip_model_loader: model name:   
clip_model_loader: description:  
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    223
clip_model_loader: n_kv:         27

load_hparams: projector:          pixtral
load_hparams: has_vision_encoder: 1
load_hparams: has_audio_encoder:  0
load_hparams: n_embd:             1024
load_hparams: n_head:             16
load_hparams: n_ff:               4096
load_hparams: n_layer:            24
load_hparams: ffn_op:             gelu
load_hparams: projection_dim:     5120

load_hparams: image_size:         1540
load_hparams: patch_size:         14
load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: proj_scale_factor:  0
load_hparams: n_wa_pattern:       0

load_hparams: model size:         837.36 MiB
load_hparams: metadata size:      0.08 MiB
alloc_compute_meta:    Vulkan0 compute buffer size =     2.97 MiB
alloc_compute_meta:        CPU compute buffer size =     0.14 MiB
llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:6524: GGML_ASSERT(ggml_vk_op_supports_incontiguous(op) || ggml_vk_dim01_contiguous(src0)) failed

Operating systems

Linux

GGML backends

Vulkan

Hardware

Intel Iris Plus Graphics G7 on i7-1065G7
i915 driver on Linux 6.14.6
Mesa version 25.0.5 supporting Vulkan 1.4.305

Models

https://huggingface.co/ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF

Problem description & steps to reproduce

llama.cpp/build_vulkan/bin/llama-mtmd-cli -ngl 0 -m /vlms/mistral-small-31-24b-text-IQ2_M.gguf --mmproj vlms/mistral-small-31-24b-mmproj-f16.gguf --image /tmp/tmp5po9zi2y.jpg -p 'Describe this image in more than 10 words but less than 50 words.'

fails with:

llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:6524: GGML_ASSERT(ggml_vk_op_supports_incontiguous(op) || ggml_vk_dim01_contiguous(src0)) failed
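
For readers unfamiliar with the assertion, here is a minimal sketch of what a "dim 0/1 contiguous" check could look like. Everything in it is an assumption made for illustration: `TensorView`, `make_contiguous`, `permute` and `dim01_contiguous` are hypothetical names, not the real ggml types or the actual body of `ggml_vk_dim01_contiguous`, which is not shown in this issue. The idea is that an operation without support for non-contiguous inputs needs the first two dimensions of `src0` to be densely packed, and a permuted view (no data moved, only strides swapped) would break that.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

// Hypothetical, simplified stand-in for a tensor: element counts (ne) and
// byte strides (nb) per dimension. This is NOT the real ggml_tensor struct.
struct TensorView {
    std::array<int64_t, 4> ne;  // number of elements along each dimension
    std::array<size_t, 4>  nb;  // stride in bytes along each dimension
    size_t elem_size;           // bytes per element (block quantization ignored)
};

// Build a densely packed view: each stride is the previous stride times the
// previous dimension's element count.
TensorView make_contiguous(int64_t n0, int64_t n1, int64_t n2, int64_t n3,
                           size_t elem_size) {
    assert(elem_size > 0);
    TensorView t{};
    t.ne = {n0, n1, n2, n3};
    t.elem_size = elem_size;
    t.nb[0] = elem_size;
    for (int i = 1; i < 4; ++i) {
        t.nb[i] = t.nb[i - 1] * static_cast<size_t>(t.ne[i - 1]);
    }
    return t;
}

// Swap two dimensions without moving any data, the way a permuted view would.
TensorView permute(TensorView t, int a, int b) {
    std::swap(t.ne[a], t.ne[b]);
    std::swap(t.nb[a], t.nb[b]);
    return t;
}

// Assumed meaning of "dim01 contiguous": elements of a row are tightly
// packed, and consecutive rows follow each other with no gap.
bool dim01_contiguous(const TensorView& t) {
    return t.nb[0] == t.elem_size &&
           t.nb[1] == t.nb[0] * static_cast<size_t>(t.ne[0]);
}
```

Under these assumptions, a freshly built view passes the check, a view with dimensions 0 and 1 swapped fails it, and swapping only dimensions 2 and 3 still passes. If the real check behaves similarly, the failure would point at a non-contiguous `src0` reaching an op that the Vulkan backend does not handle for that layout, but that is a guess from the function names, not something confirmed by the log.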

First Bad Commit

No response

Relevant log output

See above
