
Bug: GGML_ASSERT((qs.n_attention_wv == n_attn_layer) && "n_attention_wv is unexpected") failed with deepseek2 #9155

Closed
@mann1x

Description

What happened?

The b3614 release ("simplify Mamba with advanced batch splits", #8526) broke quantization for the deepseek2 architecture.
Rolling back to b3613 works fine.
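
For context, the failing check compares two counts: the number of V-projection tensors seen while scanning the model (`qs.n_attention_wv`) and the number of attention layers derived from the hyperparameters (`n_attn_layer`). The standalone sketch below is a paraphrase, not the upstream code; the tensor names follow the deepseek2 GGUF layout visible in the metadata dump below, and the `n_attn_layer` rule is my reading of what #8526 changed, so treat both as assumptions.

```cpp
// Standalone sketch (NOT llama.cpp source) of the two counts that the
// failing GGML_ASSERT compares. The n_attn_layer rule is a paraphrase of
// PR #8526; the tensor names mimic the deepseek2 GGUF layout.
#include <cstdio>
#include <string>
#include <vector>

int main() {
    const int n_layer   = 27;  // deepseek2.block_count from the metadata dump
    const int n_head_kv = 16;  // deepseek2.attention.head_count_kv, same for every layer

    // DeepSeek-V2(-Lite) uses multi-head latent attention: K and V are
    // produced through low-rank kv_a/kv_b projections, so the GGUF carries
    // no per-layer "attn_v.weight" tensor at all.
    std::vector<std::string> tensor_names;
    for (int i = 0; i < n_layer; ++i) {
        tensor_names.push_back("blk." + std::to_string(i) + ".attn_kv_a_mqa.weight");
        tensor_names.push_back("blk." + std::to_string(i) + ".attn_kv_b.weight");
    }

    // What the quantizer counts (paraphrased): V-projection tensors by name.
    int n_attention_wv = 0;
    for (const auto & name : tensor_names) {
        if (name.find("attn_v.weight") != std::string::npos) {
            ++n_attention_wv;
        }
    }

    // What #8526 appears to expect (paraphrased): every layer with a non-zero
    // KV head count is an attention layer (layers with 0 KV heads are
    // recurrent, e.g. Mamba blocks).
    const int n_attn_layer = n_head_kv > 0 ? n_layer : 0;

    // 0 != 27 -> this mismatch is the condition behind the GGML_ASSERT abort.
    std::printf("n_attention_wv = %d, n_attn_layer = %d\n", n_attention_wv, n_attn_layer);
    return n_attention_wv == n_attn_layer ? 0 : 1;
}
```

With head_count_kv = 16 reported for all 27 blocks, the new rule expects 27 attention layers, while deepseek2's latent-attention layout offers no `attn_v.weight` tensors to count, so the assert fires during quantization.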

Name and Version

llama-cli --version
version: 3614 (a1631e5)
built with cc (Debian 10.2.1-6) 10.2.1 20210110 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

main: build = 3614 (a1631e53)
main: built with cc (Debian 10.2.1-6) 10.2.1 20210110 for x86_64-linux-gnu
main: quantizing 'deepseek-coder-v2-lite-instruct.fp32.bin' to 'deepseek-coder-v2-lite-instruct.Q5_0.gguf' as Q5_0
llama_model_loader: loaded meta data with 44 key-value pairs and 377 tensors from deepseek-coder-v2-lite-instruct.fp32.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = ..
llama_model_loader: - kv   3:                           general.finetune str              = ..
llama_model_loader: - kv   4:                         general.size_label str              = 64x1.5B
llama_model_loader: - kv   5:                            general.license str              = other
llama_model_loader: - kv   6:                       general.license.name str              = deepseek-license
llama_model_loader: - kv   7:                       general.license.link str              = LICENSE
llama_model_loader: - kv   8:                      deepseek2.block_count u32              = 27
llama_model_loader: - kv   9:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv  10:                 deepseek2.embedding_length u32              = 2048
llama_model_loader: - kv  11:              deepseek2.feed_forward_length u32              = 10944
llama_model_loader: - kv  12:             deepseek2.attention.head_count u32              = 16
llama_model_loader: - kv  13:          deepseek2.attention.head_count_kv u32              = 16
llama_model_loader: - kv  14:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  15: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  16:                deepseek2.expert_used_count u32              = 6
llama_model_loader: - kv  17:                          general.file_type u32              = 0
llama_model_loader: - kv  18:        deepseek2.leading_dense_block_count u32              = 1
llama_model_loader: - kv  19:                       deepseek2.vocab_size u32              = 102400
llama_model_loader: - kv  20:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  21:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  22:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  23:       deepseek2.expert_feed_forward_length u32              = 1408
llama_model_loader: - kv  24:                     deepseek2.expert_count u32              = 64
llama_model_loader: - kv  25:              deepseek2.expert_shared_count u32              = 2
llama_model_loader: - kv  26:             deepseek2.expert_weights_scale f32              = 1.000000
llama_model_loader: - kv  27:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  28:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  29:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  30: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  31: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.070700
llama_model_loader: - kv  32:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  33:                         tokenizer.ggml.pre str              = deepseek-llm
llama_model_loader: - kv  34:                      tokenizer.ggml.tokens arr[str,102400]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  35:                  tokenizer.ggml.token_type arr[i32,102400]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  36:                      tokenizer.ggml.merges arr[str,99757]   = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
llama_model_loader: - kv  37:                tokenizer.ggml.bos_token_id u32              = 100000
llama_model_loader: - kv  38:                tokenizer.ggml.eos_token_id u32              = 100001
llama_model_loader: - kv  39:            tokenizer.ggml.padding_token_id u32              = 100001
llama_model_loader: - kv  40:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  41:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  42:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  43:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  377 tensors
/shared/dev/llama.cpp/src/llama.cpp:16840: GGML_ASSERT((qs.n_attention_wv == n_attn_layer) && "n_attention_wv is unexpected") failed
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f207e755746 in __GI___wait4 (pid=271293, stat_loc=0x7ffdfaa194c4, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:27
27      ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0  0x00007f207e755746 in __GI___wait4 (pid=271293, stat_loc=0x7ffdfaa194c4, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:27
27      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x000055a032cd37a9 in ggml_abort ()
#2  0x000055a032be7197 in llama_model_quantize_internal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>,
 std::allocator<char> > const&, llama_model_quantize_params const*) ()
#3  0x000055a032be74d5 in llama_model_quantize ()
#4  0x000055a032b769fa in main ()

Metadata

Labels

bug-unconfirmed · medium severity (used to report medium severity bugs in llama.cpp, e.g. malfunctioning features but still usable)
