Bug: Failed to run qwen2-57b-a14b-instruct-fp16. #9628

Open
@tang-t21

What happened?

I am trying to run Qwen2-57B-A14B-Instruct. I used llama-gguf-split to merge the GGUF files from Qwen/Qwen2-57B-A14B-Instruct-GGUF into a single file, but running the merged model aborts with `terminate called after throwing an instance of 'std::length_error'`, `what(): vector::_M_default_append`, and finally `Aborted (core dumped)`.
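Because the file was produced by merging splits, one quick sanity check is to read the fixed-size GGUF header and compare it with what the loader prints. The sketch below is only an illustration and is not part of llama.cpp. It assumes a little-endian GGUF v2/v3 file, and `read_gguf_header` is a hypothetical helper name.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + u32 version + u64 tensor count + u64 KV count


def read_gguf_header(path):
    """Return (version, tensor_count, kv_count) from a GGUF v2/v3 header.

    Assumes a little-endian file; raises ValueError if the header is
    truncated or the magic bytes do not match.
    """
    with open(path, "rb") as f:
        raw = f.read(HEADER_SIZE)
    if len(raw) < HEADER_SIZE:
        raise ValueError("file too short to contain a GGUF header")
    magic, version, n_tensors, n_kv = struct.unpack("<4sIQQ", raw)
    if magic != GGUF_MAGIC:
        raise ValueError(f"bad magic: {magic!r}")
    return version, n_tensors, n_kv
```

For the merged file in this report, `read_gguf_header("./models/qwen2-57b-a14b-instruct-fp16.gguf")` would be expected to agree with the loader output below (GGUF V3, 479 tensors, 28 key-value pairs). A mismatch would point at the merge step rather than at inference.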

Name and Version

./build/bin/llama-cli --version
version: 3808 (699a0dc)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

`(llama) root@201edf3683be:/home/llama.cpp# ./build/bin/llama-cli -m ./models/qwen2-57b-a14b-instruct-fp16.gguf -p "Beijing is the capital of" -n 64 -c 4096
build: 3808 (699a0dc1) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 28 key-value pairs and 479 tensors from ./models/qwen2-57b-a14b-instruct-fp16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2moe
llama_model_loader: - kv   1:                               general.name str              = Qwen2-MoE-A14.2B-Chat
llama_model_loader: - kv   2:                       qwen2moe.block_count u32              = 28
llama_model_loader: - kv   3:                    qwen2moe.context_length u32              = 32768
llama_model_loader: - kv   4:                  qwen2moe.embedding_length u32              = 3584
llama_model_loader: - kv   5:              qwen2moe.attention.head_count u32              = 28
llama_model_loader: - kv   6:           qwen2moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv   7:                    qwen2moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv   8:  qwen2moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                 qwen2moe.expert_used_count u32              = 8
llama_model_loader: - kv  10:                      qwen2moe.expert_count u32              = 64
llama_model_loader: - kv  11:        qwen2moe.expert_feed_forward_length u32              = 2560
llama_model_loader: - kv  12:               qwen2moe.feed_forward_length u32              = 20480
llama_model_loader: - kv  13:                          general.file_type u32              = 1
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  21:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {% for message in messages %}{{'<|im_...
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - kv  25:                                   split.no u16              = 0
llama_model_loader: - kv  26:                                split.count u16              = 0
llama_model_loader: - kv  27:                        split.tensors.count i32              = 479
llama_model_loader: - type  f32:  197 tensors
llama_model_loader: - type  f16:  282 tensors
llm_load_vocab: special tokens cache size = 293
llm_load_vocab: token to piece cache size = 0.9338 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2moe
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 28
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 20480
llm_load_print_meta: n_expert         = 64
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 57B.A14B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 57.41 B
llm_load_print_meta: model size       = 106.94 GiB (16.00 BPW) 
llm_load_print_meta: general.name     = Qwen2-MoE-A14.2B-Chat
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_ff_exp         = 2560
llm_load_print_meta: n_ff_shexp       = 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 1: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 2: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 3: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.20 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/29 layers to GPU
llm_load_tensors:        CPU buffer size = 109511.40 MiB
.............................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   224.00 MiB
llama_new_context_with_model: KV self size  =  224.00 MiB, K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 1349.38 MiB
ggml_gallocr_reserve_n: reallocating CUDA1 buffer from size 0.00 MiB to 0.00 MiB
ggml_gallocr_reserve_n: reallocating CUDA2 buffer from size 0.00 MiB to 0.00 MiB
ggml_gallocr_reserve_n: reallocating CUDA3 buffer from size 0.00 MiB to 0.00 MiB
ggml_gallocr_reserve_n: reallocating CUDA_Host buffer from size 0.00 MiB to 15.01 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1349.38 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    15.01 MiB
llama_new_context_with_model: graph nodes  = 1910
llama_new_context_with_model: graph splits = 536
llama_init_from_gpt_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
main: llama threadpool init, n_threads = 128

system_info: n_threads = 128 (n_threads_batch = 128) / 255 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 

terminate called after throwing an instance of 'std::length_error'
  what():  vector::_M_default_append
Aborted (core dumped)`
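The KV-cache sizes in the log can be reproduced from the hyperparameters it prints. That suggests the metadata was parsed sensibly and the abort happens later, although it does not identify the cause. This is a minimal sketch, assuming one K vector and one V vector per layer per context position stored as 2-byte f16 values. `kv_cache_mib` is a hypothetical helper, not a llama.cpp function.

```python
def kv_cache_mib(n_layer, n_ctx, n_embd_k_gqa, n_embd_v_gqa, bytes_per_elem=2):
    """Return (K size, V size) in MiB for a dense f16 KV cache.

    Assumes every layer stores n_ctx K vectors and n_ctx V vectors,
    each with n_embd_k_gqa / n_embd_v_gqa elements.
    """
    mib = 2 ** 20
    k = n_layer * n_ctx * n_embd_k_gqa * bytes_per_elem / mib
    v = n_layer * n_ctx * n_embd_v_gqa * bytes_per_elem / mib
    return k, v


# Values taken from the log above: n_layer = 28, n_ctx = 4096,
# n_embd_k_gqa = n_embd_v_gqa = 512.
k_mib, v_mib = kv_cache_mib(28, 4096, 512, 512)
print(f"K = {k_mib:.2f} MiB, V = {v_mib:.2f} MiB, total = {k_mib + v_mib:.2f} MiB")
```

With the values from the log this gives 112 MiB each for K and V, 224 MiB in total, matching the `KV self size` line.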


Labels: bug, good first issue, high severity
