Bug: llama-server crashes while serving a Mamba model - GGML_ASSERT(cell.has_seq_id(seq_id)) failed #9170

Closed

Description

@jploski

What happened?

While serving a Mamba model with llama-server, e.g.

/mnt/seagate/dalai/llama.cpp/llama-server --port 11111 -m /mnt/f2fs/mamba/tinystories.gguf -t 1 -c 8192 

the first completion request is served correctly, but the second request crashes the server with a "GGML_ASSERT(cell.has_seq_id(seq_id)) failed" message.

Name and Version

version: 3620 (e11bd85)
built with cc (Debian 8.3.0-6) 8.3.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

First request body to POST /completion:

{"prompt": "<|im_start|> system\\nYou are a friendly assistant, ready to answer your users\' questions and follow their instructions,\\nmaking sure you agree with the user\'s expressed sentiment. Do not refer to yourself as AI.\\n<|im_end|><|im_start|>User\\nSarah and Ben were playing with the snow.\\n\\n<|im_end|>\\n<|im_start|>AI\\n", "temperature": 1.0, "top_p": 0.98, "repeat_penalty": 1.3, "repeat_last_n": 64, "n_predict": 2048, "stop": ["<|im_end|>"], "stream": true}

Second request body to POST /completion (its prompt is the first request's prompt followed by the text generated for it):

{"prompt": "<|im_start|> system\\nYou are a friendly assistant, ready to answer your users\' questions and follow their instructions,\\nmaking sure you agree with the user\'s expressed sentiment. Do not refer to yourself as AI.\\n<|im_end|><|im_start|>User\\nSarah and Ben were playing with the snow.\\n\\n<|im_end|>\\n<|im_start|>AI\\n\\"It\'s nice to see them play, but I like it too!\\"\\nBen asked. He wanted a good feeling and was excited for the game to start.\\nBut when he went inside the house, the door wasn\'t open. He opened another room with no one in there.\\nWhen he got back home, he saw that everything was gone!\\nHe felt sad.\\n\\"I wish I had someone like me!\\"\\n<|im_end|>\\n<|im_start|>User\\nSecond request.\\n\\n<|im_end|>\\n<|im_start|>AI\\n", "temperature": 1.0, "top_p": 0.98, "repeat_penalty": 1.3, "repeat_last_n": 64, "n_predict": 2048, "stop": ["<|im_end|>"], "stream": true}

Output from server process:

INFO [                    main] build info | tid="140478308075456" timestamp=1724599933 build=3620 commit="e11bd856"
INFO [                    main] system info | tid="140478308075456" timestamp=1724599933 n_threads=1 n_threads_batch=-1 total_threads=8 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
INFO [                    main] HTTP server is listening | tid="140478308075456" timestamp=1724599933 n_threads_http="7" hostname="127.0.0.1" port="11111"
INFO [                    main] loading model | tid="140478308075456" timestamp=1724599933 n_threads_http="7" hostname="127.0.0.1" port="11111"
llama_model_loader: loaded meta data with 23 key-value pairs and 242 tensors from /mnt/f2fs/mamba/tinystories.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = mamba
llama_model_loader: - kv   1:                               general.name str              = results
llama_model_loader: - kv   2:                       mamba.context_length u32              = 1048576
llama_model_loader: - kv   3:                     mamba.embedding_length u32              = 768
llama_model_loader: - kv   4:                  mamba.feed_forward_length u32              = 0
llama_model_loader: - kv   5:                 mamba.attention.head_count u32              = 0
llama_model_loader: - kv   6:                          mamba.block_count u32              = 24
llama_model_loader: - kv   7:                      mamba.ssm.conv_kernel u32              = 4
llama_model_loader: - kv   8:                       mamba.ssm.inner_size u32              = 1536
llama_model_loader: - kv   9:                       mamba.ssm.state_size u32              = 16
llama_model_loader: - kv  10:                   mamba.ssm.time_step_rank u32              = 48
llama_model_loader: - kv  11:     mamba.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                          general.file_type u32              = 7
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = mpt
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,50280]   = ["<|endoftext|>", "<|padding|>", "!",...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,50280]   = [3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,50009]   = ["Ġ Ġ", "Ġ t", "Ġ a", "h e", "i n...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 0
llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  193 tensors
llama_model_loader: - type q8_0:   49 tensors
llm_load_vocab: special tokens cache size = 28
llm_load_vocab: token to piece cache size = 0.2984 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = mamba
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 50280
llm_load_print_meta: n_merges         = 50009
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 1048576
llm_load_print_meta: n_embd           = 768
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_head           = 0
llm_load_print_meta: n_head_kv        = 0
llm_load_print_meta: n_rot            = 0
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 0
llm_load_print_meta: n_embd_head_v    = 0
llm_load_print_meta: n_gqa            = 0
llm_load_print_meta: n_embd_k_gqa     = 0
llm_load_print_meta: n_embd_v_gqa     = 0
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 0
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = -1
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 1048576
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 4
llm_load_print_meta: ssm_d_inner      = 1536
llm_load_print_meta: ssm_d_state      = 16
llm_load_print_meta: ssm_dt_rank      = 48
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 0.1B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 129.14 M
llm_load_print_meta: model size       = 146.50 MiB (9.52 BPW) 
llm_load_print_meta: general.name     = results
llm_load_print_meta: BOS token        = 0 '<|endoftext|>'
llm_load_print_meta: EOS token        = 0 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<|endoftext|>'
llm_load_print_meta: PAD token        = 0 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 0 '<|endoftext|>'
llm_load_print_meta: max token length = 1024
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors:        CPU buffer size =   146.50 MiB
.....................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =     5.34 MiB
llama_new_context_with_model: KV self size  =    5.34 MiB, K (f32):    0.84 MiB, V (f32):    4.50 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.38 MiB
llama_new_context_with_model:        CPU compute buffer size =   102.93 MiB
llama_new_context_with_model: graph nodes  = 1182
llama_new_context_with_model: graph splits = 1
INFO [                    init] initializing slots | tid="140478308075456" timestamp=1724599933 n_slots=1
INFO [                    init] new slot | tid="140478308075456" timestamp=1724599933 id_slot=0 n_ctx_slot=8192
INFO [                    main] model loaded | tid="140478308075456" timestamp=1724599933
INFO [                    main] chat template | tid="140478308075456" timestamp=1724599933 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
INFO [            update_slots] all slots are idle | tid="140478308075456" timestamp=1724599933
INFO [   launch_slot_with_task] slot is processing task | tid="140478308075456" timestamp=1724599947 id_slot=0 id_task=0
INFO [            update_slots] kv cache rm [p0, end) | tid="140478308075456" timestamp=1724599947 id_slot=0 id_task=0 p0=0
INFO [           print_timings] prompt eval time     =     881.99 ms /    87 tokens (   10.14 ms per token,    98.64 tokens per second) | tid="140478308075456" timestamp=1724599950 id_slot=0 id_task=0 t_prompt_processing=881.99 n_prompt_tokens_processed=87 t_token=10.137816091954024 n_tokens_second=98.64057415616956
INFO [           print_timings] generation eval time =    2167.03 ms /    88 runs   (   24.63 ms per token,    40.61 tokens per second) | tid="140478308075456" timestamp=1724599950 id_slot=0 id_task=0 t_token_generation=2167.025 n_decoded=88 t_token=24.62528409090909 n_tokens_second=40.608668566352485
INFO [           print_timings]           total time =    3049.02 ms | tid="140478308075456" timestamp=1724599950 id_slot=0 id_task=0 t_prompt_processing=881.99 t_token_generation=2167.025 t_total=3049.0150000000003
INFO [            update_slots] slot released | tid="140478308075456" timestamp=1724599950 id_slot=0 id_task=0 n_ctx=8192 n_past=174 n_system_tokens=0 n_cache_tokens=0 truncated=false
INFO [            update_slots] all slots are idle | tid="140478308075456" timestamp=1724599950
INFO [      log_server_request] request | tid="140478299678464" timestamp=1724599950 remote_addr="127.0.0.1" remote_port=35486 status=200 method="POST" path="/completion" params={}
INFO [            update_slots] all slots are idle | tid="140478308075456" timestamp=1724599950
INFO [   launch_slot_with_task] slot is processing task | tid="140478308075456" timestamp=1724599956 id_slot=0 id_task=90
INFO [            update_slots] kv cache rm [p0, end) | tid="140478308075456" timestamp=1724599956 id_slot=0 id_task=90 p0=0
src/llama.cpp:3504: GGML_ASSERT(cell.has_seq_id(seq_id)) failed
No symbol table is loaded.  Use the "file" command.
[New LWP 27301]
[New LWP 27302]
[New LWP 27303]
[New LWP 27304]
[New LWP 27305]
[New LWP 27306]
[New LWP 27307]
[New LWP 27308]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fc3a7c490ca in __waitpid (pid=27347, stat_loc=stat_loc@entry=0x7ffe5b376b64, options=options@entry=0) at ../sysdeps/unix/sysv/linux/waitpid.c:30
30	../sysdeps/unix/sysv/linux/waitpid.c: No such file or directory.
No symbol "frame" in current context.
[Inferior 1 (process 27300) detached]
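The assertion expression itself just states a bookkeeping invariant: a cache cell must still carry the sequence id that an operation targets. As a purely illustrative toy model (not llama.cpp's actual data structures or control flow), the following shows how repeating a sequence-removal-style operation against the same cell trips exactly this kind of membership assert:

class Cell:
    def __init__(self):
        self.seq_ids = set()  # sequence ids that currently reference this cell

    def has_seq_id(self, seq_id):
        return seq_id in self.seq_ids

def seq_rm(cells, seq_id, p0, p1):
    # Drop seq_id from every cell in [p0, p1); asserts the same membership
    # invariant that GGML_ASSERT(cell.has_seq_id(seq_id)) enforces.
    for cell in cells[p0:p1]:
        assert cell.has_seq_id(seq_id), "cell.has_seq_id(seq_id) failed"
        cell.seq_ids.discard(seq_id)

cells = [Cell()]
cells[0].seq_ids.add(0)
seq_rm(cells, 0, 0, 1)  # first request's "kv cache rm [p0, end)": fine
seq_rm(cells, 0, 0, 1)  # second request: the cell no longer carries seq id 0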

Labels

bug-unconfirmed, medium severity