Server UI bug: corrupted generation #9836

Open
ivanstepanovftw opened this issue Oct 11, 2024 · 0 comments
Labels
medium severity, server/webui, server, stale

Comments

Collaborator

ivanstepanovftw commented Oct 11, 2024

What happened?

The server somehow corrupts the prompt: the tokens at the end of every line are lost.

Here is how I run the server:

./build/bin/llama-server -m ~/Downloads/qwen2.5-7b-instruct-q4_0-00001-of-00002.gguf

Here is how I test the CLI to confirm that this is a server bug:

./build/bin/llama-cli -m ~/Downloads/qwen2.5-7b-instruct-q4_0-00001-of-00002.gguf -e -p "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nHi\!<|im_end|>\n<|im_start|>assistant\nHow can I assist you today?<|im_end|>\n<|im_start|>user\nImplement fibbonaci in Python<|im_end|>\n<|im_start|>assistant\n" -n 128 -t 7 -tb 8 --temp 0
Here is the output from the CLI:

➜  llama.cpp git:(master) ✗ ./build/bin/llama-cli -m ~/Downloads/qwen2.5-7b-instruct-q4_0-00001-of-00002.gguf -e -p "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nHi\!<|im_end|>\n<|im_start|>assistant\nHow can I assist you today?<|im_end|>\n<|im_start|>user\nImplement fibbonaci in Python<|im_end|>\n<|im_start|>assistant\n" -n 128 -t 7 -tb 8 --temp 0          
build: 3891 (d5cb8684) with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 29 key-value pairs and 339 tensors from /home/i/Downloads/qwen2.5-7b-instruct-q4_0-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = qwen2.5-7b-instruct
llama_model_loader: - kv   3:                            general.version str              = v0.1
llama_model_loader: - kv   4:                           general.finetune str              = qwen2.5-7b-instruct
llama_model_loader: - kv   5:                         general.size_label str              = 7.6B
llama_model_loader: - kv   6:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   7:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                          general.file_type u32              = 2
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                                   split.no u16              = 0
llama_model_loader: - kv  27:                                split.count u16              = 2
llama_model_loader: - kv  28:                        split.tensors.count i32              = 339
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_0:  197 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 28
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18944
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 7.62 B
llm_load_print_meta: model size       = 4.12 GiB (4.65 BPW) 
llm_load_print_meta: general.name     = qwen2.5-7b-instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors:        CPU buffer size =  3793.03 MiB
llm_load_tensors:        CPU buffer size =   427.40 MiB
.....................................................................................
llama_new_context_with_model: n_ctx      = 131072
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  7168.00 MiB
llama_new_context_with_model: KV self size  = 7168.00 MiB, K (f16): 3584.00 MiB, V (f16): 3584.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.58 MiB
llama_new_context_with_model:        CPU compute buffer size =  7452.01 MiB
llama_new_context_with_model: graph nodes  = 986
llama_new_context_with_model: graph splits = 1
llama_init_from_gpt_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 7

system_info: n_threads = 7 (n_threads_batch = 8) / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 

sampler seed: 4294967295
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> greedy 
generate: n_ctx = 131072, n_batch = 2048, n_predict = 128, n_keep = 0

system
You are a helpful assistant.
user
Hi!
assistant
How can I assist you today?
user
Implement fibbonaci in Python
assistant
Sure! Here are a few ways to implement the Fibonacci sequence in Python:

1. **Iterative Approach:**
   ```python
   def fibonacci(n):
       if n <= 0:
           return []
       elif n == 1:
           return [0]
       elif n == 2:
           return [0, 1]
       
       fib_sequence = [0, 1]
       for i in range(2, n):
           next_value = fib_sequence[-1] + fib_sequence[-2]
           fib_sequence.append(next_value)
       return fib_sequence

   # Example usage
   print(fibonacci(

llama_perf_sampler_print:    sampling time =      25.14 ms /   172 runs   (    0.15 ms per token,  6840.33 tokens per second)
llama_perf_context_print:        load time =   27227.67 ms
llama_perf_context_print: prompt eval time =    6480.76 ms /    44 tokens (  147.29 ms per token,     6.79 tokens per second)
llama_perf_context_print:        eval time =   20080.14 ms /   127 runs   (  158.11 ms per token,     6.32 tokens per second)
llama_perf_context_print:       total time =   26704.27 ms /   171 tokens
Time: 0h:00m:56s                                                                                                                                                
➜  llama.cpp git:(master) ✗ 
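
The CLI test and the endpoint test in this report are meant to use the same ChatML prompt. A minimal sketch of building that prompt from a message list, so both tests are fed identical text, could look like the following. The helper name `build_chatml_prompt` is hypothetical and not part of llama.cpp; the misspelling "fibbonaci" is kept on purpose because it is part of the prompt used in this report.

```python
def build_chatml_prompt(messages):
    """Render (role, content) pairs as ChatML, ending with an open assistant turn."""
    parts = [
        f"<|im_start|>{role}\n{content}<|im_end|>\n"
        for role, content in messages
    ]
    # Leave the assistant turn open so the model generates the reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    ("system", "You are a helpful assistant."),
    ("user", "Hi!"),
    ("assistant", "How can I assist you today?"),
    ("user", "Implement fibbonaci in Python"),
]
prompt = build_chatml_prompt(messages)
print(repr(prompt))
```

The resulting string can be passed as the `prompt` field of a `/completion` request, or to `llama-cli -e -p` after shell escaping.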

Here is how I test the server endpoint to confirm that this is a UI bug:

import httpx

# Define the URL and the headers
url = 'http://localhost:8080/completion'
headers = {
    'Content-Type': 'application/json'
}

# Define the JSON payload with properly escaped newlines
data = {
    "prompt": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nHi!<|im_end|>\n<|im_start|>assistant\nHow can I assist you today?<|im_end|>\n<|im_start|>user\nImplement fibbonaci in Python<|im_end|>\n<|im_start|>assistant\n",
    "n_predict": 128
}

# Send the POST request using httpx with no timeout
response = httpx.post(url, json=data, headers=headers, timeout=None)

# Print the response from the server
print(response.json())

The response from the endpoint is valid:

{'content': 'Sure! Here are a few ways to implement the Fibonacci sequence in Python:\n\n1. **Iterative Approach:**\n   ```python\n   def fibonacci(n):\n       a, b = 0, 1\n       for _ in range(n):\n           a, b = b, a + b\n       return a\n\n   # Example usage\n   n = 10\n   print(f"Fibonacci({n}) = {fibonacci(n)}")\n   ```\n\n2. **Recursive Approach:**\n   ```python\n   def fibonacci(n):\n       if n <= 0:\n           return 0\n       elif n ==', 'id_slot': 0, 'stop': True, 'model': '/home/i/Downloads/qwen2.5-7b-instruct-q4_0-00001-of-00002.gguf', 'tokens_predicted': 128, 'tokens_evaluated': 44, 'generation_settings': {'n_ctx': 131072, 'n_predict': -1, 'model': '/home/i/Downloads/qwen2.5-7b-instruct-q4_0-00001-of-00002.gguf', 'seed': 4294967295, 'seed_cur': 3124811782, 'temperature': 0.800000011920929, 'dynatemp_range': 0.0, 'dynatemp_exponent': 1.0, 'top_k': 40, 'top_p': 0.949999988079071, 'min_p': 0.05000000074505806, 'tfs_z': 1.0, 'typical_p': 1.0, 'repeat_last_n': 64, 'repeat_penalty': 1.0, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'mirostat': 0, 'mirostat_tau': 5.0, 'mirostat_eta': 0.10000000149011612, 'penalize_nl': False, 'stop': [], 'max_tokens': 128, 'n_keep': 0, 'n_discard': 0, 'ignore_eos': False, 'stream': False, 'n_probs': 0, 'min_keep': 0, 'grammar': '', 'samplers': ['top_k', 'tfs_z', 'typ_p', 'top_p', 'min_p', 'temperature']}, 'prompt': '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nHi!<|im_end|>\n<|im_start|>assistant\nHow can I assist you today?<|im_end|>\n<|im_start|>user\nImplement fibbonaci in Python<|im_end|>\n<|im_start|>assistant\n', 'truncated': False, 'stopped_eos': False, 'stopped_word': False, 'stopped_limit': True, 'stopping_word': '', 'tokens_cached': 171, 'timings': {'prompt_n': 44, 'prompt_ms': 2533.391, 'prompt_per_token_ms': 57.577068181818184, 'prompt_per_second': 17.368025701520214, 'predicted_n': 128, 'predicted_ms': 17878.5, 'predicted_per_token_ms': 139.67578125, 
'predicted_per_second': 7.159437312973684}, 'index': 0}
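
Note that the response above reports `'temperature': 0.800000011920929`, while the CLI run used `--temp 0`, which likely explains why the two generated texts differ. For a like-for-like comparison, a sketch along these lines could request temperature 0 from the endpoint as well. The helper names are hypothetical, and I am assuming that temperature 0 makes the server sample greedily, as it does for the CLI.

```python
def make_greedy_payload(prompt, n_predict=128):
    """Build a /completion payload that asks for temperature 0, like the CLI's --temp 0."""
    return {"prompt": prompt, "n_predict": n_predict, "temperature": 0}


def complete(url, payload):
    """POST the payload and return the generated text from the `content` field."""
    import httpx  # imported here so make_greedy_payload works without httpx installed

    response = httpx.post(url, json=payload, timeout=None)
    response.raise_for_status()
    return response.json()["content"]


if __name__ == "__main__":
    payload = make_greedy_payload("<|im_start|>user\nHi!<|im_end|>\n<|im_start|>assistant\n")
    print(payload)
    # With a server running locally:
    # print(complete("http://localhost:8080/completion", payload))
```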

Here are screenshots:

Old web UI

image

New web UI

image

New web UI Chat

image

SimpleChat

image

llama-cli

image
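
To make the symptom checkable without screenshots, text copied from a UI can be compared line by line against the text returned by the endpoint. The sketch below flags lines whose displayed version is a strict prefix of the expected one, which is what "tokens at the end of the line are lost" would look like. It assumes both texts come from the same deterministic generation and that their lines align one to one; the function name and the sample strings are made up for illustration.

```python
def find_truncated_lines(expected, displayed):
    """Return (line_number, expected_line, displayed_line) for every displayed
    line that is a strict prefix of the corresponding expected line."""
    truncated = []
    pairs = zip(expected.splitlines(), displayed.splitlines())
    for number, (exp, shown) in enumerate(pairs, start=1):
        if shown != exp and exp.startswith(shown):
            truncated.append((number, exp, shown))
    return truncated


# Hypothetical sample: the endpoint text versus text copied from a UI.
expected = "def fibonacci(n):\n    a, b = 0, 1\n    return a"
displayed = "def fibonacci(n\n    a, b = 0, 1\n    return"

for number, exp, shown in find_truncated_lines(expected, displayed):
    print(f"line {number}: lost {exp[len(shown):]!r}")
```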

What is affected:

  • server ui
  • server new ui

Unaffected:

  • server endpoints
  • server SimpleChat
  • CLI
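
The endpoint test above used `'stream': False`. If the affected UIs consume the streamed response instead (an assumption, I have not verified it), testing the endpoint with `"stream": true` might narrow down whether the loss happens in the server's stream or in the UI code. A rough sketch, assuming each streamed event arrives as a `data: {...}` line whose JSON object carries a `content` field:

```python
import json


def join_stream_content(lines):
    """Concatenate the `content` fields of server-sent-event `data:` lines."""
    pieces = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank separator lines and anything that is not a data line
        event = json.loads(line[len("data: "):])
        pieces.append(event.get("content", ""))
    return "".join(pieces)


def stream_completion(url, payload):
    """Request a streamed completion and return the reassembled text."""
    import httpx  # imported here so join_stream_content works without httpx installed

    body = {**payload, "stream": True}
    with httpx.stream("POST", url, json=body, timeout=None) as response:
        response.raise_for_status()
        return join_stream_content(response.iter_lines())
```

If the reassembled text is complete while the UI still drops the ends of lines, the problem is more likely in how the UI renders or joins the chunks.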

Name and Version

version: 3891 (d5cb868)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

No response

Relevant log output

No response

ivanstepanovftw added the bug-unconfirmed, medium severity, server/webui, and server labels and removed the bug-unconfirmed label on Oct 11, 2024.
github-actions bot added the stale label on Nov 12, 2024.