
Bug: "main : failed to eval" with Self-extend and small context #8570

Closed
rhvall opened this issue Jul 18, 2024 · 7 comments
Labels
bug-unconfirmed, medium severity

Comments


rhvall commented Jul 18, 2024

What happened?

I have been playing with the context window and have been running into issues with the "Llama-3-Smaug-q2_k.gguf" model. When I run llama-cli with that model using the default settings, as in the command below, the program behaves as expected:

out/bin/llama-cli -m $MODEL -ngl 99 -c 1024 -b 256 --repeat_penalty 1.1 --color -i -r "User:" -f ./prompts/chat-with-bob.txt --override-kv tokenizer.ggml.pre=str:llama3

However, when Self-Extend is enabled (-gan/-gaw) in interactive mode, it crashes after a while (once the context size is exceeded) with "main : failed to eval". Here is the command:

out/bin/llama-cli -m $MODEL -ngl 99 -c 1024 -b 256 --repeat_penalty 1.1 --color -i -r "User:" -f ./prompts/chat-with-bob.txt --override-kv tokenizer.ggml.pre=str:llama3 -gan 2 -gaw 256

Below is the relevant log output after asking these questions:

explain how to create pancakes step by step
what about cakes?
explain how to create a video in blender3D
what else can I do in that software?
are there alternatives to it?
explain why the middle east is a really conflicting place
which are the most conflicting countries?
what are the requirements to become president of Bulgaria
continue

Also, I noticed that "examples/passkey" has a different implementation of the Self-Extend code than "examples/main" does. Which one is the correct one?

Thanks for your help.

Name and Version

llama-cli -v
version: 3392 (bda62d7)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.5.0

What operating system are you seeing the problem on?

Mac

Relevant log output

main: build = 3392 (bda62d79)
main: built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.5.0
main: seed  = 1721308668
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from ../models/Meta-Llama-3-8B-Instruct-v2.Q2_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = models
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 10
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,128256]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q2_K:  129 tensors
llama_model_loader: - type q3_K:   64 tensors
llama_model_loader: - type q4_K:   32 tensors
llama_model_loader: - type q6_K:    1 tensors
validate_override: Using metadata override (  str) 'tokenizer.ggml.pre' = llama3
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
validate_override: Using metadata override (  str) 'tokenizer.ggml.pre' = llama3
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q2_K - Medium
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 2.95 GiB (3.16 BPW) 
llm_load_print_meta: general.name     = models
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.27 MiB
ggml_backend_metal_log_allocated_size: allocated buffer, size =  2860.02 MiB, ( 2860.08 / 49152.00)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   164.39 MiB
llm_load_tensors:      Metal buffer size =  2860.00 MiB
...................................................................................
llama_new_context_with_model: n_ctx      = 1024
llama_new_context_with_model: n_batch    = 256
llama_new_context_with_model: n_ubatch   = 256
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
llama_kv_cache_init:      Metal KV buffer size =   128.00 MiB
llama_new_context_with_model: KV self size  =  128.00 MiB, K (f16):   64.00 MiB, V (f16):   64.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:      Metal compute buffer size =   129.25 MiB
llama_new_context_with_model:        CPU compute buffer size =     5.00 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2

system_info: n_threads = 8 / 10 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | 
main: interactive mode on.
Reverse prompt: 'User:'
sampling: 
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 1024, n_batch = 256, n_predict = -1, n_keep = 1
self-extend: n_ctx_train = 8192, grp_attn_n = 2, grp_attn_w = 256


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.

User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:explain how to create pancakes step by step
Bob: Ahaha! I'd be happy to help you with that. Here's a simple recipe to make delicious pancakes:

Step 1: In a bowl, whisk together 1 cup of milk, 2 eggs, and 1 tablespoon of oil.
Step 2: Add 1/2 teaspoon baking powder, salt, and vanilla extract. Mix well.
Step 3: Heat a non-stick pan over medium heat with butter or cooking spray.
Step 4: When the pan is hot, scoop about 1/4 cup of batter onto the pan.
Step 5: Cook for about 2 minutes and flip the pancake.
Step 6: Cook for another minute and serve warm.

Would you like any variations on this recipe? Bob
User:what about cakes?
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving
Bob: Ahaha! You must be referring to birthday cakes. I'd be happy to help you with that too!
Here's a simple recipe for a basic cake:

Step 1: Preheat oven to 350°F (175°C).
Step 2: Mix together 1 cup of sugar, 1/2 cup brown sugar, and 1 teaspoon baking powder in a bowl.
Step 3: Add eggs one by one while mixing well after each addition.
Step 4: In another bowl, mix together flour, butter, and buttermilk.
Step 5: Combine the wet and dry ingredients together. Mix until smooth.
Step 6: Pour batter into greased pans and bake for 25-35 minutes.

Do you have any questions about this recipe? Bob
User:explain how to create a video in blender3D
Bob: Ahaha! I'd be happy to help you with that. Here's a simple guide on creating a video using Blender:

Step 1: Download and install Blender from their official website.
Step 2: Launch Blender and create a new project by clicking the "New" button.
Step 3: In the "Project Settings," choose the resolution, frame rate, and format you wantggml_backend_sched_alloc_splits: failed to allocate graph, reserving
 for your video.
Step 4: Choose a camera angle or object to render.
Step 5: Use keyframes to control animations.

Would you like any tips on rendering? Bob
User:what else can I do in that software?
Bob: Ahaha! You're just getting started with Blender. Here are some things you can do:

1. Create 3D models and animations using Blender's built-in tools.
2. Render images, videos, or even animated GIFs.
3. Use the software for 2D and 3D rendering tasks.

You have so many creative possibilities with Blender! Bob
User:are there alternatives to it?
Bob: Ahaha! Yes, there are alternative software options available. Here are a few:

1. LightWave 3D
2. Maya
3. Cinema4D

These are popular alternatives you can consider.

Would you like any more information on those? Bob
User:explain why the middle east is a really conflicting place
Bob: Ahaha! Let me help you understand that. The Middle East has been a complex and volatile region with many historical conflicts.

Step 1: Identify the factors contributing to tensions, such as religion, politics, resources, and territorial disputes.
Step 2ggml_backend_sched_alloc_splits: failed to allocate graph, reserving
: Recognize the complexity of these issues, including the roles of various actors like governments, terrorist groups, and foreign powers.
Step 3: Understand that each side has a different perspective on events.

Would you like any more information on this topic? Bob
User:which are the most conflicting countries?
Bob: Ahaha! According to various sources, some of the most conflicting countries in the Middle East include:

1. Israel and Palestine
2. Iran and Saudi Arabia
3. Iraq and Turkey

These have been hotspots for years, with ongoing disputes.

Would you like any more information on these? Bob
User:what are the requirements to become president of Bulgaria
Bob: Ahaha! To become President of Bulgaria, you'll need:

1. Bulgarian citizenship
2. Age at least 35 and not exceeding 65 years old
3. Residency in Bulgaria for at least three years prior to election

You can find more information on these requirements here.

Would you like any more details? Bob
User:continue
Bob: Ahaha! You're welcome to continue exploring topics, Bob. Just type your next question or topic you'd like assistance with, and I'll do my best to help. Have any more questions? Bob
User:
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving
main : failed to eval
rhvall added the bug-unconfirmed and medium severity labels on Jul 18, 2024

rhvall commented Jul 24, 2024

Checking out the PRs on llama.cpp that implemented Self-Extend:
#4810
#4815
#5104

And checking the original paper:
https://arxiv.org/pdf/2401.01325.pdf

And taking a look at the pseudocode:
[Screenshot: SelfExtend pseudocode from the paper]

I was wondering the following:

  • Where in the code is the merge of the normal attention with the grouped attention applied?
  • Where do we apply the softmax operation for the attention_weights?

ggerganov (Owner) commented

The SelfExtend implementation in llama.cpp is not exactly the same as the reference implementation. I think the results should be the same or similar, but not 100% sure. The llama_kv_cache_ calls are used to modify the positions of the tokens in the K cache via RoPE shifts, after the normal attention is computed. This functionality is used to emulate the grouped attention.

The 2 implementations in main and passkey should be doing the same stuff, though I'm looking at the code now and realize it's quite confusing. Need to reimplement this at some point.

Not sure why you are crashing though - that's strange
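
For reference, the grouped-attention update in examples/main at that time looked roughly like the sketch below. This is a paraphrase of that code path, not an exact copy, so treat the variable names and exact offsets as approximate: whenever n_past passes ga_i + ga_w, the cached K positions are shifted and divided so that the attention sees positions compressed by the group factor.

#include "llama.h"

// Sketch of the Self-Extend (grouped attention) position update, paraphrased
// from examples/main of this era. ga_n = grp_attn_n (group factor),
// ga_w = grp_attn_w (group window); n_past and ga_i are the running token
// position and group index kept by the example.
static void self_extend_shift(llama_context * ctx, int & n_past, int & ga_i,
                              const int ga_n, const int ga_w) {
    while (n_past >= ga_i + ga_w) {
        const int ib = (ga_n * ga_i) / ga_w;
        const int bd = (ga_w / ga_n) * (ga_n - 1);
        const int dd = (ga_w / ga_n) - ib * bd - ga_w;

        // shift the cached positions after ga_i forward, compress one window
        // of positions by ga_n (the grouped attention), then shift the rest
        // back into place
        llama_kv_cache_seq_add(ctx, 0, ga_i,                  n_past,                ib * bd);
        llama_kv_cache_seq_div(ctx, 0, ga_i + ib * bd,        ga_i + ib * bd + ga_w, ga_n);
        llama_kv_cache_seq_add(ctx, 0, ga_i + ib * bd + ga_w, n_past + ib * bd,      dd);

        n_past -= bd;
        ga_i   += ga_w / ga_n;
    }
}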


rhvall commented Jul 24, 2024

Thanks @ggerganov for taking a look at this. I will check those RoPE shifts.

Regarding the crash, I ran llama-cli again with the following command, inspired by your example in PR #4815:

out/bin/llama-cli -m "../models/Meta-Llama-3-8B-Instruct-v2.Q2_K.gguf" \
  -f "./prompts/chat-with-bob.txt" \
  -n 256 -c 16384 -s 1 \
  --grp-attn-n 4 --grp-attn-w 2048 \
  -i -r "User:" --color --repeat-penalty 1.0 \
  --override-kv tokenizer.ggml.pre=str:llama3

I let it run for a while, seeing this print multiple times:

n_past_old = 2048, n_past = 512, ga_i = 512
...
n_past_old = 2560, n_past = 1024, ga_i = 1024
...
n_past_old = 3072, n_past = 1536, ga_i = 1536
...
n_past_old = 3584, n_past = 2048, ga_i = 2048
...
n_past_old = 4096, n_past = 2560, ga_i = 2560
...
n_past_old = 4608, n_past = 3072, ga_i = 3072
...
n_past_old = 5120, n_past = 3584, ga_i = 3584
...
n_past_old = 5632, n_past = 4096, ga_i = 4096
...

ggml_backend_sched_alloc_splits: failed to allocate graph, reserving
main : failed to eval

Also, I was printing "n_past" and "llama_get_kv_cache_token_count":

NPast: 1140, kvCount: 1140
NPast: 1508, kvCount: 1508
NPast: 1509, kvCount: 1509
NPast: 565, kvCount: 2101
NPast: 821, kvCount: 2357
NPast: 1077, kvCount: 2613
NPast: 1333, kvCount: 2869
NPast: 1589, kvCount: 3125
NPast: 1845, kvCount: 3381
NPast: 2101, kvCount: 3637
NPast: 2357, kvCount: 3893
NPast: 1077, kvCount: 4149
NPast: 1333, kvCount: 4405
NPast: 1589, kvCount: 4661
NPast: 1845, kvCount: 4917
NPast: 2101, kvCount: 5173
NPast: 2357, kvCount: 5429
NPast: 2613, kvCount: 5685
NPast: 2869, kvCount: 5941
NPast: 1589, kvCount: 6197
NPast: 1845, kvCount: 6453
NPast: 2101, kvCount: 6709
NPast: 2357, kvCount: 6965
NPast: 2613, kvCount: 7221
NPast: 2869, kvCount: 7477
NPast: 3125, kvCount: 7733
NPast: 3381, kvCount: 7989
NPast: 2101, kvCount: 8245
NPast: 2357, kvCount: 8501
NPast: 2613, kvCount: 8757
NPast: 2869, kvCount: 9013
NPast: 3125, kvCount: 9269
NPast: 3381, kvCount: 9525
NPast: 3637, kvCount: 9781
NPast: 3893, kvCount: 10037
NPast: 2613, kvCount: 10293
NPast: 2869, kvCount: 10549
NPast: 3125, kvCount: 10805
NPast: 3381, kvCount: 11061
NPast: 3637, kvCount: 11317
NPast: 3893, kvCount: 11573
NPast: 4149, kvCount: 11829
NPast: 4405, kvCount: 12085
NPast: 3125, kvCount: 12341
NPast: 3381, kvCount: 12597
NPast: 3637, kvCount: 12853
NPast: 3893, kvCount: 13109
NPast: 4149, kvCount: 13365
NPast: 4405, kvCount: 13621
NPast: 4661, kvCount: 13877
NPast: 4917, kvCount: 14133
NPast: 3637, kvCount: 14389
NPast: 3893, kvCount: 14645
NPast: 4149, kvCount: 14901
NPast: 4405, kvCount: 15157
NPast: 4661, kvCount: 15413
NPast: 4917, kvCount: 15669
NPast: 5173, kvCount: 15925
NPast: 5429, kvCount: 16181

As you can see, NPast changes with the Self-Extend code. However, kvCount keeps growing, and the program seems to crash once it exceeds "n_ctx".

ggerganov (Owner) commented

Ah wait, if you exceed the -c argument, it is expected to crash.

When not using SelfExtend, you can exceed the -c value (this is effectively the size of the KV cache buffer in terms of tokens) because in that case, we do "context shift" (again utilizing a RoPE shift) to "forget" some of the oldest tokens in the KV cache and free space for new tokens.

When SelfExtend is enabled, we can no longer exceed the -c value, and this error is expected if you do. We cannot exceed it because, due to the modified attention, we can no longer perform the "context shift" trick.
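
For comparison, the "context shift" path (the one that cannot be combined with SelfExtend) boils down to something like the following sketch, again paraphrased from examples/main rather than quoted verbatim; n_keep stands in for the --keep parameter:

#include "llama.h"

// Sketch of the "context shift" performed when Self-Extend is NOT active
// (paraphrased from examples/main, not an exact transcript). Once n_past
// would exceed n_ctx, half of the non-kept tokens are discarded and the
// surviving cache entries are shifted back via a RoPE shift.
static void context_shift(llama_context * ctx, int & n_past, const int n_keep) {
    const int n_left    = n_past - n_keep;
    const int n_discard = n_left / 2;

    // drop the oldest non-kept tokens, then move the remaining entries back
    // so their positions stay contiguous
    llama_kv_cache_seq_rm (ctx, 0, n_keep,             n_keep + n_discard);
    llama_kv_cache_seq_add(ctx, 0, n_keep + n_discard, n_past, -n_discard);

    n_past -= n_discard;
}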


rhvall commented Jul 25, 2024

If I understand correctly, as it currently stands, the KV cache can't mix SelfExtend with the basic "context shift". This would prevent its use for chat applications, right?

ggerganov (Owner) commented

I wouldn't say it prevents chat applications - any application that goes beyond the training context of the model relies on some sort of hack. With context shift, you lose some of the old context when you go beyond the limit. With SE, you seemingly extend the size of the context, but it's not without its own deficiencies.

For example, with 8192 training context, you can:

  • Chat without any issues up to 8192 tokens
  • With context shift, chat up to, let's say, 32k tokens, relying on multiple shifts and "forgetting" parts of the initial conversation
  • With SE 4x extended context, chat up to 32k tokens without any shifts and probably not 100% perfect memory

So you can use either strategy based on your use case. As long as you are within the training context there will be no issues. Beyond that, it might or might not work.
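
As a rough back-of-the-envelope for the numbers in this comment (my own sketch; the 32k figures simply come from multiplying the 8192 training context by the group factor):

#include <cstdio>

// Back-of-the-envelope comparison of the three strategies described above,
// using the numbers from the comment (8192 training context, SE factor 4).
// This is an illustration, not a formula taken from llama.cpp.
int main() {
    const int n_ctx_train = 8192; // model training context
    const int grp_attn_n  = 4;    // Self-Extend group factor (-gan)

    std::printf("plain chat    : up to %d tokens\n", n_ctx_train);
    std::printf("self-extend   : roughly up to %d tokens\n", n_ctx_train * grp_attn_n);
    std::printf("context shift : no hard limit, but old tokens are forgotten\n");
    return 0;
}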


rhvall commented Jul 25, 2024

Thanks for the in-depth explanation. You are awesome.

I will close this issue, given that the current behavior of llama-cli is expected: exceeding the context length with Self-Extend is not valid usage.

The only remaining part would be the code differences between the passkey and llama-cli examples when using Self-Extend, but that could be part of another issue.

rhvall closed this as completed Jul 25, 2024