llama: implement YaRN RoPE scaling #2268
Conversation
ce59171 to f3b9eae (Compare)
Is there any guide for setting the extrapolation and NTK parameters? How do they interact with the previous two parameters?
The upstream NTKv2 doesn't use --rope-freq-base, so it probably doesn't make sense to use it. It does use --rope-freq-scale, which works like linear scaling, and is supposed to be calibrated so that e.g. 0.25 scale actually gives you 8192 context. To use the default NTKv2, you should set --rope-ntk-factor and --rope-extrapolation-factor to 1, and set --rope-freq-scale appropriately. The lower the factors are, the less the respective scaling methods are mixed in, although I believe the graphs have been generated with both at 100% - the code automatically ramps them based on some experimentally determined thresholds.
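As a concrete illustration of the `--rope-freq-scale` arithmetic described above, here is a minimal sketch; the 2048 original context is an assumption for a LLaMA-1 base model, not something read from the model file:

```cpp
#include <stdio.h>

// With linear scaling, the effective context is roughly the original training
// context divided by the scale factor (assuming a 2048-token base model).
int main() {
    const int   original_ctx  = 2048;
    const float freq_scale    = 0.25f;  // --rope-freq-scale 0.25
    const int   effective_ctx = (int)(original_ctx / freq_scale);
    printf("effective context: %d\n", effective_ctx);  // prints 8192
    return 0;
}
```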
I would appreciate help with the following:
Rename `extrapolation_factor` to `ext_factor` everywhere. No need for a backward implementation for now.
Perplexity with NTKv2 may be worse because neither is the dynamic version, which AFAIK works better on non-finetuned models. But fine-tuned models are far superior anyway. NTKv1 does not converge when fine-tuning, which is why NTKv2 exists. So until somebody publishes a model fine-tuned with NTKv2 (maybe LLongMAv2 will be released after jquesnelle publishes the paper based on scaled-rope), the existing LLongMA, which uses regular linear interpolation (just like SuperHOT), is the state of the art for long contexts.
The paper has been released. The resulting method is called YaRN. Apparently the models that use this technique are good to about 120k tokens of context. More work will definitely be needed to use these models with llama.cpp.
Thank you for the llama.cpp implementation of YaRN! I'm just letting you know that the constant `float max_pos_emb = 2048;` should be changed to 4096 for Llama 2 models when using YaRN (the default was 2048 because we did most of our testing with LLaMA 1 models).
Thanks for reminding me. I originally made this PR before GGUF was finished, so I hardcoded it in the meantime. I believe I can now use the value of
Would it be worth testing this with non-YaRN fine-tuned models? If so, any suggested settings? I can test it with ROCm.
This needs to be a new GGUF KV, something like `rope_yarn_orig_ctx`.
Exactly. After fine-tuning a model with YaRN, we have to keep track of two values: the original context length (2048 for LLaMA or 4096 for Llama 2) and the final context length (which can be calculated by multiplying the original context length by the scale factor, e.g. 4096 x 32 = 128Ki). In this case, the constant
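A minimal sketch of the bookkeeping described above. The key name follows the `rope_yarn_orig_ctx` suggestion; the struct and helper are illustrative, not an actual llama.cpp API:

```cpp
#include <cstdint>

// Illustrative only: the two values a YaRN-finetuned model needs to carry.
struct rope_yarn_params {
    uint32_t orig_ctx;   // context length of the base model (2048 for LLaMA, 4096 for Llama 2)
    float    scale;      // YaRN scale factor used during fine-tuning (e.g. 32)
    uint32_t final_ctx;  // context length after scaling
};

static rope_yarn_params make_yarn_params(uint32_t orig_ctx, float scale) {
    rope_yarn_params p;
    p.orig_ctx  = orig_ctx;
    p.scale     = scale;
    p.final_ctx = (uint32_t)(orig_ctx * scale);  // e.g. 4096 * 32 = 131072 (128Ki)
    return p;
}
```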
I'd be fine with that solution. Would you like to make a PR?
edit: For some reason, I can't reproduce this on Linux with clang or gcc, or on an M2 Mac, at least on CPU.
edit 2: I can't build llama.cpp with Metal on my Mac:
Seems like a bug in the Xcode-provided clang 15?
#2268 (comment) - this seems to fix my problem. Really weird that it only has an effect when offloading that last non-repeating layer.
@cebtenzzre thanks for pushing the PR. Now I'm testing https://huggingface.co/TheBloke/Yarn-Mistral-7B-64k-GGUF and I'm getting
so something must be wrong, as the base model has
The GGUF is recognized correctly
and
The Metal issue is a simple fix: #3937
Found that Mistral 7B YaRN 128k has been released. (Meanwhile, it seems 320 GB of VRAM is needed for 128k context.)
More like 16 GB. Where did you get this number from?
According to @bloc97 in the model discussion; he's one of the model team members, if I am correct.
The big discrepancy in the numbers probably stems from us not properly implementing Mistral's context window shenanigans.
I could be missing something, but if we implemented the Mistral SWA thing, we would require even less memory.
Yes, I've also heard that Mistral relies heavily on Sliding Window Attention even for 4K context. So for best performance, it really should be implemented.
If you mean the per-layer stuff, the information needed to implement it really doesn't exist, and their code examples don't include it. Also, they didn't respond to issues in their repo asking for clarification, so...
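For context, the plain sliding-window mask from the Mistral paper is simple to state. A minimal sketch, assuming a fixed window size and not attempting the per-layer behaviour discussed above:

```cpp
// Illustrative sketch only: a query at position q_pos may attend to a key at
// k_pos if the key is not in the future and lies within the last 'window'
// positions (Mistral 7B's published window is 4096).
static bool swa_allowed(int q_pos, int k_pos, int window) {
    if (k_pos > q_pos) {
        return false;                // causal: no attending to future tokens
    }
    return (q_pos - k_pos) < window; // within the sliding window
}
```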
Hugging Face and PyTorch modeling code is much less VRAM-efficient than llama.cpp because it has to take into account both training and inference use cases (e.g. arbitrarily shaped attention masking) and expose internal values to allow PEFT training. In these scenarios, the KV cache is extremely inefficient and the models' internal states are also kept, making inference use a huge amount of VRAM. It is possible to rewrite the Llama and Mistral inference code with custom kernels in PyTorch, but it would break compatibility with all other features (e.g. what is done by ExLlama or vLLM).
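As a back-of-the-envelope check on the memory numbers above, here is a rough KV-cache estimate. The hyperparameters are assumptions taken from Mistral 7B's published config (32 layers, 8 KV heads via GQA, head dim 128), with an fp16 cache and no sliding window:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t n_layer   = 32;      // assumed: Mistral 7B layer count
    const uint64_t n_kv_head = 8;       // assumed: GQA key/value heads
    const uint64_t head_dim  = 128;     // assumed: per-head dimension
    const uint64_t n_ctx     = 131072;  // 128k context
    const uint64_t bytes_f16 = 2;       // fp16 K and V entries

    // 2x for the K and V tensors.
    const uint64_t kv_bytes = 2 * n_layer * n_ctx * n_kv_head * head_dim * bytes_f16;
    printf("KV cache: %.1f GiB\n", kv_bytes / (1024.0 * 1024.0 * 1024.0)); // ~16.0 GiB
    return 0;
}
```

Under those assumptions the result lines up with the ~16 GB figure mentioned above.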
* fix backward process of rope

  The rope backward process was broken after the YaRN RoPE (#2268) implementation, due to missing changes in the backward functions. The code for the backward process is nearly identical to the forward process: the only difference is the sign of the sin-values. To avoid future regressions, remove the near-duplicate backward functions and reuse the forward code: for this, a new function argument `bool forward` was added to `ggml_compute_forward_rope_f32` and `ggml_compute_forward_rope_f16`. The sin-values are negated when forward is false.

* fix finetune rope call to use correct default attn_factor of 1.0f

* remove unused `ggml_rope_xpos_back`

  It is better to have only one `ggml_rope_back` function that accepts all rope parameters, so that `ggml_compute_backward` can propagate all parameters without having to switch between different rope_back variants.

* fix comments explaining the sine sign in ggml_forward_rope

* add missing function arguments in declaration

* fix function argument type in declaration
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Jeffrey Quesnelle <jquesnelle@gmail.com>
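To illustrate the point of the fix above (the backward pass is the forward rotation with the sign of the sin term flipped), here is a standalone sketch; it is not the actual ggml kernel:

```cpp
#include <math.h>

// Apply RoPE to one row of n_dims values at position 'pos'. Passing
// forward = false reuses the same code for the backward pass: the rotation
// is inverted because only the sin term changes sign.
static void rope_row(float * x, int n_dims, int pos, float freq_base, bool forward) {
    const float sin_sign = forward ? 1.0f : -1.0f;
    for (int i = 0; i < n_dims; i += 2) {
        const float theta = pos * powf(freq_base, -(float) i / n_dims);
        const float c  = cosf(theta);
        const float s  = sinf(theta) * sin_sign;
        const float x0 = x[i];
        const float x1 = x[i + 1];
        x[i]     = x0 * c - x1 * s;
        x[i + 1] = x0 * s + x1 * c;
    }
}
```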
The NeoX cur_rot part is different because I'm pretty sure my original implementation was wrong.
This is an implementation of YaRN RoPE scaling. See https://github.com/jquesnelle/yarn and the paper and errata.
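For reference, a minimal sketch of the per-dimension "NTK-by-parts" mixing and attention scaling that YaRN describes, following the paper rather than this PR's exact kernels (alpha and beta are the paper's suggested thresholds for LLaMA; `s` is the context scale factor):

```cpp
#include <math.h>

// Illustrative only. 'd' is the rotary pair index (0 .. n_dims/2 - 1), 's' is
// the context extension factor (e.g. 32 for 4096 -> 128k), and 'orig_ctx' is
// the original training context length.
static void yarn_cos_sin(
        int pos, int d, int n_dims, float freq_base, float s, float orig_ctx,
        float * cos_out, float * sin_out) {
    const float inv_freq     = powf(freq_base, -2.0f * d / n_dims);
    const float theta_extrap = pos * inv_freq;    // unscaled angle (extrapolation)
    const float theta_interp = theta_extrap / s;  // linear position interpolation

    // How many full rotations this dimension completes over the original context.
    const float wavelength = 2.0f * 3.14159265f / inv_freq;
    const float rotations  = orig_ctx / wavelength;

    // Ramp from pure interpolation (few rotations) to pure extrapolation (many rotations).
    const float alpha = 1.0f, beta = 32.0f;  // paper's suggested values for LLaMA
    const float ramp  = fminf(1.0f, fmaxf(0.0f, (rotations - alpha) / (beta - alpha)));

    const float theta  = (1.0f - ramp) * theta_interp + ramp * theta_extrap;
    const float mscale = 1.0f + 0.1f * logf(s);  // attention "temperature" correction

    *cos_out = cosf(theta) * mscale;
    *sin_out = sinf(theta) * mscale;
}
```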
TODO: