
llama: implement YaRN RoPE scaling #2268

Merged: 36 commits, Nov 1, 2023

Conversation

@cebtenzzre (Collaborator) commented Jul 18, 2023

This is an implementation of YaRN RoPE scaling. See https://github.com/jquesnelle/yarn and the paper and errata.

TODO:

  • Add new GGUF key for how much context the base model was trained on
  • Support converting the new models to GGUF
  • Add backward implementations
  • Test new LLaMA implementation
  • Finish and test Falcon implementation

@cebtenzzre force-pushed the ntkv2 branch 3 times, most recently from ce59171 to f3b9eae, July 19, 2023 03:55
@cebtenzzre changed the title from "llama: implement NTK-By-Parts (NTKv2)" to "llama: implement NTK-By-Parts (NTKv2) RoPE scaling" Jul 19, 2023
@FNsi (Contributor) commented Jul 20, 2023

Is there any guide for setting the extrapolation and ntk parameters? How do they work with the previous two parameters?

@cebtenzzre (Collaborator, Author)

The upstream NTKv2 doesn't use --rope-freq-base, so it probably doesn't make sense to use it. It does use --rope-freq-scale, which works like linear scaling and is supposed to be calibrated so that e.g. a 0.25 scale actually gives you 8192 context. To use the default NTKv2, you should set --rope-ntk-factor and --rope-extrapolation-factor to 1, and set --rope-freq-scale appropriately. The lower the factors are, the less the respective scaling methods are mixed in, although I believe the graphs have been generated with both at 100%; the code automatically ramps them based on some experimentally determined thresholds.
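
For reference, here is a minimal sketch of the blending that the ramp performs, assuming the structure of the scaled-rope reference implementation rather than the exact ggml code in this PR (`low`/`high` stand in for the experimentally determined dimension thresholds):

```c
// A minimal sketch, not the exact ggml code in this PR: NTK-by-parts blends
// linearly interpolated RoPE angles with the unscaled (extrapolated) angles,
// per rotary dimension, using a linear ramp between two thresholds.
static float ntk_ramp(float low, float high, float x) {
    if (x <= low)  return 0.0f;
    if (x >= high) return 1.0f;
    return (x - low) / (high - low);   // linear ramp between the thresholds
}

// theta_extrap: unscaled RoPE angle for this dimension
// freq_scale:   e.g. 0.25 to target 4x the original context (linear scaling)
// ext_factor:   --rope-extrapolation-factor; 0 disables the blending entirely
static float ntk_mix_theta(float theta_extrap, float freq_scale,
                           float ext_factor, float ramp_val) {
    const float theta_interp = freq_scale * theta_extrap;  // plain linear scaling
    const float mix          = ramp_val * ext_factor;      // 0 = pure interpolation
    return theta_interp * (1.0f - mix) + theta_extrap * mix;
}
```

With both factors at 1 and --rope-freq-scale 0.25, the interpolated angles target roughly 4x the original context, and the ramp decides per dimension how much of the unscaled angle is blended back in.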

@cebtenzzre marked this pull request as ready for review July 21, 2023 22:04
@cebtenzzre (Collaborator, Author)

I would appreciate help with the following:

  • Should I try to write a backwards implementation? NTKv1 still doesn't have one, so I don't have much to base it on.
  • I don't have a Mac to test the Metal code on. If anyone sees obvious flaws or can test it locally, let me know.
  • I'm going to try to run a perplexity benchmark against NTKv1 and linear scaling, but I don't know if my current hardware is up to the task.

@ggerganov (Owner) left a comment

Rename extrapolation_factor to ext_factor everywhere

ggml.c: review thread resolved (outdated)
@ggerganov (Owner)

No need for backwards implementation for now

@cebtenzzre

This comment was marked as outdated.

@cebtenzzre (Collaborator, Author)

Perplexity with NTKv2 may be worse because neither implementation here is the dynamic version, which AFAIK works better on non-finetuned models. But fine-tuned models are far superior anyway.

NTKv1 does not converge when fine-tuning, which is why NTKv2 exists. So until somebody publishes a model fine-tuned with NTKv2—maybe LLongMAv2 will be released after jquesnelle publishes the paper based on scaled-rope—the existing LLongMA, which uses regular linear interpolation (just like SuperHOT), is the state-of-the-art for long contexts.

@cebtenzzre (Collaborator, Author) commented Aug 31, 2023

The paper has been released. The resulting method is called YaRN. Apparently the models that use this technique are good to about 120k tokens of context.

More work will definitely be needed to use these models with llama.cpp.

@cebtenzzre changed the title from "llama: implement NTK-By-Parts (NTKv2) RoPE scaling" to "llama: implement YaRN RoPE scaling" Sep 5, 2023
@cebtenzzre

This comment was marked as resolved.

@bloc97 commented Sep 6, 2023

Thank you for the llamacpp implementation of YaRN!

I'm just letting you know that

constant float max_pos_emb = 2048;

should be changed to 4096 for Llama 2 models when using YaRN (the default was 2048 because we did most of our tests with LLaMA 1 models).
This value should probably be saved inside the model config and loaded at inference time...

@cebtenzzre (Collaborator, Author)

should be changed to 4096 for Llama 2 models

Thanks for reminding me. I originally made this PR before GGUF was finished, so I hardcoded it in the meantime. I believe I can now use the value of llama.context_length for this purpose.

@KerfuffleV2 (Collaborator)

Would it be worth testing this with non-YaRN fine-tuned models? If so, any suggested settings? I can test it with ROCM.

@Green-Sky (Collaborator) commented Sep 6, 2023

Thank you for the llamacpp implementation of YaRN!

I'm just letting you know that

constant float max_pos_emb = 2048;

should be changed to 4096 for Llama 2 models when using YaRN (the default was 2048 because we did most of our tests with LLaMA 1 models). This value should probably be saved inside the model config and loaded at inference time...

This needs to be a new GGUF kv, something like "rope_yarn_orig_ctx".

Thanks for reminding me. I originally made this PR before GGUF was finished, so I hardcoded it in the meantime. I believe I can now use the value of llama.context_length for this purpose.

llama.context_length should be the size of the fine-tune, e.g. 128Ki.

@cebtenzzre marked this pull request as draft September 6, 2023 15:50
@bloc97 commented Sep 6, 2023

this needs to be a new GGUF kv, something like "rope_yarn_orig_ctx"

Exactly. After fine-tuning a model with YaRN, we have to keep track of two values: the original context length (2048 for LLaMA or 4096 for Llama 2) and the final context length (which can be calculated by multiplying the original context length by the scale factor, e.g. 4096 x 32 = 128Ki).

In this case, the constant `constant float max_pos_emb = 2048;` used in the equations must be equal to the original context size, not the final context size.
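
A minimal sketch of the lookup being discussed (the key and function names here are illustrative, not necessarily the final GGUF spelling): prefer a dedicated original-context value and fall back to the model's trained context length only when that key is absent.

```c
#include <stdint.h>

// Illustrative only: pick the context length that YaRN's equations should use
// as the "original" (pre-scaling) context. A dedicated key such as the
// suggested rope_yarn_orig_ctx takes priority; otherwise fall back to the
// model's trained context length.
static uint32_t yarn_orig_ctx(uint32_t rope_yarn_orig_ctx /* 0 if not set */,
                              uint32_t n_ctx_train) {
    // e.g. a Llama 2 model fine-tuned to 128Ki: orig = 4096, n_ctx_train = 131072
    return rope_yarn_orig_ctx != 0 ? rope_yarn_orig_ctx : n_ctx_train;
}
```

Whatever this returns is what would replace the hardcoded `max_pos_emb = 2048` constant in the kernels.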

@cebtenzzre (Collaborator, Author) commented Nov 2, 2023

If ext_factor would never go negative,

I'd be fine with that solution. Would you like to make a PR?

edit: For some reason, I can't reproduce this on Linux with clang or gcc, or on an M2 Mac, at least on CPU.

edit 2: I can't build llama.cpp with Metal on my Mac:

c++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_METAL  -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Ofast -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi  examples/main/main.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o console.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o -o main -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
0  0x102f3b648  __assert_rtn + 72
1  0x102e63c5c  ld::Fixup::applyFixup(ld::Atom const*, ld::LayoutLinkedImage const&, unsigned char*) const + 8268
2  0x102ef67d8  ___ZN2ld16LayoutExecutable27writeContentWithoutLinkEditENSt3__14spanIhLm18446744073709551615EEEy_block_invoke + 332
3  0x102ef6a14  void mapReduce<ld::Atom const*, mach_o::Error>(std::__1::span<ld::Atom const*, 18446744073709551615ul>, unsigned long, void (unsigned long, mach_o::Error&, std::__1::span<ld::Atom const*, 18446744073709551615ul>) block_pointer, void (std::__1::span<mach_o::Error, 18446744073709551615ul>) block_pointer) + 384
4  0x102ef6594  ld::LayoutExecutable::writeContentWithoutLinkEdit(std::__1::span<unsigned char, 18446744073709551615ul>, unsigned long long) + 1180
5  0x102efc020  ld::LayoutExecutable::writeToFile(char const*) + 15248
6  0x102eae2e8  main + 9424
ld: Assertion failed: (extras.otherInstrOffset != 0 && "Kind::arm64_adrp_ldr missing extra info"), function applyFixup, file Fixup.cpp, line 793.
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [main] Error 1

Seems like a bug in the Xcode-provided clang 15?

@KerfuffleV2 (Collaborator)

#2268 (comment) - this seems to fix my problem. Really weird that it only has an effect when offloading that last non-repeating layer.

@jxy (Contributor) commented Nov 3, 2023

@cebtenzzre thanks for pushing the PR.

Now I'm testing this https://huggingface.co/TheBloke/Yarn-Mistral-7B-64k-GGUF and I'm getting

$ ./perplexity -t 1 -ngl 1 -m models/yarn-mistral-7b-64k.Q8_0.gguf -c 512 -f ../wikitext-2-raw/wiki.test.raw 2>/dev/null
[1]24.7243,[2]31.1885,[3]36.5431,[4]41.0809,^C

so something must be wrong, as the base model has

$ ./perplexity -t 1 -ngl 1 -m models/mistral-7b-v0.1.Q8_0.gguf -c 512 -f ../wikitext-2-raw/wiki.test.raw 2>/dev/null   
[1]3.9958,[2]4.4960,[3]5.2987,[4]5.9971,^C

The GGUF is recognized correctly:

llm_load_print_meta: rope scaling     = yarn
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 0.125
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = yes

and

llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.125
llama_new_context_with_model: kv self size  =   64.00 MB

@jxy (Contributor) commented Nov 3, 2023

The Metal issue is a simple fix: #3937

@FNsi (Contributor) commented Nov 4, 2023

Found that Mistral 7B YaRN 128k has been released:

mistral 7b yarn

(Meanwhile it seems 320 GB of VRAM is needed for 128k ctx)

@ggerganov (Owner)

(Meanwhile it seems 320 GB of VRAM is needed for 128k ctx)

More like 16 GB. Where do you get this number from?
At least for vanilla Mistral with 8 KV heads, it's about 1 GB per 8k of context.
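
As a rough sanity check of that estimate, here is a back-of-the-envelope sketch assuming the commonly cited Mistral 7B shapes (32 layers, 8 KV heads, head dim 128) and an f16 cache; it is only the dominant term, not the exact llama.cpp allocation:

```c
#include <stdio.h>

// Back-of-the-envelope KV-cache size for vanilla Mistral 7B: assumes 32 layers,
// 8 KV heads, head_dim 128 and an f16 cache. Dominant term only, not the exact
// llama.cpp allocation.
int main(void) {
    const long n_layer   = 32;
    const long n_kv_head = 8;
    const long head_dim  = 128;
    const long n_ctx     = 8192;
    const long f16_bytes = 2;

    // factor of 2 for the K and V caches
    const long kv_bytes = 2 * n_layer * n_kv_head * head_dim * n_ctx * f16_bytes;
    printf("~%.2f GiB for %ld tokens of context\n",
           kv_bytes / (1024.0 * 1024.0 * 1024.0), n_ctx);   // prints ~1.00 GiB
    return 0;
}
```

Scaling that linearly to 128Ki tokens gives roughly 16 GiB, in line with the estimate above.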

@FNsi (Contributor) commented Nov 4, 2023

(Meanwhile it seems 320 GB of VRAM is needed for 128k ctx)

More like 16 GB. Where do you get this number from?

At least for vanilla Mistral with 8 KV heads, it's about 1 GB per 8k of context.

According to @bloc97 in the model discussion, and he is one of the model team, if I'm correct.

@Green-Sky (Collaborator)

The big number discrepancy probably stems from us not properly implementing Mistral's context window shenanigans.

@ggerganov (Owner)

I could be missing something, but if we implemented the Mistral SWA thing, we would require even less memory

@Dampfinchen commented Nov 6, 2023

The big number discrepancy probably stems from us not properly implementing Mistral's context window shenanigans.

Yes, also I've heard Mistral relies heavily on Sliding Window Attention even for 4K context.

So for best performance, it really should be implemented.
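
For context on what the window restriction itself looks like, here is a generic sketch (illustrative only; this is neither llama.cpp code nor Mistral's exact per-layer scheme, which as noted below isn't fully documented):

```c
// Illustrative only: a generic sliding-window attention mask. Query position i
// may attend to key position j only if j is causal (j <= i) and lies within
// the last `window` positions.
static int swa_can_attend(int i, int j, int window) {
    return j <= i && (i - j) < window;
}
```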

@KerfuffleV2 (Collaborator)

So for best performance, it really should be implemented.

If you mean the per-layer stuff, the information to implement it really doesn't exist, and their code examples don't include that. Also, they didn't respond to issues in their repo asking for clarification, so...

@bloc97 commented Nov 6, 2023

The big number discrepancy probably stems from us not properly implementing Mistral's context window shenanigans.

Hugging Face and PyTorch modeling code is much less VRAM-efficient than llama.cpp because it has to take into account both training and inference use cases (e.g. arbitrarily shaped attention masking) and expose internal values to allow PEFT training. In these scenarios, the KV cache is extremely inefficient and the models' internal states are also kept, making inference use a huge amount of VRAM. It is possible to rewrite the Llama and Mistral inference code with custom kernels in PyTorch, but that would break compatibility with all other features (e.g. what is done by ExLlama or vLLM).

xaedes added a commit to xaedes/llama.cpp that referenced this pull request Nov 6, 2023
ggerganov pushed a commit that referenced this pull request Nov 7, 2023
* fix backward process of rope

rope backward process was broken after YaRN RoPE (#2268) implementation, due to missing changes in backward functions.

the code for the backward process is nearly identical to the forward process:
the only difference is the sign of the sin-values.

to avoid future regressions remove the near-duplicate backward functions and reuse the forward code:

for this a new function argument `bool forward` was added to `ggml_compute_forward_rope_f32` and `ggml_compute_forward_rope_f16`.
the sin-values will be negated when forward is false.

* fix finetune rope call to use correct default attn_factor of 1.0f

* remove unused `ggml_rope_xpos_back`

it is better to have only one `ggml_rope_back` function that accepts all rope parameters, so that `ggml_compute_backward` can propagate all parameters without having to switch between different rope_back variants.

* fix comments explaining the sinus sign in ggml_forward_rope

* add missing function arguments in declaration

* fix function argument type in declaration
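
A minimal sketch of the sign trick described in that commit message (illustrative, not the exact ggml code):

```c
#include <math.h>

// Illustrative only: forward and backward RoPE can share the same rotation
// code, since the backward pass is the inverse rotation, i.e. the same
// formula with the sine negated. `forward` plays the role of the new bool
// argument mentioned in the commit message.
static void rope_pair(float x0, float x1, float theta, int forward,
                      float *out0, float *out1) {
    const float sin_sign  = forward ? 1.0f : -1.0f;
    const float cos_theta = cosf(theta);
    const float sin_theta = sinf(theta) * sin_sign;

    *out0 = x0 * cos_theta - x1 * sin_theta;
    *out1 = x0 * sin_theta + x1 * cos_theta;
}
```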
olexiyb pushed a commit to Sanctum-AI/llama.cpp that referenced this pull request Nov 23, 2023
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Jeffrey Quesnelle <jquesnelle@gmail.com>
olexiyb pushed several more commits to Sanctum-AI/llama.cpp that referenced this pull request Nov 23, 2023, including the rope backward-process fix quoted above.
cebtenzzre added two commits to nomic-ai/llama.cpp that referenced this pull request Nov 23, 2023
The NeoX cur_rot part is different because I'm pretty sure my original implementation was wrong.
@cebtenzzre removed the "demo" label (Demonstrate some concept or idea, not intended to be merged) Feb 13, 2024