
Add llama 3.1 rope scaling factors to llama conversion and inference #8676

Merged: 6 commits merged into ggerganov:master on Jul 27, 2024

Conversation

jmorganca (Contributor)

Hi all, this commit generates the rope factors on conversion and adds them to the resulting model as a tensor. At inference time, these factors are passed to the ggml_rope_ext rope operation.

From our testing, this really improves results for context windows above 8192 for Llama 3.1.

This should fix #8650
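
For readers skimming the thread, the conversion-time computation that emerges from the diffs below boils down to the following condensed sketch (not the exact PR code; hparam names follow the HF config, head_dim corresponds to hidden_size // num_attention_heads, and base to rope_theta):

import math
import numpy as np

def llama3_rope_factors(rope_scaling: dict, head_dim: int, base: float) -> np.ndarray:
    """Per-dimension divisors for the rope frequencies (written out as the rope_freqs tensor)."""
    factor = rope_scaling.get("factor", 8.0)
    low_freq_factor = rope_scaling.get("low_freq_factor", 1.0)
    high_freq_factor = rope_scaling.get("high_freq_factor", 4.0)
    old_context_len = rope_scaling.get("original_max_position_embeddings", 8192)

    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor

    # base rope frequencies, one per pair of head dimensions
    freqs = 1.0 / (base ** (np.arange(0, head_dim, 2, dtype=np.float32) / head_dim))

    factors = []
    for freq in freqs:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            factors.append(1.0)       # high-frequency dims: unchanged
        elif wavelen > low_freq_wavelen:
            factors.append(factor)    # low-frequency dims: frequency divided by `factor`
        else:
            # smooth interpolation between the two regimes
            smooth = (old_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
            factors.append(1 / ((1 - smooth) / factor + smooth))
    return np.array(factors, dtype=np.float32)

For Llama 3.1 this would be called roughly as llama3_rope_factors(hparams["rope_scaling"], hparams["hidden_size"] // hparams["num_attention_heads"], hparams["rope_theta"]), with the result stored as rope_freqs.weight.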

@github-actions bot added the python (python script changes) label Jul 24, 2024
convert_hf_to_gguf.py (outdated review thread):
@@ -1514,6 +1514,35 @@ def set_gguf_parameters(self):
        if self.hparams.get("vocab_size", 32000) == 49152:
            self.gguf_writer.add_add_bos_token(False)

        if rope_scaling := self.find_hparam(["rope_scaling"], optional=True):
            if rope_scaling.get("rope_type", '').lower() == "llama3":
Collaborator

I think it's best if you merge this into the rope scaling above (line 1501), so that all the rope operations are in one place.

@jmorganca marked this pull request as draft July 24, 2024 20:32
@jmorganca (Contributor Author)

Marking this as draft as there may also be a change required to the rope kernels – investigating!

@github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) label Jul 24, 2024
@jxy (Contributor)

jxy commented Jul 24, 2024

It's probably not a good idea to change the kernels. I just tested creating the factors instead, as in my comment above. Here are the changes on top of your PR:

diff --git a/convert_hf_to_gguf.py b/convert_hf_to_gguf.py
index 4422948f..897a6e33 100755
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -1518,7 +1518,7 @@ class LlamaModel(Model):
             if rope_scaling.get("rope_type", '').lower() == "llama3":
                 base = hparams.get("rope_theta", 10000.0)
                 dim = int((hparams["hidden_size"] // hparams["num_attention_heads"]) * hparams.get("partial_rotary_embeddings", 1.0))
-                inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
+                freqs = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
 
                 factor = rope_scaling.get("factor", 8.0)
                 low_freq_factor = rope_scaling.get("low_freq_factor", 1.0)
@@ -1528,20 +1528,20 @@ class LlamaModel(Model):
                 low_freq_wavelen = old_context_len / low_freq_factor
                 high_freq_wavelen = old_context_len / high_freq_factor
 
-                rope_freqs = []
-                for freq in inv_freq:
+                rope_factors = []
+                for freq in freqs:
                     wavelen = 2 * math.pi / freq
                     if wavelen < high_freq_wavelen:
-                        rope_freqs.append(freq)
+                        rope_factors.append(1)
                     elif wavelen > low_freq_wavelen:
-                        rope_freqs.append(freq / factor)
+                        rope_factors.append(factor)
                     else:
                         assert low_freq_wavelen != high_freq_wavelen
                         smooth = (old_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
-                        rope_freqs.append((1 - smooth) * freq / factor + smooth * freq)
+                        rope_factors.append(1 / ((1 - smooth) / factor + smooth))
 
                 self.gguf_writer.add_rope_scaling_attn_factors(1.0)
-                self.gguf_writer.add_tensor(gguf.TENSOR_NAMES[gguf.MODEL_TENSOR.ROPE_FREQS] + ".weight", 1.0 / np.array(rope_freqs))
+                self.gguf_writer.add_tensor(gguf.TENSOR_NAMES[gguf.MODEL_TENSOR.ROPE_FREQS] + ".weight", np.array(rope_factors, dtype = np.float32))
 
     @staticmethod
     def permute(weights: Tensor, n_head: int, n_head_kv: int | None):

A quick test shows perplexity is slightly better than the master branch ([1]4.6850,[2]5.9231 on master):

perplexity: calculating perplexity over 17 chunks, n_ctx=16384, batch_size=2048, n_seq=1
perplexity: 191.29 seconds per pass - ETA 54.18 minutes
[1]4.3272,[2]5.5469,^C

@jmorganca marked this pull request as ready for review July 24, 2024 22:30
                    else:
                        assert low_freq_wavelen != high_freq_wavelen
                        smooth = (old_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
                        rope_factors.append(1 / (1 - smooth) * factor + smooth)
Contributor

We want the inverse of https://github.com/meta-llama/llama-models/blob/1b5892739868e5333fb7f022ba91218f0ae5f9c2/models/llama3_1/api/model.py#L62 with the freq factored out, which should be 1 / ((1 - smooth) / scale_factor + smooth)

@jxy (Contributor)

jxy commented Jul 24, 2024

It should be

1 / ((1 - smooth) / factor + smooth)
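
For reference, the algebra (assuming, per the diff, that the stored rope factors act as per-dimension divisors of the frequencies): Meta's apply_scaling maps a mid-band frequency freq to (1 - smooth) * freq / factor + smooth * freq = freq * ((1 - smooth) / factor + smooth). Since the low-frequency branch stores factor exactly where Meta divides by scale_factor, the stored divisor has to be 1 / ((1 - smooth) / factor + smooth).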

@jmorganca (Contributor Author)

Awesome, thanks @jxy. Should be updated now. Agreed - much better without the kernel changes.

convert_hf_to_gguf.py (outdated review thread, resolved)
@kallewoof

Nice. I tested doing a bunch of summaries using up the entire 128k context and the output looks good, whereas on master it outputs broken garbage.

@MoonRide303

MoonRide303 commented Jul 25, 2024

Initial tests show improvement over master (model able to reason within bigger context - I tested up to -c 32768), but I've noticed one potential problem - GGUF made with this PR has n_tensors = 292 instead of 291 on master (extra rope_freqs.weight). It might cause errors in some apps checking it (like koboldcpp):

llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 292, got 291
llama_load_model_from_file: failed to load model
Traceback (most recent call last):
  File "koboldcpp.py", line 4208, in <module>
    main(parser.parse_args(),start_server=True)
  File "koboldcpp.py", line 3883, in main
    loadok = load_model(modelname)
  File "koboldcpp.py", line 773, in load_model
    ret = handle.load_model(inputs)
OSError: exception: access violation reading 0x0000000000001894
[21452] Failed to execute script 'koboldcpp' due to unhandled exception!

FYI @LostRuins
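
A quick way to check whether a particular GGUF contains the extra tensor is to list the tensor names with the gguf-py reader that ships in this repo (an illustrative snippet assuming the GGUFReader API; not part of the PR):

import sys
from gguf import GGUFReader

reader = GGUFReader(sys.argv[1])
names = [t.name for t in reader.tensors]
print(len(names), "tensors")
print("rope_freqs.weight present:", "rope_freqs.weight" in names)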

@tristandruyen (Contributor)

tristandruyen commented Jul 25, 2024

> Initial tests show improvement over master (model able to reason within bigger context - I tested up to -c 32768), but I've noticed one potential problem - GGUF made with this PR has n_tensors = 292 instead of 291 on master (extra rope_freqs.weight). It might cause errors in some apps checking it (like koboldcpp):

Isn't this just a side effect of kobold.cpp using a different llama.cpp version internally, which would need to be updated anyway to take advantage of these changes?

@Nexesenex (Contributor)

Nexesenex commented Jul 25, 2024

@MoonRide303: I merged this PR into Kobold.cpp Frankenstein this morning (v171013), and it now works (sensical inference) without problems beyond 8k context.

@MoonRide303

MoonRide303 commented Jul 25, 2024

> Initial tests show improvement over master (model able to reason within bigger context - I tested up to -c 32768), but I've noticed one potential problem - GGUF made with this PR has n_tensors = 292 instead of 291 on master (extra rope_freqs.weight). It might cause errors in some apps checking it (like koboldcpp):
>
> Isn't this just a side effect of kobold.cpp using a different llama.cpp version internally, which would need to be updated anyway to take advantage of these changes?

I tested it on koboldcpp 1.71 (released today, with upstream changes merged yesterday) - but you might be right that some changes from the original llama.cpp are missing (or that some extra changes are problematic). I'm not sure it's something to worry about on the llama.cpp project side; I'm just pointing it out as a potential side effect - for some people, new GGUFs might not work.

@LostRuins (Collaborator)

I wasn't aware that this PR breaks previously quantized Llama 3.1 models, I thought that all prior models would continue to work.
Shouldn't the tensor count always match what's declared in the metadata regardless?

Anyway, I'll do a patch release once this is merged into master.

@MoonRide303

MoonRide303 commented Jul 25, 2024

> I wasn't aware that this PR breaks previously quantized Llama 3.1 models, I thought that all prior models would continue to work. Shouldn't the tensor count always match what's declared in the metadata regardless?
>
> Anyway, I'll do a patch release once this is merged into master.

Old GGUFs will continue to work; it's just newly converted models (with this PR's changes) that might not work with older llama.cpp-based apps (like koboldcpp 1.71).

@LostRuins (Collaborator)

Yeah, but that's my point - it's failing to load because the tensor count doesn't match the metadata, which it should in both cases, unless I misunderstand the error.

@kallewoof

Merging with koboldcpp concedo_experimental works fine on my end, FWIW.

@schmorp

schmorp commented Jul 25, 2024

Why couldn't this tensor be added by llama.cpp when loading? Superficially, it doesn't make much sense to bake the rope config into the model at conversion time, and it prevents bug fixes at a later time.

                        rope_factors.append(1 / ((1 - smooth) / factor + smooth))

                self.gguf_writer.add_rope_scaling_attn_factors(1.0)
                self.gguf_writer.add_tensor(self.format_tensor_name(gguf.MODEL_TENSOR.ROPE_FREQS), np.array(rope_factors, dtype=np.float32))
@compilade (Collaborator) Jul 25, 2024

This will be included in --vocab-only models because it's in set_gguf_parameters, while the tensor data is not included in vocab-only files.

But the tensor count is still in the header of GGUF models, so I expect vocab-only files of Llama-3.1 to be malformed because of this.

This tensor could likely instead be "inserted" in the override of self.prepare_tensors to add it before the others and then call super().prepare_tensors(). If it's cleaner, it's possible to keep inside set_gguf_parameters the code which builds the tensor content, as long as the tensor is only "added" in prepare_tensors.
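
A rough sketch of that suggestion (hypothetical code, not the final PR change): set_gguf_parameters would only build and stash the array, e.g. self._rope_factors = np.array(rope_factors, dtype=np.float32) (an illustrative attribute name), and the tensor would be emitted from an override of prepare_tensors in LlamaModel:

    def prepare_tensors(self):
        # per the comment above, tensor data is not written for --vocab-only
        # conversions, so adding the tensor here keeps the declared tensor
        # count consistent with the actual payload
        rope_factors = getattr(self, "_rope_factors", None)
        if rope_factors is not None:
            self.gguf_writer.add_tensor(self.format_tensor_name(gguf.MODEL_TENSOR.ROPE_FREQS), rope_factors)
        super().prepare_tensors()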

@jmorganca (Contributor Author)

jmorganca commented Jul 25, 2024

Thanks! Given the hard check on the tensor count at load time, it seems we might need to instead add the rope parameters as metadata and calculate the factors at runtime. I definitely did not want to break backwards compatibility with old GGUF files with this change - sorry!

@ggerganov (Owner)

Thank you for looking into this. Btw, even if it is in the metadata and the rope freqs are computed at runtime, it will still not be backwards compatible, because the old GGUF files will not have the metadata.

@gilbertgong

gilbertgong commented Jul 25, 2024

Storing the information as metadata and computing at runtime seems preferable to me, as it seems more correct for the tensor count to match that of loading the model in HF transformers, for example (and it reduces potential confusion and incorrect assumptions based on a mismatched tensor count). One could bake in defaults for when the metadata is missing from the GGUF - that would make it backwards compatible, if desired.

@3Simplex

Thank you for your effort on this; we are all waiting patiently for the rope fix to be finished! We have some great functionality in GPT4All with this new model!


@qnixsynapse (Contributor)

> A program crash will cue the user to investigate what's wrong. Degraded performance might just make the user think the software/model is of low quality.

Exactly! I see no reason to make it backward compatible.

@LostRuins (Collaborator)

LostRuins commented Jul 26, 2024

Crashing outright is bad; a better result would be a warning printed to stdout. It's always better to fail gracefully when possible.

Sort of like how we did this:

llama.cpp/src/llama.cpp

Lines 5344 to 5352 in 01245f5

if (tokenizer_pre.empty()) {
    LLAMA_LOG_WARN("%s: missing pre-tokenizer type, using: 'default'\n", __func__);
    LLAMA_LOG_WARN("%s: \n", __func__);
    LLAMA_LOG_WARN("%s: ************************************ \n", __func__);
    LLAMA_LOG_WARN("%s: GENERATION QUALITY WILL BE DEGRADED! \n", __func__);
    LLAMA_LOG_WARN("%s: CONSIDER REGENERATING THE MODEL \n", __func__);
    LLAMA_LOG_WARN("%s: ************************************ \n", __func__);
    LLAMA_LOG_WARN("%s: \n", __func__);
    vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEFAULT;

@slaren (Collaborator)

slaren commented Jul 26, 2024

It doesn't crash; it causes llama_load_model_from_file to return an error.

                    elif wavelen > low_freq_wavelen:
                        rope_factors.append(factor)
                    else:
                        assert low_freq_wavelen != high_freq_wavelen
Contributor

Should we assert if they're equal, or just not apply it at all and return gracefully?

@compilade (Collaborator) Jul 26, 2024

I think this assertion should be outside of the loop. These values don't change within the loop. If it's gracefully handled, it should at least print a warning with something like logger.warning("rope freq high and low wavelengths can't be equal") (or something more informative).

(EDIT: woah, answering a comment while making a review is weird and seems to duplicate the message, with both instances having the same permalink)
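
For illustration, the loop from the diff with that check hoisted out (a sketch only; leaving the factors at 1.0 in the degenerate case is an assumption, not what the PR does):

rope_factors = []
if low_freq_wavelen == high_freq_wavelen:
    # checked once, outside the loop - these values never change per frequency
    logger.warning("rope freq high and low wavelengths are equal, leaving rope factors at 1.0")
    rope_factors = [1.0] * len(freqs)
else:
    for freq in freqs:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            rope_factors.append(1.0)
        elif wavelen > low_freq_wavelen:
            rope_factors.append(factor)
        else:
            smooth = (old_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
            rope_factors.append(1 / ((1 - smooth) / factor + smooth))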

@gilbertgong

Another way to make it backwards compatible if desired is to just use the old rope calculation if the metadata/tensor is missing.

@oldgithubman

oldgithubman commented Jul 26, 2024

> Another way to make it backwards compatible if desired is to just use the old rope calculation if the metadata/tensor is missing.

We need more of this mentality around here. There's usually a way.

@m18coppola (Contributor)

> Another way to make it backwards compatible if desired is to just use the old rope calculation if the metadata/tensor is missing.

If so, I think there should be a warning similar to what @LostRuins mentioned.

@compilade (Collaborator)

> Another way to make it backwards compatible if desired is to just use the old rope calculation if the metadata/tensor is missing.

That's what the code already seems to do. It only uses the rope freqs when present.

> If so, I think there should be a warning

I'm not sure there can be such a warning, because older models like llama-2 still use the same model architecture (LLM_ARCH_LLAMA, so internally they can't really be distinguished), yet they don't require these modified rope freqs.

jmorganca and others added 4 commits July 26, 2024 15:01
This commit generates the rope factors on conversion and adds them to the resulting model as a tensor. At inference time, these factors are passed to the `ggml_rope_ext` rope operation, improving results for context windows above 8192
Co-authored-by: compilade <git@compilade.net>
@ddh0 (Contributor)

ddh0 commented Jul 27, 2024

Just cloned branch jmorganca:master as of commit 90fd87df4155aef5f099812a99c1e06c0b588c0d and used it to convert the Llama 3.1 8B Instruct model and quantize to q8_0. Everything is working perfectly as far as I can tell. The only thing that I see potentially wrong is the pre-tokenizer is reported as smaug-bpe. Not sure if that's affecting anything or not.

src/llama.cpp (outdated review thread, resolved)
convert_hf_to_gguf.py (outdated review thread, resolved)
@compilade (Collaborator)

compilade commented Jul 27, 2024

> The only thing that I see potentially wrong is the pre-tokenizer is reported as smaug-bpe. Not sure if that's affecting anything or not.

@ddh0 When I convert the latest Llama-3.1-8B-Instruct from https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct (at HF commit 07eb05b21) with the changes from this PR (at 90fd87d), the pre-tokenizer is detected as llama-bpe, so I guess the pre-tokenizer differences have been fixed upstream.

$ sha256sum Meta-Llama-3.1-8B-Instruct/tokenizer{,_config}.json 
79e3e522635f3171300913bb421464a87de6222182a0570b9b2ccba2a964b2b4  Meta-Llama-3.1-8B-Instruct/tokenizer.json
24e8a6dc2547164b7002e3125f10b415105644fcf02bf9ad8b674c87b1eaaed6  Meta-Llama-3.1-8B-Instruct/tokenizer_config.json

@ddh0 (Contributor)

ddh0 commented Jul 27, 2024

Thanks @compilade, you were right. After pulling the latest repo, the tokenizer is detected as llama-bpe.

jmorganca and others added 2 commits July 27, 2024 00:39
Co-authored-by: compilade <git@compilade.net>
Co-authored-by: compilade <git@compilade.net>
@ggerganov merged commit b5e9546 into ggerganov:master Jul 27, 2024
55 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 27, 2024
* Add llama 3.1 rope scaling factors to llama conversion and inference

This commit generates the rope factors on conversion and adds them to the resulting model as a tensor. At inference time, these factors are passed to the `ggml_rope_ext` rope operation, improving results for context windows above 8192

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <git@compilade.net>

* address comments

* address comments

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: compilade <git@compilade.net>
manyoso pushed a commit to nomic-ai/llama.cpp that referenced this pull request Jul 27, 2024
LostRuins added a commit to LostRuins/koboldcpp that referenced this pull request Jul 28, 2024
Labels: Nvidia GPU (Issues specific to Nvidia GPUs), python (python script changes)
Projects: None yet
Development

Successfully merging this pull request may close these issues.

Feature Request: Proper Llama 3.1 Support in llama.cpp