Add llama 3.1 rope scaling factors to llama conversion and inference #8676
Conversation
convert_hf_to_gguf.py (Outdated)
@@ -1514,6 +1514,35 @@ def set_gguf_parameters(self):
        if self.hparams.get("vocab_size", 32000) == 49152:
            self.gguf_writer.add_add_bos_token(False)

        if rope_scaling := self.find_hparam(["rope_scaling"], optional=True):
            if rope_scaling.get("rope_type", '').lower() == "llama3":
I think it's best if you merge this into the rope scaling above (line 1501), so that all the rope operations are in one place.
Marking this as draft as there may also be a change required to the rope kernels – investigating!
Probably it's not a good idea to change the kernels. I just tested creating the factors instead as in my comment above. Here are the changes on top of your PR:

diff --git a/convert_hf_to_gguf.py b/convert_hf_to_gguf.py
index 4422948f..897a6e33 100755
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -1518,7 +1518,7 @@ class LlamaModel(Model):
if rope_scaling.get("rope_type", '').lower() == "llama3":
base = hparams.get("rope_theta", 10000.0)
dim = int((hparams["hidden_size"] // hparams["num_attention_heads"]) * hparams.get("partial_rotary_embeddings", 1.0))
- inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
+ freqs = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
factor = rope_scaling.get("factor", 8.0)
low_freq_factor = rope_scaling.get("low_freq_factor", 1.0)
@@ -1528,20 +1528,20 @@ class LlamaModel(Model):
low_freq_wavelen = old_context_len / low_freq_factor
high_freq_wavelen = old_context_len / high_freq_factor
- rope_freqs = []
- for freq in inv_freq:
+ rope_factors = []
+ for freq in freqs:
wavelen = 2 * math.pi / freq
if wavelen < high_freq_wavelen:
- rope_freqs.append(freq)
+ rope_factors.append(1)
elif wavelen > low_freq_wavelen:
- rope_freqs.append(freq / factor)
+ rope_factors.append(factor)
else:
assert low_freq_wavelen != high_freq_wavelen
smooth = (old_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
- rope_freqs.append((1 - smooth) * freq / factor + smooth * freq)
+ rope_factors.append(1 / ((1 - smooth) / factor + smooth))
self.gguf_writer.add_rope_scaling_attn_factors(1.0)
- self.gguf_writer.add_tensor(gguf.TENSOR_NAMES[gguf.MODEL_TENSOR.ROPE_FREQS] + ".weight", 1.0 / np.array(rope_freqs))
+ self.gguf_writer.add_tensor(gguf.TENSOR_NAMES[gguf.MODEL_TENSOR.ROPE_FREQS] + ".weight", np.array(rope_factors, dtype = np.float32))
@staticmethod
def permute(weights: Tensor, n_head: int, n_head_kv: int | None):

A quick test shows perplexity is slightly better than master branch ([1]4.6850, [2]5.9231).
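For readers following along, here is a self-contained sketch of the factor computation described in the diff above, with the hyperparameters inlined as plain arguments. The defaults mirror the published Llama 3.1 config (rope_theta 500000, factor 8, low/high freq factors 1/4, original context 8192, head dim 128 for the 8B model); treat it as an illustration rather than the exact converter code.

```python
# Standalone sketch of the Llama 3.1 rope-factor computation discussed above.
# Default values mirror Meta-Llama-3.1 defaults; adjust to the model's config.json.
import math

import numpy as np


def llama3_rope_factors(
    head_dim: int = 128,            # hidden_size // num_attention_heads
    base: float = 500000.0,         # rope_theta
    factor: float = 8.0,            # rope_scaling.factor
    low_freq_factor: float = 1.0,   # rope_scaling.low_freq_factor
    high_freq_factor: float = 4.0,  # rope_scaling.high_freq_factor
    old_context_len: int = 8192,    # original pre-training context length
) -> np.ndarray:
    # Base rotary frequencies, one per pair of dimensions.
    freqs = 1.0 / (base ** (np.arange(0, head_dim, 2, dtype=np.float32) / head_dim))

    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor

    rope_factors = []
    for freq in freqs:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            # Short wavelengths are left untouched.
            rope_factors.append(1.0)
        elif wavelen > low_freq_wavelen:
            # Long wavelengths are scaled down by the full factor.
            rope_factors.append(factor)
        else:
            # Smooth interpolation between the two regimes.
            smooth = (old_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
            rope_factors.append(1.0 / ((1.0 - smooth) / factor + smooth))

    return np.array(rope_factors, dtype=np.float32)


if __name__ == "__main__":
    print(llama3_rope_factors()[:8])  # first few per-dimension divisors
```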
convert_hf_to_gguf.py (Outdated)
else:
    assert low_freq_wavelen != high_freq_wavelen
    smooth = (old_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
    rope_factors.append(1 / (1 - smooth) * factor + smooth)
We want the inverse of https://github.com/meta-llama/llama-models/blob/1b5892739868e5333fb7f022ba91218f0ae5f9c2/models/llama3_1/api/model.py#L62, factoring out the freq. It should be `1/((1 - smooth) / scale_factor + smooth)`.
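To see why this is the right inversion: the reference code multiplies each frequency by `(1 - smooth) / factor + smooth`, while the factors the converter stores act as divisors on the base frequencies (that is what makes the pure long-wavelength entry equal to `factor` itself). A quick numerical sanity check of that equivalence, written under that assumption and not taken from the project code:

```python
# Sanity check (not project code): dividing the base frequency by
# 1 / ((1 - smooth) / factor + smooth) gives the same result as the
# reference (1 - smooth) * freq / factor + smooth * freq.
import numpy as np

rng = np.random.default_rng(0)
freq = rng.uniform(1e-4, 1.0, size=1000)      # arbitrary base frequencies
smooth = rng.uniform(0.0, 1.0, size=1000)     # interpolation weights in [0, 1]
factor = 8.0                                  # Llama 3.1 scaling factor

reference = (1.0 - smooth) * freq / factor + smooth * freq
rope_factor = 1.0 / ((1.0 - smooth) / factor + smooth)

assert np.allclose(freq / rope_factor, reference)
print("freq / rope_factor matches the reference scaling")
```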
Awesome, thanks @jxy. Should be updated now. Agreed, much better without the kernel changes.
Nice. Tested doing a bunch of summaries using up the entire 128k context and the output looks good, whereas on master it outputs broken garbage.
Initial tests show improvement over master (model able to reason within a bigger context - I tested up to
FYI @LostRuins
Isn't this just a side effect of kobold.cpp using a different llama.cpp version internally, which would need to be updated anyway to take advantage of these changes?
@MoonRide303: I merged this PR into Kobold.cpp Frankenstein this morning (v171013), and it works (sensical inference) without problems beyond 8k context now.
I tested it on koboldcpp 1.71 (released today, having merged upstream changes yesterday) - but you might be right that some changes from original llama.cpp are missing (or some extra changes might be problematic). I am not sure if it's something to worry about on the llama.cpp project side, just pointing it out as a potential side effect - for some people new GGUFs might not work.
I wasn't aware that this PR breaks previously quantized Llama 3.1 models; I thought that all prior models would continue to work. Anyway, I'll do a patch release once this is merged into master.
Old GGUFs will continue to work; it's just newly converted models (with this PR's changes) that might not work with older llama.cpp-based apps (like koboldcpp 1.71).
Yeah, but that's my point - it's failing to load because the tensor counts don't match the metadata, which they should in both cases, unless I misunderstand the error.
Merging with koboldcpp concedo_experimental works fine on my end, FWIW.
Why couldn't this tensor be added by llama.cpp when loading? Superficially it doesn't make much sense to bake the rope config into the model at conversion time, and it prevents bugfixes at a later time.
convert_hf_to_gguf.py (Outdated)
rope_factors.append(1 / ((1 - smooth) / factor + smooth))

self.gguf_writer.add_rope_scaling_attn_factors(1.0)
self.gguf_writer.add_tensor(self.format_tensor_name(gguf.MODEL_TENSOR.ROPE_FREQS), np.array(rope_factors, dtype=np.float32))
This will be included in `--vocab-only` models because it's in `set_gguf_parameters`, while the tensor data is not included in vocab-only files. But the tensor count is still in the header of GGUF models, so I expect vocab-only files of Llama-3.1 to be malformed because of this.

This tensor could likely instead be "inserted" in the override of `self.prepare_tensors` to add it before the others and then call `super().prepare_tensors()`. If it's cleaner, it's possible to keep inside `set_gguf_parameters` the code which builds the tensor content, as long as the tensor is only "added" in `prepare_tensors`.
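A minimal sketch of what that split could look like, meant to sit inside convert_hf_to_gguf.py next to the existing LlamaModel code (so `Model`, `gguf` and `self.gguf_writer` come from the converter). `_rope_factors` and `llama3_rope_factors` are illustrative names rather than existing converter symbols, and `original_max_position_embeddings` is assumed to be the rope_scaling key that carries the original context length.

```python
# Sketch of the suggested restructuring: build the factors in set_gguf_parameters,
# but only register the tensor in prepare_tensors, so that a --vocab-only
# conversion never advertises a tensor it does not actually write.
import numpy as np


class LlamaModel(Model):
    def set_gguf_parameters(self):
        super().set_gguf_parameters()
        self._rope_factors = None  # hypothetical attribute, for illustration only
        if rope_scaling := self.find_hparam(["rope_scaling"], optional=True):
            if rope_scaling.get("rope_type", "").lower() == "llama3":
                # Build the tensor content here (see the earlier llama3_rope_factors
                # sketch), but do not add it to the writer yet.
                self._rope_factors = llama3_rope_factors(
                    head_dim=self.hparams["hidden_size"] // self.hparams["num_attention_heads"],
                    base=self.hparams.get("rope_theta", 10000.0),
                    factor=rope_scaling.get("factor", 8.0),
                    low_freq_factor=rope_scaling.get("low_freq_factor", 1.0),
                    high_freq_factor=rope_scaling.get("high_freq_factor", 4.0),
                    old_context_len=rope_scaling.get("original_max_position_embeddings", 8192),
                )

    def prepare_tensors(self):
        # Register the rope-factor tensor only when tensor data is written.
        if self._rope_factors is not None:
            self.gguf_writer.add_tensor(
                self.format_tensor_name(gguf.MODEL_TENSOR.ROPE_FREQS),
                np.asarray(self._rope_factors, dtype=np.float32),
            )
        super().prepare_tensors()
```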
Thanks! Given the hard check on the tensor count at load time, it seems we might need to instead add the rope parameters as metadata and calculate them at runtime. Definitely did not want to break backwards compatibility with old GGUF files with this change, sorry!
Thank you for looking into this. Btw, even if it is in the metadata and the rope freqs are computed at runtime, it will still not be backwards compatible, because the old GGUF files will not have the metadata.
Storing the information as metadata and computing it at runtime seems preferable to me, as it seems more correct to have the tensor count match that of loading the model in HF transformers, for example (and it would reduce potential confusion and people making incorrect assumptions based on a mismatched tensor count). One could bake in defaults for when the metadata is missing from the GGUF - that would make it backwards compatible, if desired.
Exactly! I see no reason to make it backward compatible.
Crashing outright is bad; a better result would be a warning printed to stdout. It's always better to fail gracefully when possible, sort of like how we did this: lines 5344 to 5352 in 01245f5.
It doesn't crash, it causes
convert_hf_to_gguf.py (Outdated)
elif wavelen > low_freq_wavelen:
    rope_factors.append(factor)
else:
    assert low_freq_wavelen != high_freq_wavelen
Should we assert if they're equal, or just not apply it at all and return gracefully?
I think this assertion should be outside of the loop. These values don't change within the loop. If it's gracefully handled, it should at least print a warning with something like `logger.warning("rope freq high and low wavelengths can't be equal")` (or something more informative).

(EDIT: woah, answering a comment while making a review is weird and seems to duplicate the message, with both instances having the same permalink)
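As a rough illustration of that suggestion (not the converter's actual code), the check can be hoisted out of the loop and turned into a warning that skips the llama3 scaling instead of asserting:

```python
# Illustration of hoisting the equal-wavelength check out of the loop and
# failing gracefully instead of asserting. `logger` is assumed to be the
# converter's module logger; the loop body mirrors the diff above.
import logging
import math

logger = logging.getLogger(__name__)


def llama3_rope_factors_graceful(freqs, factor, low_freq_factor, high_freq_factor, old_context_len):
    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor

    # Checked once, outside the loop: these values do not change per frequency.
    if low_freq_wavelen == high_freq_wavelen:
        logger.warning("rope freq low and high wavelengths are equal; skipping llama3 rope scaling")
        return None

    rope_factors = []
    for freq in freqs:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            rope_factors.append(1.0)
        elif wavelen > low_freq_wavelen:
            rope_factors.append(factor)
        else:
            smooth = (old_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
            rope_factors.append(1.0 / ((1.0 - smooth) / factor + smooth))
    return rope_factors
```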
Another way to make it backwards compatible, if desired, is to just use the old rope calculation if the metadata/tensor is missing.
We need more of this mentality around here. There's usually a way.
If so, I think there should be a warning similar to what @LostRuins mentioned.
That's what the code already seems to do. It only uses the rope freqs when present.
I'm not sure there can be such a warning, because older models like
This commit generates the rope factors on conversion and adds them to the resulting model as a tensor. At inference time, these factors are passed to the `ggml_rope_ext` rope operation, improving results for context windows above 8192.
Co-authored-by: compilade <git@compilade.net>
Just cloned branch
@ddh0 When I convert the latest Llama-3.1-8B-Instruct from https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct (at HF commit …):

$ sha256sum Meta-Llama-3.1-8B-Instruct/tokenizer{,_config}.json
79e3e522635f3171300913bb421464a87de6222182a0570b9b2ccba2a964b2b4  Meta-Llama-3.1-8B-Instruct/tokenizer.json
24e8a6dc2547164b7002e3125f10b415105644fcf02bf9ad8b674c87b1eaaed6  Meta-Llama-3.1-8B-Instruct/tokenizer_config.json
Thanks @compilade, you were right. After pulling the latest repo the tokenizer is detected as
* Add llama 3.1 rope scaling factors to llama conversion and inference

  This commit generates the rope factors on conversion and adds them to the resulting model as a tensor. At inference time, these factors are passed to the `ggml_rope_ext` rope operation, improving results for context windows above 8192

* Update convert_hf_to_gguf.py

  Co-authored-by: compilade <git@compilade.net>

* address comments

* address comments

* Update src/llama.cpp

  Co-authored-by: compilade <git@compilade.net>

* Update convert_hf_to_gguf.py

  Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: compilade <git@compilade.net>
Hi all, this commit generates the rope factors on conversion and adds them to the resulting model as a tensor. At inference time, these factors are passed to the `ggml_rope_ext` rope operation. From our testing, this really improves results for context windows above 8192 for Llama 3.1.
This should fix #8650
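For intuition about what the stored factors do at inference time: they act as per-dimension divisors on the rotary frequencies, so the slow (long-wavelength) dimensions see positions compressed by up to the scaling factor. Below is a self-contained toy illustration of that effect; it is not llama.cpp code, it uses a simplified two-regime split with no smoothing, and it ignores details such as the attention factor.

```python
# Toy illustration (not llama.cpp code) of how per-dimension factors change the
# rotary angles: each base frequency is divided by its factor, so the slowest
# dimensions behave as if the position were smaller.
import numpy as np

head_dim, base, factor, old_ctx = 128, 500000.0, 8.0, 8192

freqs = 1.0 / (base ** (np.arange(0, head_dim, 2, dtype=np.float32) / head_dim))
wavelens = 2 * np.pi / freqs

# Simplified factors: divide only the dimensions whose wavelength exceeds the
# original context window; the real converter smooths the transition.
factors = np.where(wavelens > old_ctx, factor, 1.0)

pos = 65_536
print("unscaled slowest angles:", (pos * freqs)[-4:])
print("scaled slowest angles:  ", (pos * freqs / factors)[-4:])
# With factor 8, position 65536 lands the slowest dimensions on the same angles
# that position 8192 produced without scaling - back inside the trained range.
```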