Description
Following on from discussions in the Llama 2 70B PR (#2276):
Since that PR, converting Llama 2 70B models from Meta's original PTH format files works great.
But it is not possible to produce usable Llama 2 70B models from the HF format. The models convert and quantise fine, but always produce gibberish, as in this example:
```
### Human: write a story about llamas\n### Assistant:20 300202000 B00A0
```
It looks like the tensors get transformed by the converter's new permute, which somehow uses the GQA parameters `num_local_key_value_heads` and `num_key_value_heads`:
https://github.com/huggingface/transformers/blob/b257c46a075419c09e5ce5c5aa39bc346ecdb9a5/src/transformers/models/llama/convert_llama_weights_to_hf.py#L173-L195
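To make the suspected transform concrete, below is a minimal sketch of the permute applied by the HF converter and what undoing it would look like. The `permute` body mirrors the function in the script linked above; `inverse_permute` and the shape constants (64 query heads, 8 KV heads, head dim 128, as I understand the 70B config) are my own illustrative assumptions, not anything from llama.cpp or transformers:

```python
# Sketch only: `permute` follows the HF converter; `inverse_permute` is a
# hypothetical helper showing what a GGML converter would need to undo.
import torch

def permute(w: torch.Tensor, n_heads: int, dim1: int, dim2: int) -> torch.Tensor:
    """Interleaved -> split rotary layout, as done by the HF converter."""
    return w.view(n_heads, dim1 // n_heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2)

def inverse_permute(w: torch.Tensor, n_heads: int, dim1: int, dim2: int) -> torch.Tensor:
    """Undo permute() by swapping the same two axes back."""
    return w.view(n_heads, 2, dim1 // n_heads // 2, dim2).transpose(1, 2).reshape(dim1, dim2)

# Assumed Llama 2 70B GQA shapes: 64 query heads, 8 KV heads, head_dim 128.
n_heads, n_kv_heads, head_dim = 64, 8, 128
dim = n_heads * head_dim          # 8192
kv_dim = n_kv_heads * head_dim    # 1024

wq = torch.randn(dim, dim)
wk = torch.randn(kv_dim, dim)

# q_proj is permuted with the full head count, but k_proj with the KV head
# count. A converter that reuses n_heads for k_proj would scramble the
# 70B weights while leaving 7B/13B (where the counts match) untouched.
assert torch.equal(inverse_permute(permute(wq, n_heads, dim, dim), n_heads, dim, dim), wq)
assert torch.equal(inverse_permute(permute(wk, n_kv_heads, kv_dim, dim), n_kv_heads, kv_dim, dim), wk)
```

If that reading is right, it would explain why the PTH path works but the HF path produces gibberish only for 70B: the mismatch only appears when `num_key_value_heads` differs from `num_attention_heads`.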
For reference, here are all the changes made to Transformers' convert_llama_weights_to_hf.py for the Llama 2 release: huggingface/transformers@07360b6#diff-110a445233a8b15a0875998eeaf75cb8607b38a5daa736291dd058766879bbdd
Would anyone be able to look into this? It's a bit beyond my experience.
I'm getting multiple requests a day for 70B fine-tune quants for FreeWilly 2, Llama2-Guanaco, and the newly released Airoboros 1.4.1 70B, and would love to be able to provide them for people.
Thanks in advance.