Add gemma model #5631

Conversation
There are a couple of things to note in this architecture: 1. Shared input and output embedding parameters. 2. Key length and value length are not derived from `n_embd`. More information about the models can be found at https://ai.google.dev/gemma. GGUFs can be downloaded from https://huggingface.co/google.
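To make those two points concrete, here is a minimal sketch (not code from this PR; the config field names and example values are assumptions based on the Hugging Face release) of how a converter would read these hyperparameters:

```python
# Minimal sketch, assuming a Hugging Face-style config.json next to the weights.
# Field names and values are illustrative assumptions, not taken from this PR.
import json

with open("config.json") as f:
    hparams = json.load(f)

# 1. Key/value length comes from an explicit head_dim rather than n_embd:
n_embd   = hparams["hidden_size"]            # e.g. 3072 for the 7B model
n_head   = hparams["num_attention_heads"]    # e.g. 16
head_dim = hparams["head_dim"]               # e.g. 256, not n_embd // n_head (192)

# 2. Input and output embeddings are shared, so only the token embedding
#    tensor needs to be exported; the output projection reuses it.
```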
that was fast |
Holy Moses. This was fast. Thank you |
A model converted and quantized from the safetensors weights still fails with ... for me. The ... tensor is visible in the conversion and quantization output, though. |
Interesting, is there a reason why the GGUF file is twice as large as the safetensors? |
This depends on how your conversion is done. Two things to make sure of: 1) the
The weights here are as close to the internal checkpoints as you can get. They are in float32. We are leaning on the community to experiment with other quantized versions ;). For example, you could use the |
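For a rough sense of scale (assuming roughly 8.5B parameters for the 7B model): 8.5e9 parameters × 4 bytes ≈ 34 GB in float32, versus × 2 bytes ≈ 17 GB in bf16/f16, which would explain a roughly 2× size difference between the float32 GGUF and the bf16 safetensors.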
This PR doesn't make any changes to the convert scripts. How do I convert a Gemma model to GGUF? |
You could simply download the models released on HuggingFace, for example https://huggingface.co/google/gemma-2b/blob/main/gemma-2b.gguf. |
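If you prefer to fetch it programmatically, a small example (assuming the `huggingface_hub` package and that you have accepted the Gemma license on Hugging Face):

```python
# Download the published GGUF directly; repo and filename are from the link above.
from huggingface_hub import hf_hub_download

path = hf_hub_download(repo_id="google/gemma-2b", filename="gemma-2b.gguf")
print(path)  # local cache path of the downloaded GGUF
```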
Are there plans to open-source the conversion scripts used, or will the community have to implement them? The safetensors checkpoint is a smaller download (presumably because of BF16 being converted to F32?) and one would imagine that people would like to be able to manipulate the Transformers weights (merge, finetune, etc.) before converting to GGUF, just as they do with other model architectures. |
I don't work with SafeTensors so I can't promise I will take this up personally. I'm sure folks will contribute later though 🤞 . |
Yup, hope we get some insights. I tried updating `convert-hf-to-gguf.py`:

```diff
diff --git a/convert-hf-to-gguf.py b/convert-hf-to-gguf.py
index 9771fccf..d328e524 100755
--- a/convert-hf-to-gguf.py
+++ b/convert-hf-to-gguf.py
@@ -218,6 +218,8 @@ class Model:
return BertModel
if model_architecture == "NomicBertModel":
return NomicBertModel
+ if model_architecture in "GemmaForCausalLM":
+ return GemmaModel
return Model
def _is_model_safetensors(self) -> bool:
@@ -277,6 +279,8 @@ class Model:
return gguf.MODEL_ARCH.BERT
if arch == "NomicBertModel":
return gguf.MODEL_ARCH.NOMIC_BERT
+ if arch in "GemmaForCausalLM":
+ return gguf.MODEL_ARCH.GEMMA
raise NotImplementedError(f'Architecture "{arch}" not supported!')
@@ -1785,6 +1789,24 @@ class NomicBertModel(BertModel):
yield name, data
+class GemmaModel(Model):
+ def set_vocab(self):
+ self._set_vocab_sentencepiece()
+
+ def set_gguf_parameters(self):
+ hparams = self.hparams
+ block_count = hparams["num_hidden_layers"]
+
+ self.gguf_writer.add_name(self.dir_model.name)
+ self.gguf_writer.add_context_length(hparams["max_position_embeddings"])
+ self.gguf_writer.add_embedding_length(hparams["hidden_size"])
+ self.gguf_writer.add_block_count(block_count)
+ self.gguf_writer.add_feed_forward_length(hparams["intermediate_size"])
+ self.gguf_writer.add_head_count(hparams["num_attention_heads"])
+ self.gguf_writer.add_layer_norm_rms_eps(self.hparams["rms_norm_eps"])
+ self.gguf_writer.add_head_count_kv(self.hparams["num_key_value_heads"] if "num_key_value_heads" in hparams else hparams["num_attention_heads"])
+
```

One thing that is strange is the vocab size in ... The F32 GGUF files work as expected. I'm currently testing just with the 2B model. |
That won't do the right thing... |
I would not be surprised if the Gemma implementation in HF Transformers requires different transposes of the weight tensors than the implementation in this PR. |
Huh, very weird. I've been dumping the tensors from the locally converted models and comparing the values with the provided F32 GGUF models. The values are not transposed. However, all the norm tensors need this:

```python
# Huh? Why is this needed?
if name.endswith(("norm.weight")):
    data_torch = data_torch + 1
```

Here is the full diff:

```diff
diff --git a/convert-hf-to-gguf.py b/convert-hf-to-gguf.py
index 9771fccf..e88308dc 100755
--- a/convert-hf-to-gguf.py
+++ b/convert-hf-to-gguf.py
@@ -218,6 +218,8 @@ class Model:
return BertModel
if model_architecture == "NomicBertModel":
return NomicBertModel
+ if model_architecture in "GemmaForCausalLM":
+ return GemmaModel
return Model
def _is_model_safetensors(self) -> bool:
@@ -277,6 +279,8 @@ class Model:
return gguf.MODEL_ARCH.BERT
if arch == "NomicBertModel":
return gguf.MODEL_ARCH.NOMIC_BERT
+ if arch in "GemmaForCausalLM":
+ return gguf.MODEL_ARCH.GEMMA
raise NotImplementedError(f'Architecture "{arch}" not supported!')
@@ -1785,6 +1789,64 @@ class NomicBertModel(BertModel):
yield name, data
+class GemmaModel(Model):
+ def set_vocab(self):
+ self._set_vocab_sentencepiece()
+
+ def set_gguf_parameters(self):
+ hparams = self.hparams
+ block_count = hparams["num_hidden_layers"]
+
+ self.gguf_writer.add_name(self.dir_model.name)
+ self.gguf_writer.add_context_length(hparams["max_position_embeddings"])
+ self.gguf_writer.add_embedding_length(hparams["hidden_size"])
+ self.gguf_writer.add_block_count(block_count)
+ self.gguf_writer.add_feed_forward_length(hparams["intermediate_size"])
+ self.gguf_writer.add_head_count(hparams["num_attention_heads"])
+ self.gguf_writer.add_layer_norm_rms_eps(self.hparams["rms_norm_eps"])
+ self.gguf_writer.add_head_count_kv(self.hparams["num_key_value_heads"] if "num_key_value_heads" in hparams else hparams["num_attention_heads"])
+
+ def write_tensors(self):
+ block_count = self.hparams.get("n_layers", self.hparams.get("num_hidden_layers", self.hparams.get("n_layer")))
+ tensor_map = gguf.get_tensor_name_map(self.model_arch, block_count)
+
+ for name, data_torch in self.get_tensors():
+ # we don't need these
+ if name.endswith((".attention.masked_bias", ".attention.bias", ".attention.rotary_emb.inv_freq", ".attn.bias", ".attn.masked_bias")):
+ continue
+
+ # Huh? Why is this needed?
+ if name.endswith(("norm.weight")):
+ data_torch = data_torch + 1
+
+ old_dtype = data_torch.dtype
+
+ # convert any unsupported data types to float32
+ if data_torch.dtype not in (torch.float16, torch.float32):
+ data_torch = data_torch.to(torch.float32)
+
+ data = data_torch.squeeze().numpy()
+
+ # map tensor names
+ new_name = tensor_map.get_name(name, try_suffixes=(".weight", ".bias"))
+ if new_name is None:
+ print(f"Can not map tensor {name!r}")
+ sys.exit()
+
+ n_dims = len(data.shape)
+ data_dtype = data.dtype
+
+ data = data.astype(np.float32)
+
+ # if f16 desired, convert any float32 2-dim weight tensors to float16
+ if self.ftype == 1 and data_dtype == np.float32 and name.endswith(".weight") and n_dims == 2:
+ data = data.astype(np.float16)
+
+ print(f"{new_name}, n_dims = {n_dims}, {old_dtype} --> {data.dtype}")
+
+ self.gguf_writer.add_tensor(new_name, data)
+
```

Edit: Ah, there it is:

Edit2: here is a PR with the conversion script: |
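For what it's worth, a plausible explanation for the `+ 1` on the norm weights (an assumption, not something confirmed in this thread): the Hugging Face Gemma implementation scales the RMS-normalized activations by `(1 + weight)`, while a standard RMSNorm scales by the weight directly, so exporting `weight + 1` makes the two equivalent:

```python
# Illustrative sketch only, not the actual HF or llama.cpp implementations.
import torch

def rms(x, eps=1e-6):
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

def gemma_style_norm(x, w):      # scale by (1 + w), as assumed for HF Gemma
    return rms(x) * (1.0 + w)

def standard_rmsnorm(x, w):      # scale by w, as in a standard RMSNorm
    return rms(x) * w

x, w = torch.randn(4, 8), torch.randn(8)
# Folding the +1 into the exported tensor makes the outputs match:
assert torch.allclose(gemma_style_norm(x, w), standard_rmsnorm(x, w + 1.0))
```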
so fast!!! |
Google has already provided a float32 GGUF in the HF repo. |
@ggerganov could we add a check against the bfloat16 dtype? |
What checks do you have in mind specifically? |
Google uses bf16. But I personally believe there is not too much degradation after converting from bf16 to float16. I guess in the converter, anything that is not float16 gets converted to float32 (it would be even better to use float16 when no overflow happens). If we check whether tensor.dtype is bf16 and keep it as fp16, we will have a 17 GB GGUF file instead of a 34 GB GGUF file. |
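A rough sketch of the kind of check being suggested (a hypothetical helper, not code from `convert-hf-to-gguf.py`): cast a bf16 tensor to f16 only when its values fit in the f16 range, and fall back to f32 otherwise:

```python
# Hypothetical: choose an export dtype for a bf16 tensor.
import numpy as np
import torch

F16_MAX = float(np.finfo(np.float16).max)  # ~65504; bf16 can hold much larger values

def downcast_bf16(t: torch.Tensor) -> np.ndarray:
    data = t.to(torch.float32).numpy()
    if np.all(np.abs(data) <= F16_MAX):
        return data.astype(np.float16)   # fits: roughly half the file size
    return data                          # would overflow: keep float32
```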
Team, thank you for integrating Gemma support into llama.cpp yesterday - this was an extremely fast and efficient turnaround for a model that came out only a couple of hours earlier. |
Trying out the F32 ggml-7b-it.gguf provided by Google, I'm getting a perplexity of "nan" at 2048 context - around 20 for the first few chunks. Also around 25 for the first chunk at 8192 context. For reference, llama-2-7b Q4_0 perplexity is about 5.16 at 4096 context. @postmasters Are you sure the implementation is correct? |
DANtm in #5635 (comment) suggested that setting |
With the gemma-7b.gguf base model (without instruction tuning), converted to f16, I get 6.5376 PPL at 2048 context and 6.2240 at 8192 context. |
Weird. Here is what I just tried:
And there's no point in running it longer than that because the running average will stay NaN. |
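As a simplified illustration of why (this is not the actual perplexity code): the reported perplexity is a running average over chunks, so a single NaN chunk makes every subsequent value NaN as well:

```python
# Toy running-perplexity loop with made-up per-chunk NLL values.
import math

chunk_nlls = [1.73, 1.92, float("nan"), 1.81, 1.88]

total = 0.0
for i, nll in enumerate(chunk_nlls, start=1):
    total += nll
    print(f"[{i}] running PPL = {math.exp(total / i):.4f}")
# The output turns to nan at chunk 3 and stays nan afterwards.
```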
I didn't convert from the HF model; I downloaded the fp32 GGUF and converted it to fp16 with the |
It works with Metal and CPU using the converted model from HF data:

```
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Ultra
ggml_metal_init: picking default device: Apple M2 Ultra
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/ggerganov/development/github/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M2 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 154618.82 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 896.00 MiB, (17182.56 / 147456.00)
llama_kv_cache_init: Metal KV buffer size = 896.00 MiB
llama_new_context_with_model: KV self size = 896.00 MiB, K (f16): 448.00 MiB, V (f16): 448.00 MiB
llama_new_context_with_model: CPU input buffer size = 11.02 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 506.00 MiB, (17688.56 / 147456.00)
llama_new_context_with_model: Metal compute buffer size = 506.00 MiB
llama_new_context_with_model: CPU compute buffer size = 6.00 MiB
llama_new_context_with_model: graph splits (measure): 3
system_info: n_threads = 16 / 24 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 680.214 ms
perplexity: calculating perplexity over 142 chunks, batch_size=512
perplexity: 2.20 seconds per pass - ETA 5.20 minutes
[1]5.6440,[2]6.8762,[3]7.2915,[4]6.5856,[5]6.1074,^C
```

I haven't tried CUDA. |
With the 7B
[1]18.7209,[2]24.9808,[3]27.5477,[4]24.3773,[5]23.0816,[6]18.8618,[7]17.1438,[8]16.7933,[9]17.6730,[10]17.7396,[11]17.5431,[12]18.4702,[13]18.4808,[14]18.8670,[15]19.0879,[16]19.8607,[17]19.9045,[18]20.0591,[19]19.5361,[20]19.5234,[21]19.4770,[22]19.6030,[23]19.7331,[24]19.7435,[25]20.3328,[26]20.3213,[27]21.0581,[28]21.2913,[29]21.3306,[30]21.3418,[31]21.2371,[32]20.8257,[33]20.9147,[34]20.7695,[35]20.3417,[36]19.8174,[37]19.3932,[38]19.1021,[39]18.6685,[40]18.3326,[41]18.4187,[42]18.7619,[43]19.1801,[44]19.2661,[45]19.5229,[46]19.7044,[47]19.8079,[48]19.8868,[49]19.7144,[50]19.8063,[51]19.6146,[52]19.4874,[53]19.3591,[54]19.1463,[55]19.0401,[56]18.7300,[57]18.7026,[58]18.6677,[59]18.7592,[60]18.9321,[61]19.0706,[62]19.3321,[63]19.3754,[64]19.1777,[65]19.1974,[66]19.0348,[67]18.9782,[68]18.9456,[69]18.7255,[70]18.6829,[71]18.9174,[72]19.0924,[73]19.0105,[74]19.0003,[75]19.0218,[76]19.0347,[77]19.0468,[78]19.0965,[79]19.2280,[80]19.1060,[81]19.0114,[82]18.8994,[83]18.8548,[84]18.8064,[85]18.6885,[86]18.6932,[87]18.7991,[88]18.8360,[89]18.9913,[90]19.2310,[91]19.3942,[92]19.4716,[93]19.6063,[94]19.7548,[95]19.8119,[96]19.8152,[97]19.8260,[98]19.8841,[99]19.8545,[100]19.8863,[101]19.9525,[102]20.0558,[103]20.1068,[104]20.1235,[105]20.1545,[106]20.1087,[107]20.1187,[108]20.1356,[109]20.0118,[110]19.9838,[111]19.9104,[112]19.9309,[113]19.9644,[114]20.0094,[115]19.9969,[116]19.9885,[117]19.9382,[118]19.9754,[119]19.9458,[120]19.8893,[121]19.8274,[122]19.8208,[123]19.8898,[124]19.8708,[125]19.8470,[126]19.7959,[127]19.7669,[128]19.8034,[129]19.7301,[130]19.7389,[131]19.7456,[132]19.8056,[133]19.8714,[134]19.7818,[135]19.5798,[136]19.6104,[137]19.6654,[138]19.7277,[139]19.7196,[140]19.7987,[141]19.7966,[142]19.8815,
[1]20.1795,[2]26.9640,[3]29.7014,[4]26.0533,[5]24.6819,[6]20.1910,[7]18.3000,[8]17.9538,[9]18.9682,[10]19.0602,[11]18.8272,[12]19.8369,[13]19.8639,[14]20.2951,[15]20.5834,[16]21.4171,[17]21.4557,[18]21.5971,[19]21.0191,[20]21.0192,[21]20.9496,[22]21.0968,[23]21.2142,[24]21.2150,[25]21.8738,[26]21.8679,[27]22.6807,[28]22.9258,[29]22.9705,[30]22.9568,[31]22.8573,[32]22.4253,[33]22.5072,[34]22.3343,[35]21.8594,[36]21.2822,[37]20.8135,[38]20.5065,[39]20.0285,[40]19.6789,[41]19.7705,[42]20.1286,[43]20.5842,[44]20.6894,[45]20.9686,[46]21.1656,[47]21.2816,[48]21.3709,[49]21.1856,[50]21.2821,[51]21.0609,[52]20.9323,[53]20.8012,[54]20.5745,[55]20.4565,[56]20.1180,[57]20.0956,[58]20.0513,[59]20.1503,[60]20.3293,[61]20.4832,[62]20.7717,[63]20.8173,[64]20.6016,[65]20.6253,[66]20.4452,[67]20.3804,[68]20.3380,[69]20.0944,[70]20.0470,[71]20.3014,[72]20.4836,[73]20.3948,[74]20.3817,[75]20.4047,[76]20.4281,[77]20.4366,[78]20.4864,[79]20.6221,[80]20.4927,[81]20.3864,[82]20.2693,[83]20.2190,[84]20.1664,[85]20.0367,[86]20.0338,[87]20.1434,[88]20.1821,[89]20.3557,[90]20.6163,[91]20.7865,[92]20.8807,[93]21.0238,[94]21.1747,[95]21.2370,[96]21.2429,[97]21.2521,[98]21.3145,[99]21.2806,[100]21.3140,[101]21.3920,[102]21.5062,[103]21.5592,[104]21.5844,[105]21.6126,[106]21.5651,[107]21.5795,[108]21.5964,[109]21.4639,[110]21.4302,[111]21.3496,[112]21.3736,[113]21.4072,[114]21.4540,[115]21.4443,[116]21.4348,[117]21.3798,[118]21.4199,[119]21.3902,[120]21.3283,[121]21.2562,[122]21.2507,[123]21.3275,[124]21.3050,[125]21.2840,[126]21.2304,[127]21.2018,[128]21.2447,[129]21.1617,[130]21.1693,[131]21.1789,[132]21.2450,[133]21.3154,[134]21.2170,[135]20.9955,[136]21.0295,[137]21.0900,[138]21.1598,[139]21.1552,[140]21.2385,[141]21.2370,[142]21.3300, |
Fun fact: This will leave you with a Q6_K output tensor unless you pass
I've discovered that these NaNs occur with -ngl 2 and above, but not with -ngl 1 or with --no-kv-offload. I can reproduce them on my P40 with either the FP16 converted from safetensors, or the FP16 quantized from Google's provided GGUF. @slaren I wonder if you can reproduce if you build with |
I tried with ... |