
Commit e26e218

llama : add support for GLM-Edge and GLM-Edge-V series models (ggml-org#10573)
* add glm edge chat model
* use config partial_rotary_factor as rope ratio
* support for glm edge model
* vision model support
* remove debug info
* fix format
* llava.cpp trailing whitespace
* remove unused AutoTokenizer
* Update src/llama.cpp for not contain <|end|> or </s>
  Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* add edge template
* fix chat template
* fix confict
* fix confict
* fix ci err
* fix format err
* fix template err
* 9b hf chat support
* format
* format clip.cpp
* fix format
* Apply suggestions from code review
* Apply suggestions from code review
* Update examples/llava/clip.cpp
* fix format
* minor : style

---------

Co-authored-by: liyuhang <yuhang.li@zhipuai.cn>
Co-authored-by: piDack <pcdack@hotmail.co>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: liyuhang <yuhang.li@aminer.cn>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
1 parent 10c66e5, commit e26e218

15 files changed: +568 -67 lines

README.md

Lines changed: 2 additions & 1 deletion
@@ -96,7 +96,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
 - [x] [Bitnet b1.58 models](https://huggingface.co/1bitLLM)
 - [x] [Flan T5](https://huggingface.co/models?search=flan-t5)
 - [x] [Open Elm models](https://huggingface.co/collections/apple/openelm-instruct-models-6619ad295d7ae9f868b759ca)
-- [x] [ChatGLM3-6b](https://huggingface.co/THUDM/chatglm3-6b) + [ChatGLM4-9b](https://huggingface.co/THUDM/glm-4-9b)
+- [x] [ChatGLM3-6b](https://huggingface.co/THUDM/chatglm3-6b) + [ChatGLM4-9b](https://huggingface.co/THUDM/glm-4-9b) + [GLMEdge-1.5b](https://huggingface.co/THUDM/glm-edge-1.5b-chat) + [GLMEdge-4b](https://huggingface.co/THUDM/glm-edge-4b-chat)
 - [x] [SmolLM](https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966)
 - [x] [EXAONE-3.0-7.8B-Instruct](https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct)
 - [x] [FalconMamba Models](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a)
@@ -117,6 +117,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
 - [x] [Mini CPM](https://huggingface.co/models?search=MiniCPM)
 - [x] [Moondream](https://huggingface.co/vikhyatk/moondream2)
 - [x] [Bunny](https://github.com/BAAI-DCAI/Bunny)
+- [x] [GLM-EDGE](https://huggingface.co/models?search=glm-edge)
 - [x] [Qwen2-VL](https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d)
 
 </details>

convert_hf_to_gguf.py

Lines changed: 15 additions & 43 deletions
@@ -648,7 +648,7 @@ def get_vocab_base_pre(self, tokenizer) -> str:
         if chkhsh == "7967bfa498ade6b757b064f31e964dddbb80f8f9a4d68d4ba7998fcf281c531a":
             # ref: https://huggingface.co/jinaai/jina-embeddings-v2-base-code
             res = "jina-v2-code"
-        if chkhsh == "b6e8e1518dc4305be2fe39c313ed643381c4da5db34a98f6a04c093f8afbe99b":
+        if chkhsh == "b6e8e1518dc4305be2fe39c313ed643381c4da5db34a98f6a04c093f8afbe99b" or chkhsh == "81d72c7348a9f0ebe86f23298d37debe0a5e71149e29bd283904c02262b27516":
             # ref: https://huggingface.co/THUDM/glm-4-9b-chat
             res = "chatglm-bpe"
         if chkhsh == "7fc505bd3104ca1083b150b17d088b59534ede9bde81f0dd2090967d7fe52cee":
@@ -4513,7 +4513,7 @@ def prepare_tensors(self):
         self.gguf_writer.add_max_alibi_bias(self.max_alibi_bias)
 
 
-@Model.register("ChatGLMModel", "ChatGLMForConditionalGeneration")
+@Model.register("GlmForCausalLM", "ChatGLMModel", "ChatGLMForConditionalGeneration")
 class ChatGLMModel(Model):
     model_arch = gguf.MODEL_ARCH.CHATGLM
 
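
Registering the extra `"GlmForCausalLM"` name lets GLM-Edge checkpoints, whose `config.json` presumably declares that architecture, reuse the existing `ChatGLMModel` converter. A rough sketch of the registration idea, under assumed mechanics rather than the converter's exact code:

```python
# Sketch of an architecture-to-converter registry (assumed mechanics):
# the decorator keys a converter class by the HF "architectures" strings it handles.
_model_classes: dict[str, type] = {}

def register(*names: str):
    def wrapper(cls: type) -> type:
        for name in names:
            _model_classes[name] = cls
        return cls
    return wrapper

@register("GlmForCausalLM", "ChatGLMModel", "ChatGLMForConditionalGeneration")
class ChatGLMConverterSketch:
    pass

# A GLM-Edge config reporting "GlmForCausalLM" now resolves to the same converter.
assert _model_classes["GlmForCausalLM"] is ChatGLMConverterSketch
```
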
@@ -4619,47 +4619,15 @@ def set_vocab(self):
 
         from transformers import AutoTokenizer
         tokenizer = AutoTokenizer.from_pretrained(dir_model, trust_remote_code=True)
-        vocab_size = hparams["padded_vocab_size"]
+        vocab_size = hparams.get("padded_vocab_size",hparams["vocab_size"])
         assert max(tokenizer.get_vocab().values()) < vocab_size
 
-        tokpre = self.get_vocab_base_pre(tokenizer)
-
-        merges = []
-        vocab = {}
-        mergeable_ranks = tokenizer.mergeable_ranks
-        for token, rank in mergeable_ranks.items():
-            vocab[ChatGLMModel.token_bytes_to_string(token)] = rank
-            if len(token) == 1:
-                continue
-            merged = ChatGLMModel.bpe(mergeable_ranks, token, max_rank=rank)
-            assert len(merged) >= 2 and len(merged) <= 7
-            merges.append(' '.join(map(ChatGLMModel.token_bytes_to_string, merged)))
-
-        # for this kind of tokenizer, added_vocab is not a subset of vocab, so they need to be combined
-        added_vocab = tokenizer.get_added_vocab()
-        reverse_vocab = {id_ : encoded_tok for encoded_tok, id_ in {**vocab, **added_vocab}.items()}
-
-        for i in range(vocab_size):
-            if i not in reverse_vocab:
-                tokens.append(f"[PAD{i}]")
-                toktypes.append(gguf.TokenType.UNUSED)
-            elif reverse_vocab[i] in added_vocab:
-                tokens.append(reverse_vocab[i])
-                if tokenizer.added_tokens_decoder[i].special:
-                    toktypes.append(gguf.TokenType.CONTROL)
-                else:
-                    toktypes.append(gguf.TokenType.USER_DEFINED)
-            else:
-                tokens.append(reverse_vocab[i])
-                toktypes.append(gguf.TokenType.NORMAL)
-
+        tokens, toktypes, tokpre = self.get_vocab_base()
         self.gguf_writer.add_tokenizer_model("gpt2")
         self.gguf_writer.add_tokenizer_pre(tokpre)
         self.gguf_writer.add_token_list(tokens)
         self.gguf_writer.add_token_types(toktypes)
-
-        special_vocab = gguf.SpecialVocab(dir_model, load_merges=False)
-        special_vocab.merges = merges
+        special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=True)
         # only add special tokens when they were not already loaded from config.json
         special_vocab._set_special_token("eos", tokenizer.get_added_vocab()["<|endoftext|>"])
         special_vocab._set_special_token("eot", tokenizer.get_added_vocab()["<|user|>"])
@@ -4670,16 +4638,20 @@ def set_vocab(self):
     def set_gguf_parameters(self):
         n_embed = self.hparams.get("hidden_size", self.hparams.get("n_embed"))
         n_head = self.hparams.get("n_head", self.hparams.get("num_attention_heads"))
-        n_head_kv = self.hparams.get("multi_query_group_num", n_head)
+        n_head_kv = self.hparams.get("multi_query_group_num", self.hparams.get("num_key_value_heads", n_head))
         self.gguf_writer.add_context_length(self.hparams.get("seq_length", n_embed))
         self.gguf_writer.add_embedding_length(n_embed)
-        self.gguf_writer.add_feed_forward_length(self.hparams.get("ffn_hidden_size", 4 * n_embed))
-        self.gguf_writer.add_block_count(self.hparams["num_layers"])
+        self.gguf_writer.add_feed_forward_length(self.hparams.get("ffn_hidden_size", self.hparams.get("intermediate_size", 4 * n_embed)))
+        self.gguf_writer.add_block_count(self.hparams.get("num_layers", self.hparams["num_hidden_layers"]))
         self.gguf_writer.add_head_count(n_head)
         self.gguf_writer.add_head_count_kv(n_head_kv)
-        self.gguf_writer.add_layer_norm_rms_eps(self.hparams["layernorm_epsilon"])
+        self.gguf_writer.add_layer_norm_rms_eps(self.hparams.get("layernorm_epsilon",1e-5))
         self.gguf_writer.add_file_type(self.ftype)
-        self.gguf_writer.add_rope_dimension_count(64)
+        if "attention_dim" in self.hparams:
+            rope_dim = self.hparams["attention_dim"]
+        else:
+            rope_dim = self.hparams["hidden_size"] // self.hparams["num_attention_heads"]
+        self.gguf_writer.add_rope_dimension_count(int(rope_dim * self.hparams.get("partial_rotary_factor", 0.5)))
         self.gguf_writer.add_add_bos_token(False)
         rope_freq = 10000
         if "rope_ratio" in self.hparams:
@@ -4689,7 +4661,7 @@ def set_gguf_parameters(self):
     def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
         del bid  # unused
 
-        if name.endswith(".rotary_pos_emb.inv_freq"):
+        if name.endswith(".rotary_pos_emb.inv_freq") or name.startswith("model.vision."):
             return []
 
         name = name.removeprefix("transformer.")
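
Vision-tower tensors are skipped here because the image encoder is exported separately as the mmproj GGUF (see the new README-glmedge.md below). A small sketch of the filter's effect; the tensor names are hypothetical examples, not taken from a real checkpoint:

```python
# Sketch of the extended tensor filter; the names below are hypothetical examples.
def keep_tensor(name: str) -> bool:
    return not (name.endswith(".rotary_pos_emb.inv_freq") or name.startswith("model.vision."))

names = [
    "model.vision.patch_embedding.weight",    # vision tower -> excluded from the LLM GGUF
    "transformer.rotary_pos_emb.inv_freq",    # derived buffer -> excluded
    "transformer.encoder.layers.0.self_attention.query_key_value.weight",
]
print([n for n in names if keep_tensor(n)])   # only the attention weight remains
```
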

examples/llava/README-glmedge.md

Lines changed: 43 additions & 0 deletions
New file:

# GLMV-EDGE

Currently this implementation supports [glm-edge-v-2b](https://huggingface.co/THUDM/glm-edge-v-2b) and [glm-edge-v-5b](https://huggingface.co/THUDM/glm-edge-v-5b).

## Usage
Build with cmake or run `make llama-llava-cli` to build it.

After building, run `./llama-llava-cli` to see the usage. For example:

```sh
./llama-llava-cli -m model_path/ggml-model-f16.gguf --mmproj model_path/mmproj-model-f16.gguf --image img_path/image.jpg -p "<|system|>\n system prompt <image><|user|>\n prompt <|assistant|>\n"
```

**note**: A lower temperature like 0.1 is recommended for better quality; add `--temp 0.1` to the command to do so.
**note**: For GPU offloading, make sure to use the `-ngl` flag as usual.

## GGUF conversion

1. Clone a GLMV-EDGE model ([2B](https://huggingface.co/THUDM/glm-edge-v-2b) or [5B](https://huggingface.co/THUDM/glm-edge-v-5b)). For example:

```sh
git clone https://huggingface.co/THUDM/glm-edge-v-5b
# or
git clone https://huggingface.co/THUDM/glm-edge-v-2b
```

2. Use `glmedge-surgery.py` to split the GLMV-EDGE model into its LLM and multimodal-projector constituents:

```sh
python ./examples/llava/glmedge-surgery.py -m ../model_path
```

3. Use `glmedge-convert-image-encoder-to-gguf.py` to convert the GLMV-EDGE image encoder to GGUF:

```sh
python ./examples/llava/glmedge-convert-image-encoder-to-gguf.py -m ../model_path --llava-projector ../model_path/glm.projector --output-dir ../model_path
```

4. Use `examples/convert_hf_to_gguf.py` to convert the LLM part of GLMV-EDGE to GGUF:

```sh
python convert_hf_to_gguf.py ../model_path
```

Now both the LLM part and the image encoder are in the `model_path` directory.
