Skip to content

Commit a970515

Browse files
sfallahbluebreadCISCngxson
authored
mtmd: Add DeepSeekOCR Support (ggml-org#17400)
* mtmd: llama.cpp DeepSeekOCR support init commit * loading sam tensors * mtmd: fix vision model processing * deepseek-ocr clip-vit model impl * mtmd: add DeepSeek-OCR LM support with standard attention * mtmd: successfully runs DeepSeek-OCR LM in llama-cli * mtmd: Fix RoPE type for DeepSeek-OCR LM. * loading LM testing Vision model loading * sam warmup working * sam erroneous return corrected * clip-vit: corrected cls_embd concat * clip-vit: model convert qkv_proj split * corrected combining of image encoders' results * fix: update callback for ffn_moe_weighted and add callback for attn_out in deepseek2 model * concat image_newline and image_seperator tokens * visual_model warmup (technically) works * window partitioning using standard ggml ops * sam implementation without using CPU only ops * clip: fixed warnings * Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into sf/deepseek-ocr * mtmd: fix get_rel_pos * mtmd: fixed the wrong scaler for get_rel_pos * image encoding technically works but the output can't be checked singe image decoding fails * mtmd: minor changed * mtmd: add native resolution support * - image encoding debugged - issues fixed mainly related wrong config like n_patches etc. - configs need to be corrected in the converter * mtmd: correct token order * - dynamic resizing - changes are concerning PR sfallah#4 * mtmd: quick fix token order * mtmd: fix danling pointer * mtmd: SAM numerically works * mtmd: debug CLIP-L (vit_pre_ln) * mtmd: debug CLIP-L & first working DeepSeek-OCR model * mtmd : add --dsocr-mode CLI argument for DeepSeek-OCR resolution control & all native resolution modes work * mtmd: simplify SAM patch embedding * mtmd: adapt Pillow image resizing function * mtmd: simplify DeepSeek-OCR dynamic resolution preprocessing * mtmd: remove --dsocr-mode argument * mtmd: refactor code & remove unused helper functions * mtmd: fix tensor names for image newlines and view separator * clean up * reverting automatically removed spaces * reverting automatically removed spaces * mtmd: fixed bad ocr check in Deepseek2 (LM) * mtmd: support combined QKV projection in buid_vit * using common build_attn in sam * corrected code-branch when flash-attn disabled enabling usage of --flash-attn option * mtmd: minor fix * minor formatting and style * fixed flake8 lint issues * minor editorconfig-check fixes * minor editorconfig-check fixes * mtmd: simplify get_rel_pos * mtmd: make sam hparams configurable * mtmd: add detailed comments for resize_bicubic_pillow * mtmd: fixed wrong input setting * mtmd: convert model in FP16 * mtmd: minor fix * mtmd: remove tweak to llama-mtmd-cli & deepseek-ocr template * fix: test-1.jpg ORC issue with small (640) resolution setting min-resolution base (1024) max large (1280) for dynamic-resolution * minor: editconfig-check fix * merge with changes from ggml-org#17909 added new opt to tests.sh to disable flash-attn * minor: editconfig-check fix * testing deepseek-ocr quick and dirty test script comparing results of Qwen2.5-VL vs DeepSeek-OCR * quick and (potential) dirty merge with ggml-org#17909 * refactoring, one single builder function and static helpers * added deepseek-ocr test to tests.sh * minor formatting fixes * check with fixed expected resutls * minor formatting * editorconfig-check fix * merge with changes from ggml-org#18042 * minor - added GLM-4.6V to big tests - added missing deps for python test * convert: minor fix * mtmd: format code * convert: quick fix * convert: quick fix * minor python formatting * fixed merge build issue * merge resolved - fixed issues in convert - tested several deepseek models * minor fix * minor * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * - removed clip_is_deepseekocr - removed redundant RESIZE_ALGO_BICUBIC_PILLOW resize-algo - simplified image-preprocessing - removed/simplified debug functions * - cleaning commented out code * fixing instabilities issues reintroducing resize_bicubic_pillow * - use f16 model for deepseek-ocr test - ignore llama-arch test for deepseek-ocr * rename fc_w --> mm_fc_w * add links to OCR discussion * cleaner loading code * add missing .weight to some tensors * add default jinja template (to be used by server) * move test model to ggml-org * rolling back upscale change * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: bluebread <hotbread70127@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
1 parent 056b50c commit a970515

30 files changed

Lines changed: 1569 additions & 27 deletions

convert_hf_to_gguf.py

Lines changed: 103 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -947,6 +947,9 @@ def load_hparams(dir_model: Path, is_mistral_format: bool):
947947
if "thinker_config" in config:
948948
# rename for Qwen2.5-Omni
949949
config["text_config"] = config["thinker_config"]["text_config"]
950+
if "language_config" in config:
951+
# rename for DeepSeekOCR
952+
config["text_config"] = config["language_config"]
950953
if "lfm" in config:
951954
# rename for LFM2-Audio
952955
config["text_config"] = config["lfm"]
@@ -2074,7 +2077,7 @@ class MmprojModel(ModelBase):
20742077
preprocessor_config: dict[str, Any]
20752078
global_config: dict[str, Any]
20762079

2077-
n_block_keys = ["n_layers", "num_hidden_layers", "n_layer", "num_layers", "depth", "encoder_layers", "vt_num_hidden_layers"]
2080+
n_block_keys = ["n_layers", "num_hidden_layers", "n_layer", "num_layers", "depth", "layers", "encoder_layers", "vt_num_hidden_layers"]
20782081

20792082
has_vision_encoder: bool = True # by default
20802083
has_audio_encoder: bool = False
@@ -6938,6 +6941,68 @@ def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iter
69386941
yield from super().modify_tensors(data_torch, name, bid)
69396942

69406943

6944+
@ModelBase.register("DeepseekOCRForCausalLM")
6945+
class DeepseekOCRVisionModel(MmprojModel):
6946+
def set_gguf_parameters(self):
6947+
super().set_gguf_parameters()
6948+
hparams = self.hparams
6949+
self.gguf_writer.add_clip_projector_type(gguf.VisionProjectorType.DEEPSEEKOCR)
6950+
# default values below are taken from HF tranformers code
6951+
self.gguf_writer.add_vision_attention_layernorm_eps(hparams.get("layer_norm_eps", 1e-6))
6952+
self.gguf_writer.add_vision_use_gelu(True)
6953+
# calculate proj_scale_factor (used by tinygemma3 test model)
6954+
image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
6955+
n_per_side = int(image_seq_length ** 0.5)
6956+
image_size = self.hparams["image_size"]
6957+
patch_size = self.hparams["patch_size"]
6958+
proj_scale_factor = (image_size // patch_size) // n_per_side
6959+
if proj_scale_factor > 0 and proj_scale_factor != 4:
6960+
# we only need to write this if it's not the default value
6961+
# in this case, we are converting a test model
6962+
self.gguf_writer.add_vision_projector_scale_factor(proj_scale_factor)
6963+
# @bluebread: there's no window_size in config but just add it here anyway
6964+
self.gguf_writer.add_vision_window_size(self.hparams.get("window_size", 14))
6965+
6966+
# SAM configuration
6967+
sam_hparams = hparams['sam']
6968+
self.gguf_writer.add_vision_sam_layers_count(sam_hparams['layers'])
6969+
self.gguf_writer.add_vision_sam_embedding_length(sam_hparams['width'])
6970+
self.gguf_writer.add_vision_sam_head_count(sam_hparams['heads'])
6971+
6972+
def get_vision_config(self) -> dict[str, Any]:
6973+
vision_config: dict[str, Any] | None = self.global_config.get("vision_config")
6974+
6975+
if not vision_config:
6976+
raise ValueError("DeepseekOCR model requires 'vision_config' in the model configuration, but it was not found")
6977+
6978+
vision_config['sam'] = vision_config['width']['sam_vit_b']
6979+
vision_config.update(vision_config['width']['clip-l-14-224'])
6980+
vision_config['hidden_size'] = vision_config['width']
6981+
vision_config['num_heads'] = vision_config['heads']
6982+
vision_config['intermediate_size'] = vision_config['heads'] * 4
6983+
6984+
return vision_config
6985+
6986+
def tensor_force_quant(self, name, new_name, bid, n_dims):
6987+
if ".embeddings." in name or 'pos_embed' in name:
6988+
return gguf.GGMLQuantizationType.F32
6989+
if ".rel_pos_h" in name or '.rel_pos_w' in name:
6990+
return gguf.GGMLQuantizationType.F32
6991+
return super().tensor_force_quant(name, new_name, bid, n_dims)
6992+
6993+
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
6994+
# Only process vision-related tensors, skip language model tensors
6995+
# Vision components: sam_model, vision_model, projector, image_newline, view_seperator
6996+
# Language model components to skip: lm_head, embed_tokens, layers, norm
6997+
if name.startswith(("lm_head.", "model.embed_tokens.", "model.layers.", "model.norm.")):
6998+
return
6999+
7000+
if name.endswith("pos_embed") or name.endswith("rel_pos_h") or name.endswith("rel_pos_w"):
7001+
name += ".weight"
7002+
7003+
yield from super().modify_tensors(data_torch, name, bid)
7004+
7005+
69417006
@ModelBase.register("Gemma3nForConditionalGeneration")
69427007
class Gemma3nVisionAudioModel(ConformerAudioModel):
69437008
has_audio_encoder = True
@@ -8283,6 +8348,19 @@ class DeepseekV2Model(TextModel):
82838348

82848349
merge_expert = True
82858350

8351+
def __init__(self, *args, **kwargs):
8352+
super().__init__(*args, **kwargs)
8353+
hparams: dict = ModelBase.load_hparams(self.dir_model, is_mistral_format=False)
8354+
self.origin_hf_arch = hparams.get('architectures', [None])[0]
8355+
8356+
# special handling for Deepseek OCR
8357+
if self.origin_hf_arch == "DeepseekOCRForCausalLM":
8358+
self.model_arch = gguf.MODEL_ARCH.DEEPSEEK2OCR
8359+
self.gguf_writer.arch = gguf.MODEL_ARCH_NAMES[self.model_arch]
8360+
self.gguf_writer.add_architecture()
8361+
# default jinja template
8362+
self.gguf_writer.add_chat_template("{% for m in messages %}{{m['content']}}{% endfor %}")
8363+
82868364
def set_vocab(self):
82878365
try:
82888366
self._set_vocab_gpt2()
@@ -8338,9 +8416,15 @@ def set_vocab(self):
83388416
raise NotImplementedError(f"Deepseek pre-tokenizer {tokpre!r} is not supported yet!")
83398417

83408418
def set_gguf_parameters(self):
8419+
is_ocr = (self.model_arch == gguf.MODEL_ARCH.DEEPSEEK2OCR)
83418420

8342-
# note: deepseek2 using MLA converts into MQA (ie: GQA with 1 group)
8343-
self.hparams["num_key_value_heads"] = 1
8421+
if is_ocr:
8422+
self.hparams['rope_theta'] = self.hparams.get('rope_theta', 10000.0)
8423+
else:
8424+
# note: deepseek2 using MLA converts into MQA (ie: GQA with 1 group)
8425+
self.hparams["num_key_value_heads"] = 1
8426+
8427+
self.hparams['rms_norm_eps'] = self.hparams.get('rms_norm_eps', 1e-6)
83448428

83458429
super().set_gguf_parameters()
83468430
hparams = self.hparams
@@ -8354,16 +8438,18 @@ def set_gguf_parameters(self):
83548438
# Default: if no MoE, all layers are dense; if MoE, none are dense
83558439
first_k_dense_replace = hparams["num_hidden_layers"] if not has_moe else 0
83568440
self.gguf_writer.add_leading_dense_block_count(first_k_dense_replace)
8441+
kv_lora_rank = hparams.get("kv_lora_rank", 512)
83578442
self.gguf_writer.add_vocab_size(hparams["vocab_size"])
83588443
if "q_lora_rank" in hparams and hparams["q_lora_rank"] is not None:
83598444
self.gguf_writer.add_q_lora_rank(hparams["q_lora_rank"])
8360-
self.gguf_writer.add_kv_lora_rank(hparams["kv_lora_rank"])
83618445

83628446
# note: deepseek2 using MLA converts into MQA with larger heads, then decompresses to MHA
8363-
self.gguf_writer.add_key_length(hparams["kv_lora_rank"] + hparams["qk_rope_head_dim"])
8364-
self.gguf_writer.add_value_length(hparams["kv_lora_rank"])
8365-
self.gguf_writer.add_key_length_mla(hparams["qk_nope_head_dim"] + hparams["qk_rope_head_dim"])
8366-
self.gguf_writer.add_value_length_mla(hparams["v_head_dim"])
8447+
if not is_ocr:
8448+
self.gguf_writer.add_kv_lora_rank(kv_lora_rank)
8449+
self.gguf_writer.add_key_length(kv_lora_rank + hparams["qk_rope_head_dim"])
8450+
self.gguf_writer.add_value_length(kv_lora_rank)
8451+
self.gguf_writer.add_key_length_mla(hparams["qk_nope_head_dim"] + hparams["qk_rope_head_dim"])
8452+
self.gguf_writer.add_value_length_mla(hparams["v_head_dim"])
83678453

83688454
# MoE parameters (required by C++ code for DEEPSEEK2 arch)
83698455
# For non-MoE models like Youtu, use intermediate_size as expert_feed_forward_length
@@ -8395,8 +8481,15 @@ def set_gguf_parameters(self):
83958481
_experts: list[dict[str, Tensor]] | None = None
83968482

83978483
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
8398-
# skip vision tensors and remove "language_model." for Kimi-VL and Kimi-K2.5
8399-
if "vision_tower" in name or "multi_modal_projector" in name or "mm_projector" in name:
8484+
# skip vision tensors and remove "language_model." for Kimi-VL and Kimi-K2.5, and DeepSeek-OCR
8485+
if ("vision_tower" in name
8486+
or "multi_modal_projector" in name
8487+
or "mm_projector" in name
8488+
or "vision_model" in name
8489+
or "image_newline" in name
8490+
or "model.projector" in name
8491+
or "sam_model" in name
8492+
or "view_seperator" in name):
84008493
return
84018494
if name.startswith("siglip2.") or name.startswith("merger."):
84028495
return

docs/multimodal.md

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -31,6 +31,13 @@ llama-server -m gemma-3-4b-it-Q4_K_M.gguf --mmproj mmproj-gemma-3-4b-it-Q4_K_M.g
3131
llama-server -hf ggml-org/gemma-3-4b-it-GGUF --no-mmproj-offload
3232
```
3333

34+
> [!IMPORTANT]
35+
>
36+
> OCR models are trained with specific prompt and input structure, please refer to these discussions for more info:
37+
> - PaddleOCR-VL: https://github.com/ggml-org/llama.cpp/pull/18825
38+
> - GLM-OCR: https://github.com/ggml-org/llama.cpp/pull/19677
39+
> - Deepseek-OCR: https://github.com/ggml-org/llama.cpp/pull/17400
40+
3441
## Pre-quantized models
3542

3643
These are ready-to-use models, most of them come with `Q4_K_M` quantization by default. They can be found at the Hugging Face page of the ggml-org: https://huggingface.co/collections/ggml-org/multimodal-ggufs-68244e01ff1f39e5bebeeedc

ggml/src/ggml.c

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4962,6 +4962,7 @@ static struct ggml_tensor * ggml_interpolate_impl(
49624962
GGML_ASSERT((mode & 0xFF) < GGML_SCALE_MODE_COUNT);
49634963
// TODO: implement antialias for modes other than bilinear
49644964
GGML_ASSERT(!(mode & GGML_SCALE_FLAG_ANTIALIAS) || (mode & 0xFF) == GGML_SCALE_MODE_BILINEAR);
4965+
GGML_ASSERT(a->type == GGML_TYPE_F32);
49654966

49664967
struct ggml_tensor * result = ggml_new_tensor_4d(ctx, a->type, ne0, ne1, ne2, ne3);
49674968

@@ -5307,6 +5308,7 @@ struct ggml_tensor * ggml_flash_attn_ext(
53075308
GGML_ASSERT(q->ne[3] == v->ne[3]);
53085309

53095310
if (mask) {
5311+
GGML_ASSERT(mask->type == GGML_TYPE_F16);
53105312
GGML_ASSERT(ggml_is_contiguous(mask));
53115313
//GGML_ASSERT(ggml_can_repeat_rows(mask, qk));
53125314

gguf-py/gguf/constants.py

Lines changed: 93 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -326,6 +326,11 @@ class Attention:
326326
class Projector:
327327
SCALE_FACTOR = "clip.vision.projector.scale_factor"
328328

329+
class SAM:
330+
BLOCK_COUNT = "clip.vision.sam.block_count"
331+
EMBEDDING_LENGTH = "clip.vision.sam.embedding_length"
332+
HEAD_COUNT = "clip.vision.sam.head_count"
333+
329334
class ClipAudio:
330335
PROJECTOR_TYPE = "clip.audio.projector_type" # for mixed modality models
331336
NUM_MEL_BINS = "clip.audio.num_mel_bins"
@@ -434,6 +439,7 @@ class MODEL_ARCH(IntEnum):
434439
ARCTIC = auto()
435440
DEEPSEEK = auto()
436441
DEEPSEEK2 = auto()
442+
DEEPSEEK2OCR = auto()
437443
CHATGLM = auto()
438444
GLM4 = auto()
439445
GLM4_MOE = auto()
@@ -755,6 +761,22 @@ class MODEL_TENSOR(IntEnum):
755761
V_MM_GATE = auto() # cogvlm
756762
V_TOK_BOI = auto() # cogvlm
757763
V_TOK_EOI = auto() # cogvlm
764+
V_SAM_POS_EMBD = auto() # Deepseek-OCR
765+
V_SAM_PATCH_EMBD = auto() # Deepseek-OCR
766+
V_SAM_PRE_NORM = auto() # Deepseek-OCR
767+
V_SAM_POST_NORM = auto() # Deepseek-OCR
768+
V_SAM_ATTN_POS_H = auto() # Deepseek-OCR
769+
V_SAM_ATTN_POS_W = auto() # Deepseek-OCR
770+
V_SAM_ATTN_QKV = auto() # Deepseek-OCR
771+
V_SAM_ATTN_OUT = auto() # Deepseek-OCR
772+
V_SAM_MLP_LIN_1 = auto() # Deepseek-OCR
773+
V_SAM_MLP_LIN_2 = auto() # Deepseek-OCR
774+
V_SAM_NECK = auto() # Deepseek-OCR
775+
V_SAM_NET_2 = auto() # Deepseek-OCR
776+
V_SAM_NET_3 = auto() # Deepseek-OCR
777+
V_ENC_EMBD_IMGNL = auto() # Deepseek-OCR
778+
V_ENC_EMBD_VSEP = auto() # Deepseek-OCR
779+
758780
# audio (mtmd)
759781
A_ENC_EMBD_POS = auto()
760782
A_ENC_EMBD_NORM = auto()
@@ -880,6 +902,7 @@ class MODEL_TENSOR(IntEnum):
880902
MODEL_ARCH.ARCTIC: "arctic",
881903
MODEL_ARCH.DEEPSEEK: "deepseek",
882904
MODEL_ARCH.DEEPSEEK2: "deepseek2",
905+
MODEL_ARCH.DEEPSEEK2OCR: "deepseek2-ocr",
883906
MODEL_ARCH.CHATGLM: "chatglm",
884907
MODEL_ARCH.GLM4: "glm4",
885908
MODEL_ARCH.GLM4_MOE: "glm4moe",
@@ -1199,6 +1222,22 @@ class MODEL_TENSOR(IntEnum):
11991222
MODEL_TENSOR.V_MM_GATE: "mm.gate",
12001223
MODEL_TENSOR.V_TOK_BOI: "v.boi",
12011224
MODEL_TENSOR.V_TOK_EOI: "v.eoi",
1225+
# DeepSeek-OCR SAM
1226+
MODEL_TENSOR.V_SAM_POS_EMBD: "v.sam.pos_embd",
1227+
MODEL_TENSOR.V_SAM_PATCH_EMBD: "v.sam.patch_embd",
1228+
MODEL_TENSOR.V_SAM_PRE_NORM: "v.sam.blk.{bid}.pre_ln",
1229+
MODEL_TENSOR.V_SAM_POST_NORM: "v.sam.blk.{bid}.post_ln",
1230+
MODEL_TENSOR.V_SAM_ATTN_POS_H: "v.sam.blk.{bid}.attn.pos_h",
1231+
MODEL_TENSOR.V_SAM_ATTN_POS_W: "v.sam.blk.{bid}.attn.pos_w",
1232+
MODEL_TENSOR.V_SAM_ATTN_QKV: "v.sam.blk.{bid}.attn.qkv",
1233+
MODEL_TENSOR.V_SAM_ATTN_OUT: "v.sam.blk.{bid}.attn.out",
1234+
MODEL_TENSOR.V_SAM_MLP_LIN_1: "v.sam.blk.{bid}.mlp.lin1",
1235+
MODEL_TENSOR.V_SAM_MLP_LIN_2: "v.sam.blk.{bid}.mlp.lin2",
1236+
MODEL_TENSOR.V_SAM_NECK: "v.sam.neck.{bid}",
1237+
MODEL_TENSOR.V_SAM_NET_2: "v.sam.net_2",
1238+
MODEL_TENSOR.V_SAM_NET_3: "v.sam.net_3",
1239+
MODEL_TENSOR.V_ENC_EMBD_IMGNL: "v.image_newline", # Deepseek-OCR
1240+
MODEL_TENSOR.V_ENC_EMBD_VSEP: "v.view_seperator", # Deepseek-OCR
12021241
# audio (mtmd)
12031242
# note: all audio tensor names must use prefix "a." or "mm.a."
12041243
MODEL_TENSOR.A_ENC_EMBD_POS: "a.position_embd",
@@ -1265,6 +1304,8 @@ class MODEL_TENSOR(IntEnum):
12651304
MODEL_TENSOR.V_ENC_EMBD_PATCH,
12661305
MODEL_TENSOR.V_ENC_EMBD_NORM,
12671306
MODEL_TENSOR.V_ENC_EMBD_POS,
1307+
MODEL_TENSOR.V_ENC_EMBD_IMGNL,
1308+
MODEL_TENSOR.V_ENC_EMBD_VSEP,
12681309
MODEL_TENSOR.V_ENC_INPUT_NORM,
12691310
MODEL_TENSOR.V_ENC_ATTN_QKV,
12701311
MODEL_TENSOR.V_ENC_ATTN_Q,
@@ -1317,6 +1358,19 @@ class MODEL_TENSOR(IntEnum):
13171358
MODEL_TENSOR.V_MM_GATE,
13181359
MODEL_TENSOR.V_TOK_BOI,
13191360
MODEL_TENSOR.V_TOK_EOI,
1361+
MODEL_TENSOR.V_SAM_POS_EMBD,
1362+
MODEL_TENSOR.V_SAM_PATCH_EMBD,
1363+
MODEL_TENSOR.V_SAM_PRE_NORM,
1364+
MODEL_TENSOR.V_SAM_POST_NORM,
1365+
MODEL_TENSOR.V_SAM_ATTN_POS_H,
1366+
MODEL_TENSOR.V_SAM_ATTN_POS_W,
1367+
MODEL_TENSOR.V_SAM_ATTN_QKV,
1368+
MODEL_TENSOR.V_SAM_ATTN_OUT,
1369+
MODEL_TENSOR.V_SAM_MLP_LIN_1,
1370+
MODEL_TENSOR.V_SAM_MLP_LIN_2,
1371+
MODEL_TENSOR.V_SAM_NECK,
1372+
MODEL_TENSOR.V_SAM_NET_2,
1373+
MODEL_TENSOR.V_SAM_NET_3,
13201374
# audio
13211375
MODEL_TENSOR.A_ENC_EMBD_POS,
13221376
MODEL_TENSOR.A_ENC_EMBD_NORM,
@@ -2612,7 +2666,41 @@ class MODEL_TENSOR(IntEnum):
26122666
MODEL_TENSOR.ATTN_Q_B,
26132667
MODEL_TENSOR.ATTN_KV_A_MQA,
26142668
MODEL_TENSOR.ATTN_KV_B,
2669+
MODEL_TENSOR.ATTN_K,
2670+
MODEL_TENSOR.ATTN_K_B,
2671+
MODEL_TENSOR.ATTN_V,
2672+
MODEL_TENSOR.ATTN_V_B,
2673+
MODEL_TENSOR.ATTN_Q_A_NORM,
2674+
MODEL_TENSOR.ATTN_KV_A_NORM,
2675+
MODEL_TENSOR.ATTN_OUT,
2676+
MODEL_TENSOR.ATTN_ROT_EMBD,
2677+
MODEL_TENSOR.FFN_GATE_INP,
2678+
MODEL_TENSOR.FFN_NORM,
2679+
MODEL_TENSOR.FFN_GATE,
2680+
MODEL_TENSOR.FFN_DOWN,
2681+
MODEL_TENSOR.FFN_UP,
2682+
MODEL_TENSOR.FFN_GATE_EXP,
2683+
MODEL_TENSOR.FFN_DOWN_EXP,
2684+
MODEL_TENSOR.FFN_UP_EXP,
2685+
MODEL_TENSOR.FFN_GATE_SHEXP,
2686+
MODEL_TENSOR.FFN_DOWN_SHEXP,
2687+
MODEL_TENSOR.FFN_UP_SHEXP,
2688+
MODEL_TENSOR.FFN_EXP_PROBS_B,
2689+
],
2690+
MODEL_ARCH.DEEPSEEK2OCR: [
2691+
MODEL_TENSOR.TOKEN_EMBD,
2692+
MODEL_TENSOR.OUTPUT_NORM,
2693+
MODEL_TENSOR.OUTPUT,
2694+
MODEL_TENSOR.ROPE_FREQS,
2695+
MODEL_TENSOR.ATTN_NORM,
2696+
MODEL_TENSOR.ATTN_Q,
2697+
MODEL_TENSOR.ATTN_Q_A,
2698+
MODEL_TENSOR.ATTN_Q_B,
2699+
MODEL_TENSOR.ATTN_KV_A_MQA,
2700+
MODEL_TENSOR.ATTN_KV_B,
2701+
MODEL_TENSOR.ATTN_K,
26152702
MODEL_TENSOR.ATTN_K_B,
2703+
MODEL_TENSOR.ATTN_V,
26162704
MODEL_TENSOR.ATTN_V_B,
26172705
MODEL_TENSOR.ATTN_Q_A_NORM,
26182706
MODEL_TENSOR.ATTN_KV_A_NORM,
@@ -3741,6 +3829,10 @@ class MODEL_TENSOR(IntEnum):
37413829
MODEL_TENSOR.ROPE_FREQS,
37423830
MODEL_TENSOR.ATTN_ROT_EMBD,
37433831
],
3832+
MODEL_ARCH.DEEPSEEK2OCR: [
3833+
MODEL_TENSOR.ROPE_FREQS,
3834+
MODEL_TENSOR.ATTN_ROT_EMBD,
3835+
],
37443836
MODEL_ARCH.CHATGLM: [
37453837
MODEL_TENSOR.ROPE_FREQS,
37463838
],
@@ -3938,6 +4030,7 @@ class VisionProjectorType:
39384030
LIGHTONOCR = "lightonocr"
39394031
COGVLM = "cogvlm"
39404032
JANUS_PRO = "janus_pro"
4033+
DEEPSEEKOCR = "deepseekocr"
39414034
LFM2A = "lfm2a" # audio
39424035
MUSIC_FLAMINGO = "musicflamingo" # audio
39434036
GLM4V = "glm4v"

gguf-py/gguf/gguf_writer.py

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1218,6 +1218,15 @@ def add_vision_is_deepstack_layers(self, layers: Sequence[bool]) -> None:
12181218
def add_vision_window_size(self, value: int) -> None:
12191219
self.add_uint32(Keys.ClipVision.WINDOW_SIZE, value)
12201220

1221+
def add_vision_sam_layers_count(self, value: int) -> None:
1222+
self.add_uint32(Keys.ClipVision.SAM.BLOCK_COUNT, value)
1223+
1224+
def add_vision_sam_embedding_length(self, value: int) -> None:
1225+
self.add_uint32(Keys.ClipVision.SAM.EMBEDDING_LENGTH, value)
1226+
1227+
def add_vision_sam_head_count(self, value: int) -> None:
1228+
self.add_uint32(Keys.ClipVision.SAM.HEAD_COUNT, value)
1229+
12211230
# audio models
12221231

12231232
def add_clip_audio_projector_type(self, value: str) -> None:

0 commit comments

Comments
 (0)