
models : deduplicate delta-net graphs for Qwen family#19597

Merged
ggerganov merged 3 commits intomasterfrom
gg/qwen3-dedup
Feb 16, 2026
Conversation

@ggerganov
Member

@ggerganov ggerganov commented Feb 13, 2026

cont #19375

  • Add llm_build_delta_net_base for common delta net builds. Currently used only by qwen3next
  • Rename llm_graph_context_mamba -> llm_build_mamba_base
  • Includes clean-up

Next PRs:

  • Use llm_build_delta_net_base for qwen35 and qwen35moe (after release)
  • Adapt and use llm_build_delta_net_base for kimi_linear (cc @ymcki)

@github-actions github-actions bot added the model Model specific label Feb 13, 2026
@pwilkin
Collaborator

pwilkin commented Feb 13, 2026

@ggerganov out of curiosity - what's the reason for moving the identity / mask constructions from once-per-graph to once-per-layer?

@ggerganov
Member Author

> @ggerganov out of curiosity - what's the reason for moving the identity / mask constructions from once-per-graph to once-per-layer?

Mainly wanted to get rid of the ggml_new_tensor() calls. In this case it does not cause any issues, but it can become problematic if the same tensor created with ggml_new_tensor() is used by different ops that run on different backends (for example, due to offloading or missing support). As I mentioned, it's always better to find alternatives that don't involve creating new tensors.

Btw, creating a mask and then multiplying has no advantage compared to just masking the data directly (with ggml_tri()), so it seems simpler too.

Base automatically changed from gg/qwen3-next-opt to master February 14, 2026 10:57
@ggerganov ggerganov marked this pull request as ready for review February 14, 2026 12:27
@ggerganov ggerganov requested a review from CISC as a code owner February 14, 2026 12:27
@ymcki
Contributor

ymcki commented Feb 14, 2026

Is this the new home for the unified delta-net effort? Is my job to modify build_delta_net_chunking to use an "if (is_kda) {} else {}" branch for KDA and GDN?

@ggerganov
Member Author

Yes, we'll merge this PR soon, and then the new llm_build_delta_net_base should be adapted to work with both GDN and KDA. Ideally, KDA would add as few extra ops as possible - likely utilizing broadcasts where possible. The KDA support should be a standalone change, without any additional optimizations, in order to keep things simple.

@ggerganov ggerganov mentioned this pull request Feb 16, 2026
2 tasks
@ggerganov ggerganov merged commit cc45f2a into master Feb 16, 2026
81 checks passed
@ggerganov ggerganov deleted the gg/qwen3-dedup branch February 16, 2026 12:35
@ymcki
Contributor

ymcki commented Feb 16, 2026

@ggerganov, I am done with adding kimi_linear to delta-net-base.cpp.

https://github.com/ymcki/llama.cpp/tree/dn

The original delta net path is the same, except that I added `o = ggml_cont(ctx0, o);` after the `ggml_permute` at the end.

This is to avoid the assertion error:

```
/home/user/gguf/gg/llama.cpp/ggml/src/ggml.c:3573: GGML_ASSERT(ggml_is_contiguous(a)) failed
```

raised by:

```c
ggml_tensor * attn_out_final = ggml_reshape_3d(ctx0, output, head_dim, n_head, n_seq_tokens * n_seqs);
```

Should I open another PR, or will you do that yourself?

@ggerganov
Member Author

Please open a PR. I will attempt to simplify the implementation - hoping we can unify this and avoid branching on the g dimensions. Not sure if possible though - we'll see.

michaelneale added a commit to michaelneale/llama.cpp that referenced this pull request Feb 17, 2026
* upstream/master: (88 commits)
  ci : bump komac version (ggml-org#19682)
  build : link ws2_32 as PUBLIC on Windows (ggml-org#19666)
  build : cleanup library linking logic (ggml-org#19665)
  convert : add JoyAI-LLM-Flash (ggml-org#19651)
  perplexity: add proper batching (ggml-org#19661)
  common : inline functions (ggml-org#18639)
  ggml : make `ggml_is_view` as API (ggml-org#19539)
  model: Add support for Tiny Aya Models (ggml-org#19611)
  build : rework llama_option_depr to handle LLAMA_CURL (ggml-org#19658)
  Adjust workaround for ROCWMMA_FATTN/GFX9 to only newer ROCm veresions (ggml-org#19591)
  models : deduplicate delta-net graphs for Qwen family (ggml-org#19597)
  graph : fix KQ mask, lora, cvec reuse checks (ggml-org#19644)
  ggml: aarch64: Implement SVE in Gemm q4_k 8x8 q8_k Kernel  (ggml-org#19132)
  sync : ggml
  ggml : bump version to 0.9.7 (ggml/1425)
  ggml : bump version to 0.9.6 (ggml/1423)
  cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization (ggml-org#19624)
  docs: update s390x build docs (ggml-org#19643)
  build : remove LLAMA_HTTPLIB option (ggml-org#19623)
  cmake : check if KleidiAI API has been fetched (ggml-org#19640)
  ...
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
* models : add llm_build_delta_net_base

* cont : keep qwen35 and qwen35moe graphs intact

* cont : add comments
