[Model] Qwen3.5 dense and MoE support (no vision)#19435

Merged
pwilkin merged 16 commits into ggml-org:master from pwilkin:qwen35
Feb 8, 2026
pwilkin commented Feb 8, 2026

Member

I've gotten a bit tired of Llama.cpp missing all the zero-day releases, so this time I decided to make (or, more precisely, instructed Opus 4.6 to make, based on reference implementations and my guidelines for model adaptation) a conversion based on the Transformers PR (https://github.com/huggingface/transformers/pull/43830/changes). It's mostly based on Qwen3Next, but it's rebased on the common-delta-net PR (#19125).

Here are the mock models I generated to test it: https://huggingface.co/ilintar/qwen35_testing/tree/main

Here are the conversion results from causal-verify-logits:

| Model | NMSE | NMSE (dB) | Result |
|-------|------|-----------|--------|
| Dense | 8.94e-06 | -50.49 dB | Excellent |
| MoE | 9.36e-05 | -40.29 dB | Excellent |
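
For readers unfamiliar with the metric, here is a minimal sketch of how an NMSE figure and its dB value relate. The normalization used in `nmse` (squared error divided by the power of the reference logits) is an assumption for illustration; the actual causal-verify-logits script may define it differently. The dB conversion is plain arithmetic and reproduces the two dB values in the table above.

```python
import math


def nmse(reference, candidate):
    """Normalized mean squared error between two equal-length logit lists.

    Assumption: error power is normalized by the reference power.
    """
    if len(reference) != len(candidate):
        raise ValueError("logit vectors must have the same length")
    err = sum((r - c) ** 2 for r, c in zip(reference, candidate))
    power = sum(r ** 2 for r in reference)
    return err / power


def to_db(ratio):
    """Convert a power ratio such as NMSE to decibels."""
    return 10.0 * math.log10(ratio)


# The NMSE values reported in the table, converted to dB.
dense_db = round(to_db(8.94e-06), 2)
moe_db = round(to_db(9.36e-05), 2)
print(dense_db, moe_db)
```

A lower (more negative) dB value means the converted model's logits sit closer to the reference implementation's logits.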

github-actions bot added the `model` (Model specific) and `python` (python script changes) labels on Feb 8, 2026
@ggerganov

Member

> instructed Opus 4.6 to make, based on reference implementations and my guidelines for model adaptation

A bit off-topic, but I'm curious to try the same task with a local model. Maybe GLM 4.7 Flash + OpenCode. Do you have the guidelines that you used somewhere shareable?

@pwilkin

pwilkin commented Feb 8, 2026

Member Author

> A bit off-topic, but I'm curious to try the same task with a local model. Maybe GLM 4.7 Flash + OpenCode. Do you have the guidelines that you used somewhere shareable?

I'm using my "adding model architectures" tutorial for this actually (#16770) + got an extra rules section in my agent MD reminding about the tensor layout:

## Tensor format

A very important caveat about tensor format in the GGML library: the GGML library uses a different format internally than most Python implementations,
including PyTorch or Transformers. Notably, in the GGML library:

* tensors are restricted to 4 dimensions - whenever you want to use more dimensions you have to pack them
* the semantic order of dimensions is reversed from PyTorch/Transformers - the *last two* dimensions are used for `[tokens_per_batch, batches]` - even though the physical layout is the same

So, for example, a `[1, 5, 4096, 1]` tensor in PyTorch will probably become a `[1, 4096, 5, 1]` tensor in GGML. This is especially important
when converting certain implementations from a reference implementation written in Python, because the tensor dimension ordering will be different,
but they can be semantically the same.
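
The dimension reversal described in this rule can be sketched in a few lines of Python. This is an illustration only, not code from the PR: the helper names `torch_shape_to_ggml_ne` and `pack_leading_dims` are hypothetical, and the sketch assumes that "packing" means merging the leading (slowest-varying) dimensions of a contiguous tensor.

```python
from math import prod

GGML_MAX_DIMS = 4  # ggml tensors carry at most four dimensions


def torch_shape_to_ggml_ne(shape):
    """Reverse a PyTorch-style shape and pad it with 1s to four entries."""
    if len(shape) > GGML_MAX_DIMS:
        raise ValueError("more than 4 dimensions: pack them first")
    ne = list(reversed(shape))
    return ne + [1] * (GGML_MAX_DIMS - len(ne))


def pack_leading_dims(shape, max_dims=GGML_MAX_DIMS):
    """Merge leading dimensions until the shape has at most max_dims entries."""
    if len(shape) <= max_dims:
        return list(shape)
    extra = len(shape) - max_dims + 1
    return [prod(shape[:extra])] + list(shape[extra:])


# The example from the rule above: the order is reversed, the data is not moved.
print(torch_shape_to_ggml_ne([1, 5, 4096, 1]))
# A 5-D shape has to be packed before it can be expressed in ggml.
print(torch_shape_to_ggml_ne(pack_leading_dims([2, 3, 4, 5, 6])))
```

Only the bookkeeping changes here; the bytes in memory stay in the same order, which is the sense in which the physical layout is the same.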

The prompt I used this time:

"In the transformers directory, I have included a new version of Transformers that includes support for the new Qwen3.5 series of models (MoE and dense). @transformers/src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py @transformers/src/transformers/models/qwen3_5/modeling_qwen3_5.py 

I have also created a script and generated mock models for testing, the script is in @transformers/generate_qwen_models.py and has been already run.

Based on the implementation, which seems to be based strongly on the Qwen3 Next architecture, currently handled in Llama.cpp in @reference/qwen3next.cpp and in @src/models/delta.cpp, please create the implementation and conversion code for the new architectures. There are scripts in @examples/model-conversion/Makefile for testing conversions of new models. 

See @.roo/rules/adding_new_models.md for tips on how to add new model support. Make sure to reuse as much existing code as possible. For now, only add text support, without the multimodal capabilities."

EDIT: Key thing was also interrupting the model when it starts to do something stupid, for example modifying permutation rules in delta_net because it added incorrect extra permutations in conversion. Even with Opus 4.6 I had to do it like 3 or 4 times to stop it from going in the wrong direction.

Oh yeah, I also made some agents that I feel work well with Llama.cpp for OpenCode (I've used them in fact for some documentation work), so I might share those too:

deep-code-modifier.md
code-architecture-analyzer.md

@CISC

CISC commented Feb 8, 2026

Member

> Oh yeah, I also made some agents that I feel work well with Llama.cpp for OpenCode (I've used them in fact for some documentation work), so I might share those too:

/me sits back and waits for the automated PR submissions whenever a new model pops up...

@ggerganov

ggerganov commented Feb 8, 2026

Member

> > Oh yeah, I also made some agents that I feel work well with Llama.cpp for OpenCode (I've used them in fact for some documentation work), so I might share those too:
>
> /me sits back and waits for the automated PR submissions whenever a new model pops up...

github actions workflow, runs on self-hosted mac mini, accepts vllm/transformers PR as input

@am17an

am17an commented Feb 8, 2026

Contributor

I think it may now make sense to have a dedicated operator for delta_net to eliminate all those cpy_scalars.

Comment thread convert_hf_to_gguf.py Outdated
Comment thread gguf-py/gguf/tensor_mapping.py Outdated
Comment thread src/models/qwen3-5moe.cpp Outdated
@pwilkin

pwilkin commented Feb 8, 2026

Member Author

@CISC actually I just made the MoE class inherit from the normal one since most of the code is duplicated :)

@pwilkin

pwilkin commented Feb 8, 2026

Member Author

@CISC BTW reportedly super(class, self).method(...) is the preferred way to call superclasses under multiple inheritance over superclass.method(self, ...)

@CISC

CISC commented Feb 8, 2026

Member

> @CISC BTW reportedly super(class, self).method(...) is the preferred way to call superclasses under multiple inheritance over superclass.method(self, ...)

No, absolutely not, it does not do what you think it does.

Edit: As a thought experiment; super() is shorthand for super(MyClass, self).

@pwilkin

pwilkin commented Feb 8, 2026

Member Author

> No, absolutely not, it does not do what you think it does.

WDYM? I thought super(AClass, self).method means basically "call method for superclass of (self interpreted as instance of AClass)", isn't that what it does?

@CISC

CISC commented Feb 8, 2026

Member

Not quite (in that case super() would call its own method).

Comment thread src/models/qwen3-5.cpp Outdated
@pwilkin

pwilkin commented Feb 8, 2026

Member Author

> Not quite (in that case super() would call its own method).

But why, if it calls the method of the "superclass of AClass"?

The way I understand it:

  • under single inheritance, if A is a subclass of B, then super(A, self).method(...) is equivalent to B.method(self, ...)
  • under multiple inheritance, if A is a subclass of B and C, then super(A, self).method(...) means calling both B.method and C.method if they exist

Am I misunderstanding something?
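
As a side note on the multiple-inheritance case, the following sketch (class names are made up for illustration, not taken from the conversion script) shows what Python itself does: `super(A, self).method()` dispatches to the next class after `A` in the method resolution order that defines `method`. It reaches further classes only if each method in the chain cooperatively calls `super()` as well.

```python
class B:
    def method(self):
        return ["B"]  # does not call super(), so the chain stops here


class C:
    def method(self):
        return ["C"]


class A(B, C):
    def method(self):
        # Resolves to B.method only: B comes right after A in the MRO.
        return super(A, self).method()


class CoopB:
    def method(self):
        # Cooperative variant: passes the call on to the next class in the MRO.
        return ["CoopB"] + super().method()


class CoopA(CoopB, C):
    def method(self):
        return ["CoopA"] + super().method()


mro_names = [cls.__name__ for cls in A.__mro__]
print(mro_names)
print(A().method())
print(CoopA().method())
```

In the first hierarchy `C.method` is never reached, because `B.method` does not forward the call.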

@CISC

CISC commented Feb 8, 2026

Member

> Am I misunderstanding something?

I'll repeat the example. :)

super().method -> super(MyClass, self).method, ie calling the method of the parent of MyClass.
super(ParentClass, self).method, calls which method? :)
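
The question above can be checked with a small, self-contained sketch. The classes `Base`, `Parent` and `Child` are hypothetical stand-ins, not the converter classes from this PR. The first argument to `super()` is the class after which the lookup starts, so passing the parent class skips the parent's own method.

```python
class Base:
    def method(self):
        return ["Base"]


class Parent(Base):
    def method(self):
        return ["Parent"] + super().method()


class Child(Parent):
    def implicit(self):
        # Plain super() here is shorthand for super(Child, self).
        return super().method()

    def explicit_parent(self):
        # Lookup starts after Parent in the MRO, so Parent.method is skipped.
        return super(Parent, self).method()

    def direct(self):
        # Explicit call to the parent's method, independent of the MRO start.
        return Parent.method(self)


c = Child()
print(c.implicit())
print(c.explicit_parent())
print(c.direct())
```

Here `explicit_parent` never runs `Parent.method`, whereas `implicit` and `direct` both do.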

@pwilkin

pwilkin commented Feb 9, 2026

Member Author

@ggerganov I checked the perplexity and it looks fine, can you please verify? I rebased this on the fixed version of the delta_net branch, so it should be correct.

Yes, both Metal and CUDA produce higher PPL compared to before the change.

Just checked perplexity and it seems OK

Is it the same as before the PR?

Seems so, but I will test again.

@ggerganov

Member

It looks fine, but the problem is that it is different and significantly higher.

# new
Final estimate: PPL = 5.7395 +/- 0.35952

# old (e06088da0fa86aa444409f38dff274904931c507)
Final estimate: PPL = 5.1777 +/- 0.31137

It should not change. Also, try to perform the test from #19305 - it fails.
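
To put the two estimates in perspective, here is a minimal sketch of the usual perplexity definition (the exponential of the mean per-token negative log-likelihood) and of the relative change between the two runs quoted above. The helper names are illustrative and not part of llama.cpp; the way the tool computes its `+/-` estimate is not modeled here.

```python
import math


def perplexity(nlls):
    """Perplexity as exp of the mean negative log-likelihood per token."""
    return math.exp(sum(nlls) / len(nlls))


def relative_change(new, old):
    """Fractional change of `new` with respect to `old`."""
    return (new - old) / old


# The two final estimates quoted in the comment above.
ppl_new = 5.7395
ppl_old = 5.1777
change = relative_change(ppl_new, ppl_old)
print(f"{change:.2%}")
```

This works out to a rise of roughly 11% for a change that, per the comment above, should leave the value unchanged.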

@pwilkin

pwilkin commented Feb 9, 2026

Member Author

@ggerganov Verifying rn.

@ggerganov

Member

@pwilkin If you need more time to locate and fix the problem, maybe it would be better to revert the PR for now and take the time to support this properly. No need to rush it. WDYT?

@pwilkin

pwilkin commented Feb 9, 2026

Member Author

@ggerganov give me half an hour, if I don't find it then we can do that instead, OK?

@ggerganov

Member

I just don't have the confidence that these changes are good. The branch that you rebased on (#19125) was closed, and it is not clear at all that it is working correctly. IMO the cleanest thing is to go back.

@pwilkin

pwilkin commented Feb 9, 2026

Member Author

@ggerganov Okay, you're right, I'll prepare a new one after you revert.

@ggerganov

Member

Ok, I'll open the revert now.

@mirek190


Have you tried using GPT Codex 5.3 xhigh with codex-cli, as it is better for complex coding?

@pwilkin

pwilkin commented Feb 10, 2026

Member Author

@mirek190 I think they're comparable, but I don't have an active Codex sub atm, just Claude.

@mirek190


According to Matt Maher

https://www.youtube.com/watch?v=hwvyew2iXpU&t=937s

In his real-world usage tests, he got 86% with Opus 4.6 high and 95% with Codex 5.3 xhigh, and he claims Codex is smarter on complex code.
That's something I also noticed. It seems Codex is better for complex tasks than Opus 4.6 and is absurdly cheap, because for 20 USD you can now work a whole week.

I'm just saying...

liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
* Unified delta net handling

* Remove old methods.

* Refactor and optimize

* Adapt autoregressive version from @ymcki

* Change to decay mask approach

* Fix bad permute

* Qwen 3.5 support

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Further fixes

* Use inheritance, remove unneeded conts

* Not like this!

* Remove ggml.h explicit import

* Remove transformers, fix the views

* ACTUALLY fix views, make super calls explicit in conversion.

* Fix conversion again

* Remove extra ggml.h imports

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
my-other-github-account pushed a commit to my-other-github-account/llama.cpp that referenced this pull request May 15, 2026
my-other-github-account pushed a commit to my-other-github-account/llama.cpp that referenced this pull request May 15, 2026
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026