
Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#18420

Hello @ngxson, I'm back! How does this look for the first PR? I'm open to any feedback.

Original Model: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct
GGUFs: https://huggingface.co/TrevorJS/Qwen3-Omni-30B-A3B-GGUF

This PR implements only the Thinker model, providing text -> text inference.

thinker-f16 on dgx-spark:

| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |          pp2048 |      1856.94 ± 11.77 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |            tg32 |         34.88 ± 0.06 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |  pp2048 @ d4096 |       1692.98 ± 4.34 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |    tg32 @ d4096 |         32.07 ± 0.12 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |  pp2048 @ d8192 |       1552.70 ± 1.64 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |    tg32 @ d8192 |         29.64 ± 0.14 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 | pp2048 @ d16384 |       1304.71 ± 2.41 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |   tg32 @ d16384 |         26.26 ± 0.03 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 | pp2048 @ d32768 |       1001.73 ± 1.68 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |   tg32 @ d32768 |         21.43 ± 0.02 |
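
(Column key for the llama-bench output above: pp2048 is prompt processing of a 2048-token prompt, tg32 is generation of 32 tokens, and @ dN means the run starts from a KV cache already filled to depth N; ngl is the number of layers offloaded to the GPU, n_ubatch the physical batch size, fa flash attention, and mmap memory-mapped loading.)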
Loading model...


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b0-unknown
model      : thinker-f16.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> Why write smaller PRs? Respond with less than 10 words.

Easier to review, test, and merge quickly.

[ Prompt: 68.6 t/s | Generation: 31.5 t/s ]

>

AI Disclosure

AI was used to write this code, but it was then reviewed, tested, and benchmarked by a human!

Add support for Qwen3-Omni Thinker, a 48-layer MoE model with 128 experts
(8 active per token) and an optional shared expert. This enables text-only
inference as the foundation for full multimodal support.
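
To make the routing concrete, here is a minimal NumPy sketch of what "128 experts, 8 active per token, plus an optional shared expert" means. The names, shapes, and the top-k renormalization detail are illustrative assumptions, not the llama.cpp internals:

```python
# Minimal sketch of MoE routing with an optional shared expert.
# Assumptions: dense gating, softmax-then-top-k, weights renormalized
# over the selected experts; the real model's details may differ.
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_ffn(x, gate_w, experts, shared_expert=None, n_active=8):
    # x: [n_tokens, d_model], gate_w: [d_model, n_experts],
    # experts: list of callables mapping [d_model] -> [d_model]
    probs = softmax(x @ gate_w)                 # routing probabilities
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(-probs[t])[:n_active]  # top-8 expert indices
        w = probs[t, top]
        w = w / w.sum()                         # renormalize over top-k
        for eid, wi in zip(top, w):
            out[t] += wi * experts[eid](x[t])
    if shared_expert is not None:               # optional shared expert,
        out = out + shared_expert(x)            # applied to every token
    return out
```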

Key changes:
- New architecture: LLM_ARCH_QWEN3OMNIMOE
- GGUF conversion with nested thinker_config handling (see the sketch after this list)
- IMRoPE (Interleaved M-RoPE) with sections [24, 20, 20, 0] (also sketched below)
- Shared expert support in qwen3vl-moe graph builder
- Reuses llm_build_qwen3vlmoe for graph construction
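
To illustrate the two conversion-related items above, here is a hypothetical Python sketch of unwrapping the nested thinker_config and of how the IMRoPE sections could partition the rotary dimensions. Key names and the interleaving order are assumptions based on the HF config layout, not the actual convert_hf_to_gguf.py code:

```python
# Qwen3-Omni's config.json nests the text ("thinker") model's
# hyperparameters, so the converter must unwrap them before following
# the usual Qwen3-MoE path. Key names here are assumptions.
import json

def load_thinker_text_config(config_path: str) -> dict:
    with open(config_path) as f:
        cfg = json.load(f)
    thinker = cfg.get("thinker_config", cfg)    # unwrap nested thinker config
    return thinker.get("text_config", thinker)  # text hparams, if nested

# IMRoPE sections [24, 20, 20, 0] split the rotary half-dimension
# (24 + 20 + 20 + 0 = 64 pairs for head_dim = 128) among the temporal,
# height, width, and unused extra position components. Interleaved
# M-RoPE cycles components across dimensions instead of using
# contiguous blocks; this round-robin layout is a sketch, and the
# exact scheme in the implementation may differ.
def imrope_components(sections=(24, 20, 20, 0)):
    remaining, order = list(sections), []
    while sum(remaining) > 0:
        for comp, left in enumerate(remaining):
            if left > 0:
                order.append(comp)   # 0 = temporal, 1 = height, 2 = width
                remaining[comp] -= 1
    return order                     # len(order) == sum(sections)
```
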
@loci-agentic-ai

Explore the complete analysis inside the Version Insights

I've retrieved the summary report for your project. It shows a performance analysis for Pull Request #725 in the llama.cpp repository (auroralabs-loci).

Key Highlights:

  1. Most Critical Issue: The std::vector::end() function shows a significant 226% increase in response time (from 81.11 ns to 264.40 ns, roughly 3.3x slower)

  2. Affected Areas: Most performance impacts are in STL container operations, particularly:

    • Vector operations
    • Hash table operations
    • Tree and deque operations
  3. Interesting Pattern: While response times increased, throughput also increased in most cases, which might indicate changes in parallelization or workload distribution

  4. Top Recommendation: Investigate changes to vector iteration patterns and STL container usage introduced in PR #725 (upstream PR ggml-org/llama.cpp#18420: model: add Qwen3-Omni Thinker support (qwen3omnimoe))

Would you like me to provide more detailed information about any specific function or aspect of this performance analysis?

loci-dev force-pushed the main branch 9 times, most recently from f2e8c7f to b3f45e1 on December 29, 2025 at 06:15