
Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#18420

Hello @ngxson, I'm back! How does this look for the first PR? I'm open to any feedback.

Original Model: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct
GGUFs: https://huggingface.co/TrevorJS/Qwen3-Omni-30B-A3B-GGUF

This PR implements only the Thinker model, providing text -> text inference.

thinker-f16 on dgx-spark:

| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |          pp2048 |      1856.94 ± 11.77 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |            tg32 |         34.88 ± 0.06 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |  pp2048 @ d4096 |       1692.98 ± 4.34 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |    tg32 @ d4096 |         32.07 ± 0.12 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |  pp2048 @ d8192 |       1552.70 ± 1.64 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |    tg32 @ d8192 |         29.64 ± 0.14 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 | pp2048 @ d16384 |       1304.71 ± 2.41 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |   tg32 @ d16384 |         26.26 ± 0.03 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 | pp2048 @ d32768 |       1001.73 ± 1.68 |
| qwen3omnimoe 30B F16           |  56.90 GiB |    30.53 B | CUDA       |  99 |     2048 |  1 |    0 |   tg32 @ d32768 |         21.43 ± 0.02 |
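
(Column key for the llama-bench output above: pp2048 is prompt processing of a 2048-token prompt, tg32 is generation of 32 tokens, and @ dN means the run starts from a KV cache already filled to depth N; ngl is the number of layers offloaded to the GPU, n_ubatch the physical batch size, fa flash attention, and mmap memory-mapped loading.)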
Loading model...


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b0-unknown
model      : thinker-f16.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> Why write smaller PRs? Respond with less than 10 words.

Easier to review, test, and merge quickly.

[ Prompt: 68.6 t/s | Generation: 31.5 t/s ]

>

AI Disclosure

AI was used to write this code, but it was then reviewed, tested, and benchmarked by a human!

Add support for Qwen3-Omni Thinker, a 48-layer MoE model with 128 experts
(8 active per token) and an optional shared expert. This enables text-only
inference as the foundation for full multimodal support.
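
To make the routing concrete, here is a minimal NumPy sketch of what "128 experts, 8 active per token, plus an optional shared expert" means. The names, shapes, and the top-k renormalization detail are illustrative assumptions, not the llama.cpp internals:

```python
# Minimal sketch of MoE routing with an optional shared expert.
# Assumptions: dense gating, softmax-then-top-k, weights renormalized
# over the selected experts; the real model's details may differ.
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_ffn(x, gate_w, experts, shared_expert=None, n_active=8):
    # x: [n_tokens, d_model], gate_w: [d_model, n_experts],
    # experts: list of callables mapping [d_model] -> [d_model]
    probs = softmax(x @ gate_w)                 # routing probabilities
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(-probs[t])[:n_active]  # top-8 expert indices
        w = probs[t, top]
        w = w / w.sum()                         # renormalize over top-k
        for eid, wi in zip(top, w):
            out[t] += wi * experts[eid](x[t])
    if shared_expert is not None:               # optional shared expert,
        out = out + shared_expert(x)            # applied to every token
    return out
```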

Key changes:
- New architecture: LLM_ARCH_QWEN3OMNIMOE
- GGUF conversion with nested thinker_config handling (see the sketch after this list)
- IMRoPE (Interleaved M-RoPE) with sections [24, 20, 20, 0] (also sketched below)
- Shared expert support in qwen3vl-moe graph builder
- Reuses llm_build_qwen3vlmoe for graph construction
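
To illustrate the two conversion-related items above, here is a hypothetical Python sketch of unwrapping the nested thinker_config and of how the IMRoPE sections could partition the rotary dimensions. Key names and the interleaving order are assumptions based on the HF config layout, not the actual convert_hf_to_gguf.py code:

```python
# Qwen3-Omni's config.json nests the text ("thinker") model's
# hyperparameters, so the converter must unwrap them before following
# the usual Qwen3-MoE path. Key names here are assumptions.
import json

def load_thinker_text_config(config_path: str) -> dict:
    with open(config_path) as f:
        cfg = json.load(f)
    thinker = cfg.get("thinker_config", cfg)    # unwrap nested thinker config
    return thinker.get("text_config", thinker)  # text hparams, if nested

# IMRoPE sections [24, 20, 20, 0] split the rotary half-dimension
# (24 + 20 + 20 + 0 = 64 pairs for head_dim = 128) among the temporal,
# height, width, and unused extra position components. Interleaved
# M-RoPE cycles components across dimensions instead of using
# contiguous blocks; this round-robin layout is a sketch, and the
# exact scheme in the implementation may differ.
def imrope_components(sections=(24, 20, 20, 0)):
    remaining, order = list(sections), []
    while sum(remaining) > 0:
        for comp, left in enumerate(remaining):
            if left > 0:
                order.append(comp)   # 0 = temporal, 1 = height, 2 = width
                remaining[comp] -= 1
    return order                     # len(order) == sum(sections)
```
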
@loci-agentic-ai

Explore the complete analysis inside the Version Insights

I've retrieved the summary report for your project. It shows a performance analysis for Pull Request #725 in the llama.cpp repository (auroralabs-loci).

Key Highlights:

  1. Most Critical Issue: The std::vector::end() function shows a significant 226% increase in response time (from 81.11 ns to 264.40 ns, roughly 3.3x slower)

  2. Affected Areas: Most performance impacts are in STL container operations, particularly:

    • Vector operations
    • Hash table operations
    • Tree and deque operations
  3. Interesting Pattern: While response times increased, throughput also increased in most cases, which might indicate changes in parallelization or workload distribution

  4. Top Recommendation: Investigate changes to vector iteration patterns and STL container usage introduced in PR #725 (upstream PR ggml-org/llama.cpp#18420: model: add Qwen3-Omni Thinker support (qwen3omnimoe))

Would you like me to provide more detailed information about any specific function or aspect of this performance analysis?

loci-dev force-pushed the main branch 9 times, most recently from f2e8c7f to b3f45e1 on December 29, 2025 at 06:15