Description
I'm running the multi-part GGUF model `Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf` (80B params, Q4_K_M quantization) with KoboldCpp v1.107 on an NVIDIA RTX 3070 (16GB VRAM), but I'm facing severe performance issues under the following constraints (a rough reconstruction of the launch follows this list):
- GPU layer limit: setting `--gpulayers` higher than 15 causes out-of-memory (OOM) errors on my 16GB GPU, so I have to keep it at 15.
- Batch size adjustment: I originally used `--batchsize 512`, then reduced it to 32 to mitigate the issues, but the core problems remain.
- CUDA graphs disabled: the logs consistently show `record_update: disabling CUDA graphs due to too many consecutive updates`.
- CPU-dominant inference: with only 15/48 layers on the GPU, the model runs almost entirely on the CPU, making inference extremely slow, even when processing 6144 tokens.
- Frequent state writes: I see repeated `state_write_data: writing state / writing memory module` logs during inference, which further degrades performance.
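For reference, a rough reconstruction of how the model is launched, written as a Python subprocess sketch; the executable name and model path are placeholders for my local setup, and the flag spellings are taken from the report above, so adjust them to whatever your KoboldCpp build actually accepts:

```python
import subprocess

# Sketch of the launch described above (not my exact command line).
# Pointing at the first shard of a multi-part GGUF loads the remaining parts automatically.
cmd = [
    "koboldcpp.exe",                                            # placeholder path to the KoboldCpp binary
    "--model", "Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf",   # first shard of the split model
    "--gpulayers", "15",                                        # anything above 15 OOMs on this GPU
    "--batchsize", "32",                                        # reduced from the original 512
]
subprocess.run(cmd, check=True)
```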
Environment
- GPU: NVIDIA RTX 3070 (16GB VRAM, compute capability 8.6)
- KoboldCpp version: 1.107
- OS: Windows 11 64-bit
- Model: Qwen3-Coder-Next-Q4_K_M (multi-part GGUF, 4.86 BPW, 80B parameters)
Question
Given my hardware constraint (16GB VRAM, `--gpulayers` can only be set to 15 without OOM), could you guide me on how to configure KoboldCpp parameters (including but not limited to batch size, context size, CUDA-related flags, and memory optimization settings) to:
- Fix the "CUDA graphs disabled due to too many consecutive updates" issue?
- Reduce CPU usage and maximize GPU utilization within the 15 `--gpulayers` limit?
- Eliminate the frequent `state_write_data` logs and improve inference speed for this Qwen3-Coder-Next model?
Any general tuning principles or parameter strategies for 80B Q4_K_M models on 16GB GPUs would be highly appreciated.
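To make different settings comparable, generation speed can be measured with a small script along the lines of the sketch below. It assumes KoboldCpp's KoboldAI-compatible `/api/v1/generate` endpoint on the default port 5001; the prompt and token count are arbitrary.

```python
import json
import time
import urllib.request

# Minimal throughput check for comparing KoboldCpp settings.
# Assumes the server is running locally with its default HTTP API on port 5001.
URL = "http://localhost:5001/api/v1/generate"
payload = {
    "prompt": "Write a short docstring for a binary search function.",
    "max_length": 128,   # tokens to generate for the timing run
    "temperature": 0.7,
}

start = time.time()
req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)
elapsed = time.time() - start

text = result["results"][0]["text"]
# Rough tokens/sec: assumes close to max_length tokens were actually produced
# (generation may stop earlier on an end-of-sequence token).
print(f"Generated up to {payload['max_length']} tokens in {elapsed:.1f}s "
      f"(~{payload['max_length'] / elapsed:.2f} tok/s)")
print(text[:200])
```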