Description
I'm running the multi-part GGUF model `Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf` (80B params, Q4_K_M quantization) with KoboldCpp v1.107 on an NVIDIA RTX 3070 (16GB VRAM), but I'm facing severe performance issues under the following constraints (a rough reconstruction of the launch follows this list):
- GPU layer limit: setting `--gpulayers` higher than 15 causes out-of-memory (OOM) errors on my 16GB GPU, so I have to keep it at 15.
- Batch size adjustment: I originally used `--batchsize 512`, then reduced it to 32 to mitigate the issues, but the core problems remain.
- CUDA graphs disabled: the logs consistently show `record_update: disabling CUDA graphs due to too many consecutive updates`.
- CPU-dominant inference: with only 15/48 layers on the GPU, the model runs almost entirely on the CPU, making inference extremely slow, even when processing 6144 tokens.
- Frequent state writes: I see repeated `state_write_data: writing state / writing memory module` logs during inference, which further degrades performance.
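For reference, a rough reconstruction of how the model is launched, written as a Python subprocess sketch; the executable name and model path are placeholders for my local setup, and the flag spellings are taken from the report above, so adjust them to whatever your KoboldCpp build actually accepts:

```python
import subprocess

# Sketch of the launch described above (not my exact command line).
# Pointing at the first shard of a multi-part GGUF loads the remaining parts automatically.
cmd = [
    "koboldcpp.exe",                                            # placeholder path to the KoboldCpp binary
    "--model", "Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf",   # first shard of the split model
    "--gpulayers", "15",                                        # anything above 15 OOMs on this GPU
    "--batchsize", "32",                                        # reduced from the original 512
]
subprocess.run(cmd, check=True)
```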
Environment
- GPU: NVIDIA RTX 3070 (16GB VRAM, compute capability 8.6)
- KoboldCpp version: 1.107
- OS: Windows 11 64-bit
- Model: Qwen3-Coder-Next-Q4_K_M (multi-part GGUF, 4.86 BPW, 80B parameters)
Question
Given my hardware constraint (16GB VRAM, `--gpulayers` can only be set to 15 without OOM), could you guide me on how to configure KoboldCpp parameters (including but not limited to batch size, context size, CUDA-related flags, and memory optimization settings) to:
- Fix the "CUDA graphs disabled due to too many consecutive updates" issue?
- Reduce CPU usage and maximize GPU utilization within the 15 `--gpulayers` limit?
- Eliminate the frequent `state_write_data` logs and improve inference speed for this Qwen3-Coder-Next model?
Any general tuning principles or parameter strategies for 80B Q4_K_M models on 16GB GPUs would be highly appreciated.
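To make different settings comparable, generation speed can be measured with a small script along the lines of the sketch below. It assumes KoboldCpp's KoboldAI-compatible `/api/v1/generate` endpoint on the default port 5001; the prompt and token count are arbitrary.

```python
import json
import time
import urllib.request

# Minimal throughput check for comparing KoboldCpp settings.
# Assumes the server is running locally with its default HTTP API on port 5001.
URL = "http://localhost:5001/api/v1/generate"
payload = {
    "prompt": "Write a short docstring for a binary search function.",
    "max_length": 128,   # tokens to generate for the timing run
    "temperature": 0.7,
}

start = time.time()
req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)
elapsed = time.time() - start

text = result["results"][0]["text"]
# Rough tokens/sec: assumes close to max_length tokens were actually produced
# (generation may stop earlier on an end-of-sequence token).
print(f"Generated up to {payload['max_length']} tokens in {elapsed:.1f}s "
      f"(~{payload['max_length'] / elapsed:.2f} tok/s)")
print(text[:200])
```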