Describe the Issue
I'm testing various settings using the Run Benchmark option and ticked the Use QuantMatMul option. It started out using roughly the same amount of memory as the same settings with QuantMatMul disabled, but over time it kept dumping more and more data into shared VRAM. Dedicated GPU VRAM maxed out at about 6.51GB and dropped to 6.3GB when the processing stage finished, while shared VRAM started at around 50MB and ballooned to 9.16GB by the end of the processing stage. (It also ended up using 18.66GB of RAM.) For reference, the same settings without QuantMatMul resulted in 7.34GB of RAM, 4.05GB of GPU VRAM, and 0.04GB of shared VRAM by the end of the benchmark.
From my understanding, QuantMatMul is supposed to save memory, not steadily inflate memory usage over time.
Additional Information:
64-bit Windows 10, Intel 10600K CPU (running at stock speeds), 8GB AMD RX 6650XT GPU, 128GB DDR4 RAM, using the hipBLAS driver, no pagefile.
Model: Bartowski's IQ3_XXS build of Qwen3 235B A22B
KoboldCPP settings:
- 5 GPU Layers
- 16384 context
- MMAP enabled
- 8 CPU threads and 8 BLAS threads
- 512 BLAS batch size
- FastForwarding enabled
- 6 Experts
- 2 CPU expert layers
- Tensor override:
(blk\.\d+\.(ffn_down|ffn_gate_exps|ffn_up_exps)\.weight)|(output\.weight)=CPU
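For clarity, here's a quick sanity check (a minimal sketch, not KoboldCPP code) of which tensor names the override pattern above would route to CPU. The tensor names below are illustrative llama.cpp-style examples, not names dumped from this model:

```python
import re

# The part before "=CPU" is the regex; "CPU" is the target backend for matching tensors.
override = r"(blk\.\d+\.(ffn_down|ffn_gate_exps|ffn_up_exps)\.weight)|(output\.weight)=CPU"
pattern, _, device = override.rpartition("=")

# Illustrative tensor names (not dumped from the actual GGUF).
examples = [
    "blk.0.ffn_down.weight",
    "blk.10.ffn_gate_exps.weight",
    "blk.10.ffn_up_exps.weight",
    "blk.10.attn_q.weight",
    "output.weight",
]

for name in examples:
    target = device if re.search(pattern, name) else "GPU (default)"
    print(f"{name:32s} -> {target}")
```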
Edit: I'm noticing that the memory explosion doesn't happen if I enable both QuantMatMul and Flash Attention, only when I use QuantMatMul alone.