Name and Version
build/bin/llama-cli --version
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7800 XT, gfx1101 (0x1101), VMM: no, Wave Size: 32
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7800 XT (RADV NAVI32) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
version: 8179 (ecbcb7ea9)
built with GNU 14.2.0 for Linux x86_64
Operating systems
Linux
GGML backends
HIP
Hardware
Radeon RX 7800 XT 16GB
Ryzen 2700X on X470, 16GB RAM
Models
bartowski/Qwen_Qwen3.5-27B-GGUF:IQ3_XS
Problem description & steps to reproduce
The model file is ~10.8GB.
With ROCm and a large prompt/context, VRAM usage creeps up steadily during prompt processing until the process crashes. With Vulkan, VRAM usage stays constant and the run completes.
Command:
CUDA_VISIBLE_DEVICES=0 GGML_VISIBLE_DEVICES="" build/bin/llama-bench -v -m ~/.cache/llama.cpp/bartowski_Qwen_Qwen3.5-27B-GGUF_Qwen_Qwen3.5-27B-IQ3_XS.gguf -ngl 99 -fa 1 -ctk q4_0 -ctv q4_0 -p 65536
Watching with amdgpu_top --smi, VRAM usage starts at 12600MB (10MB GTT) and then slowly climbs. Once it reaches the card's VRAM capacity, the run crashes with this error:
Error + Stacktrace
```
ROCm error: out of memory
current device: 0, in function alloc at /home/martin/upstream/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:393
ggml_cuda_device_malloc(&ptr, look_ahead_size, device)
/home/martin/upstream/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:97: ROCm error
[New LWP 773974]
[New LWP 773973]
[New LWP 773971]
[New LWP 773970]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56 ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0 __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56 in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1 0x00007fb9a4a9a668 in __internal_syscall_cancel (a1=, a2=, a3=, a4=, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:49
warning: 49 ./nptl/cancellation.c: No such file or directory
#2 0x00007fb9a4a9a6ad in __syscall_cancel (a1=, a2=, a3=, a4=, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75 in ./nptl/cancellation.c
#3 0x00007fb9a4b05787 in __GI___wait4 (pid=, stat_loc=, options=, usage=) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4 0x00007fb9a5334deb in ggml_print_backtrace () from /home/martin/upstream/llama.cpp/build/bin/libggml-base.so.0
#5 0x00007fb9a5334f3e in ggml_abort () from /home/martin/upstream/llama.cpp/build/bin/libggml-base.so.0
#6 0x00007fb9a408dd82 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /home/martin/upstream/llama.cpp/build/bin/libggml-hip.so.0
#7 0x00007fb9a40a16f7 in ggml_cuda_pool_leg::alloc(unsigned long, unsigned long*) () from /home/martin/upstream/llama.cpp/build/bin/libggml-hip.so.0
#8 0x00007fb9a41ff3d1 in void launch_fattn<256, 16, 2>(ggml_backend_cuda_context&, ggml_tensor*, void (*)(char const*, char const*, char const*, char const*, char const*, int const*, float*, HIP_vector_type*, float, float, float, float, unsigned int, float, int, HIP_vector_type, int, int, int, int, int, int, int, int, int, int, int, long, int, int, long, int, int, int, int, int, long), int, unsigned long, int, bool, bool, bool, int) () from /home/martin/upstream/llama.cpp/build/bin/libggml-hip.so.0
#9 0x00007fb9a41f47d1 in void ggml_cuda_flash_attn_ext_tile_case<256, 256>(ggml_backend_cuda_context&, ggml_tensor*) () from /home/martin/upstream/llama.cpp/build/bin/libggml-hip.so.0
#10 0x00007fb9a4095637 in ggml_cuda_graph_evaluate_and_capture(ggml_backend_cuda_context*, ggml_cgraph*, bool, bool, void const*) () from /home/martin/upstream/llama.cpp/build/bin/libggml-hip.so.0
#11 0x00007fb9a4093311 in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /home/martin/upstream/llama.cpp/build/bin/libggml-hip.so.0
#12 0x00007fb9a53505f7 in ggml_backend_sched_graph_compute_async () from /home/martin/upstream/llama.cpp/build/bin/libggml-base.so.0
#13 0x00007fb9a50ac871 in llama_context::graph_compute(ggml_cgraph*, bool) () from /home/martin/upstream/llama.cpp/build/bin/libllama.so.0
#14 0x00007fb9a50aee24 in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /home/martin/upstream/llama.cpp/build/bin/libllama.so.0
#15 0x00007fb9a50b428f in llama_context::decode(llama_batch const&) () from /home/martin/upstream/llama.cpp/build/bin/libllama.so.0
#16 0x00007fb9a50b5c9b in llama_decode () from /home/martin/upstream/llama.cpp/build/bin/libllama.so.0
#17 0x0000555c314e12ce in test_prompt(llama_context*, int, int, int) ()
#18 0x0000555c314ddb42 in main ()
[Inferior 1 (process 773967) detached]
```
The crash is directly related to prompt size: with a small enough -p the run finishes. llama-server with the same settings also runs fine until the first request with a too-large context comes in, then it crashes the same way. Neither -ngl auto nor --fit 1 takes this additional VRAM usage into account.
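To narrow down the prompt size at which the ROCm run starts failing, a sweep over -p could look roughly like the sketch below. It reuses the flags and environment variables from the command above; the list of sizes and the helper names (`bench_cmd`, `first_failing`, `run_rocm`) are made up for illustration, and a non-zero exit code is assumed to mean the run crashed.

```python
import os
import subprocess

# Assumption: same binary and model paths as in this report; adjust as needed.
BENCH = "build/bin/llama-bench"
MODEL = os.path.expanduser(
    "~/.cache/llama.cpp/bartowski_Qwen_Qwen3.5-27B-GGUF_Qwen_Qwen3.5-27B-IQ3_XS.gguf"
)


def bench_cmd(n_prompt):
    """Build the llama-bench command from this report for a given -p value."""
    return [BENCH, "-m", MODEL, "-ngl", "99", "-fa", "1",
            "-ctk", "q4_0", "-ctv", "q4_0", "-p", str(n_prompt)]


def first_failing(sizes, run):
    """Return the first size for which run(size) is False, or None if all pass."""
    for n in sizes:
        if not run(n):
            return n
    return None


def run_rocm(n_prompt):
    """Run one benchmark on the ROCm device; True if it exited cleanly."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES="0", GGML_VISIBLE_DEVICES="")
    return subprocess.run(bench_cmd(n_prompt), env=env).returncode == 0


if __name__ == "__main__":
    # Hypothetical sweep; the report only confirms that 65536 fails.
    sizes = [4096, 8192, 16384, 32768, 65536]
    print("first failing -p:", first_failing(sizes, run_rocm))
```

Each step reloads the model, so a coarse list of sizes keeps the sweep short.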
Bonus question: why does ROCm not page/spill over to system RAM as Vulkan does?
The ROCm run also uses 200% CPU constantly for the whole run.
With Vulkan:
CUDA_VISIBLE_DEVICES="" GGML_VISIBLE_DEVICES=0 build/bin/llama-bench -v -m ~/.cache/llama.cpp/bartowski_Qwen_Qwen3.5-27B-GGUF_Qwen_Qwen3.5-27B-IQ3_XS.gguf -ngl 99 -fa 1 -ctk q4_0 -ctv q4_0 -p 65536
This uses a constant 12200MB of VRAM plus 1000MB of GTT during the run. CPU usage peaks at only 50%.
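For anyone who wants to record the VRAM growth instead of watching amdgpu_top, here is a minimal sketch that polls the amdgpu sysfs counters once per second while the benchmark runs in another terminal. It assumes the GPU is exposed as card0 (the index may differ on other systems); the helper names `read_mib` and `sample` are made up for illustration.

```python
import time
from pathlib import Path

# Assumption: the RX 7800 XT is card0; check /sys/class/drm/ on other systems.
DEVICE_DIR = Path("/sys/class/drm/card0/device")


def read_mib(path):
    """Read a byte counter from a sysfs file and convert it to MiB."""
    return int(Path(path).read_text().strip()) // (1024 * 1024)


def sample(device_dir=DEVICE_DIR):
    """Return (VRAM used, GTT used) in MiB as reported by the amdgpu driver."""
    d = Path(device_dir)
    return read_mib(d / "mem_info_vram_used"), read_mib(d / "mem_info_gtt_used")


if __name__ == "__main__":
    start = time.time()
    while True:  # stop with Ctrl+C once the benchmark has finished or crashed
        vram, gtt = sample()
        print(f"{time.time() - start:7.1f}s  VRAM {vram} MiB  GTT {gtt} MiB",
              flush=True)
        time.sleep(1.0)
```

Redirecting the output to a file for both the ROCm and the Vulkan run would give two logs that can be compared directly.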
The only remotely related issue I found is #4946, but it is old and closed.
First Bad Commit
No response
Relevant log output
(see above)