Eval bug: Memory leak? using ROCm

### Name and Version

```
build/bin/llama-cli --version
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7800 XT, gfx1101 (0x1101), VMM: no, Wave Size: 32
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7800 XT (RADV NAVI32) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
version: 8179 (ecbcb7ea9)
built with GNU 14.2.0 for Linux x86_64
```

### Operating systems

Linux

### GGML backends

HIP

### Hardware

Radeon RX 7800 XT 16GB
Ryzen 2700X on X470, 16GB RAM

### Models

bartowski/Qwen_Qwen3.5-27B-GGUF:IQ3_XS

### Problem description & steps to reproduce

This model comes in at ~10.8GB.

With ROCm and a large prompt/context, VRAM usage slowly creeps up during prompt processing until the run crashes. With Vulkan, VRAM usage stays low and the run completes.

Command:
```
CUDA_VISIBLE_DEVICES=0 GGML_VISIBLE_DEVICES=""  build/bin/llama-bench -v -m ~/.cache/llama.cpp/bartowski_Qwen_Qwen3.5-27B-GGUF_Qwen_Qwen3.5-27B-IQ3_XS.gguf -ngl 99 -fa 1 -ctk q4_0 -ctv q4_0 -p 65536
```

Watching with `amdgpu_top --smi`, VRAM usage starts out at 12600MB (10MB GTT) and then slowly creeps up. Once usage reaches the card's total VRAM, the run crashes with this error:

<details>
<summary>Error + Stacktrace</summary>

```
ROCm error: out of memory
  current device: 0, in function alloc at /home/martin/upstream/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:393
  ggml_cuda_device_malloc(&ptr, look_ahead_size, device)
/home/martin/upstream/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:97: ROCm error
[New LWP 773974]
[New LWP 773973]
[New LWP 773971]
[New LWP 773970]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56     ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0  __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56      in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1  0x00007fb9a4a9a668 in __internal_syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:49
warning: 49     ./nptl/cancellation.c: No such file or directory
#2  0x00007fb9a4a9a6ad in __syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75      in ./nptl/cancellation.c
#3  0x00007fb9a4b05787 in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30     ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4  0x00007fb9a5334deb in ggml_print_backtrace () from /home/martin/upstream/llama.cpp/build/bin/libggml-base.so.0
#5  0x00007fb9a5334f3e in ggml_abort () from /home/martin/upstream/llama.cpp/build/bin/libggml-base.so.0
#6  0x00007fb9a408dd82 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /home/martin/upstream/llama.cpp/build/bin/libggml-hip.so.0
#7  0x00007fb9a40a16f7 in ggml_cuda_pool_leg::alloc(unsigned long, unsigned long*) () from /home/martin/upstream/llama.cpp/build/bin/libggml-hip.so.0
#8  0x00007fb9a41ff3d1 in void launch_fattn<256, 16, 2>(ggml_backend_cuda_context&, ggml_tensor*, void (*)(char const*, char const*, char const*, char const*, char const*, int const*, float*, HIP_vector_type<float, 2u>*, float, float, float, float, unsigned int, float, int, HIP_vector_type<unsigned int, 3u>, int, int, int, int, int, int, int, int, int, int, int, long, int, int, long, int, int, int, int, int, long), int, unsigned long, int, bool, bool, bool, int) () from /home/martin/upstream/llama.cpp/build/bin/libggml-hip.so.0
#9  0x00007fb9a41f47d1 in void ggml_cuda_flash_attn_ext_tile_case<256, 256>(ggml_backend_cuda_context&, ggml_tensor*) () from /home/martin/upstream/llama.cpp/build/bin/libggml-hip.so.0
#10 0x00007fb9a4095637 in ggml_cuda_graph_evaluate_and_capture(ggml_backend_cuda_context*, ggml_cgraph*, bool, bool, void const*) () from /home/martin/upstream/llama.cpp/build/bin/libggml-hip.so.0
#11 0x00007fb9a4093311 in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /home/martin/upstream/llama.cpp/build/bin/libggml-hip.so.0
#12 0x00007fb9a53505f7 in ggml_backend_sched_graph_compute_async () from /home/martin/upstream/llama.cpp/build/bin/libggml-base.so.0
#13 0x00007fb9a50ac871 in llama_context::graph_compute(ggml_cgraph*, bool) () from /home/martin/upstream/llama.cpp/build/bin/libllama.so.0
#14 0x00007fb9a50aee24 in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /home/martin/upstream/llama.cpp/build/bin/libllama.so.0
#15 0x00007fb9a50b428f in llama_context::decode(llama_batch const&) () from /home/martin/upstream/llama.cpp/build/bin/libllama.so.0
#16 0x00007fb9a50b5c9b in llama_decode () from /home/martin/upstream/llama.cpp/build/bin/libllama.so.0
#17 0x0000555c314e12ce in test_prompt(llama_context*, int, int, int) ()
#18 0x0000555c314ddb42 in main ()
[Inferior 1 (process 773967) detached]
```
</details>

The crash is directly related to prompt size: with a small enough `-p`, the run finishes. `llama-server` with the same settings also runs fine until the first request with a too-large context comes in, then it crashes the same way. `-ngl auto` or `--fit 1` does not take this additional VRAM usage into account either.
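To narrow down where "small enough" ends, the benchmark could be swept over increasing prompt sizes. The following is only a rough sketch: the helper names (`bench_cmd`, `run_bench`, `first_failing_size`) are made up for illustration, the flags are copied from the command above, and it assumes that a crashed run exits with a non-zero status.

```python
import os
import subprocess

def bench_cmd(model_path, n_prompt, binary="build/bin/llama-bench"):
    """Build the llama-bench argv used in this report for one prompt size."""
    return [binary, "-m", model_path, "-ngl", "99", "-fa", "1",
            "-ctk", "q4_0", "-ctv", "q4_0", "-p", str(n_prompt)]

def run_bench(model_path, n_prompt, extra_env=None):
    """Run one benchmark; True if the process exited with status 0.

    Assumption: an out-of-memory abort shows up as a non-zero exit status.
    """
    env = {**os.environ, **(extra_env or {})}
    return subprocess.run(bench_cmd(model_path, n_prompt), env=env).returncode == 0

def first_failing_size(sizes, run):
    """Return the first size for which run(size) is False, or None if all pass."""
    for n in sizes:
        if not run(n):
            return n
    return None
```

For the ROCm case this might be called as `first_failing_size([8192, 16384, 32768, 65536], lambda n: run_bench(model, n, {"CUDA_VISIBLE_DEVICES": "0", "GGML_VISIBLE_DEVICES": ""}))`, where `model` is the path to the GGUF file and the size list is an arbitrary choice.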

Bonus question: why does it not page/spill over to system RAM as Vulkan does?
The ROCm run also keeps CPU usage at a constant 200% for the whole run.

With Vulkan:
```
CUDA_VISIBLE_DEVICES="" GGML_VISIBLE_DEVICES=0  build/bin/llama-bench -v -m ~/.cache/llama.cpp/bartowski_Qwen_Qwen3.5-27B-GGUF_Qwen_Qwen3.5-27B-IQ3_XS.gguf -ngl 99 -fa 1 -ctk q4_0 -ctv q4_0 -p 65536
```
This uses a constant 12200MB of VRAM plus 1000MB of GTT during the run. CPU usage peaks at only 50%.
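To put numbers on the difference between the two runs instead of eyeballing `amdgpu_top`, VRAM and GTT usage could be logged from the amdgpu sysfs counters while the benchmark runs. This is a minimal sketch, not something used for the figures above: it assumes the card is exposed as `/sys/class/drm/card0` (it may be `card1` on other systems), and the function names and one-second interval are arbitrary. Values are printed in MiB.

```python
import time
from pathlib import Path

# Assumption: the RX 7800 XT is card0; adjust the index if needed.
DEVICE_DIR = Path("/sys/class/drm/card0/device")

def read_mib(path):
    """Read a byte counter from an amdgpu sysfs file and return whole MiB."""
    return int(Path(path).read_text().strip()) // (1024 * 1024)

def sample(device_dir=DEVICE_DIR):
    """Return (vram_used_mib, gtt_used_mib) at this moment."""
    device_dir = Path(device_dir)
    return (read_mib(device_dir / "mem_info_vram_used"),
            read_mib(device_dir / "mem_info_gtt_used"))

def growth_per_sample(values):
    """Average increase between consecutive samples (0.0 for fewer than 2)."""
    if len(values) < 2:
        return 0.0
    return (values[-1] - values[0]) / (len(values) - 1)

def monitor(samples, interval_s=1.0, device_dir=DEVICE_DIR):
    """Print one line per sample and return the list of VRAM readings."""
    vram_log = []
    for _ in range(samples):
        vram, gtt = sample(device_dir)
        vram_log.append(vram)
        print(f"{time.strftime('%H:%M:%S')} vram={vram}MiB gtt={gtt}MiB", flush=True)
        time.sleep(interval_s)
    return vram_log
```

Running `growth_per_sample(monitor(600))` in a second terminal during each benchmark would give an average growth rate per second; a value near zero for Vulkan and a clearly positive one for ROCm would match the behaviour described here.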

The only remotely related issue I found is https://github.com/ggml-org/llama.cpp/issues/4946, but it is old and closed.

### First Bad Commit

_No response_

### Relevant log output

(see above)
