Running llama.cpp build #5832 (commit 9731134).
I'm trying to load a model across two GPUs with Vulkan; the GPUs have 20 GB and 11 GB of VRAM.
Loading a Q6_K quant of size 26.27 GiB (6.56 BPW) with -ts "20,11" -c 512 yields:
ggml ctx size = 0.62 MiB
offloading 60 repeating layers to GPU
offloading non-repeating layers to GPU
offloaded 61/61 layers to GPU
Vulkan0 buffer size = 17458.44 MiB
Vulkan1 buffer size = 9088.14 MiB
CPU buffer size = 358.90 MiB
Vulkan0 KV buffer size = 80.00 MiB
Vulkan1 KV buffer size = 40.00 MiB
KV self size = 120.00 MiB, K (f16): 60.00 MiB, V (f16): 60.00 MiB
Vulkan_Host input buffer size = 16.01 MiB
Vulkan0 compute buffer size = 113.00 MiB
Vulkan1 compute buffer size = 139.00 MiB
Vulkan_Host compute buffer size = 14.00 MiB
ggml_vulkan: Device memory allocation of size 120422400 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
The math doesn't seem to add up.
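To show why I think so, here is a rough tally of the per-device numbers from the log above (a back-of-the-envelope sketch that only sums the buffers the log reports and ignores any driver/allocator overhead or fragmentation):

```python
# Rough tally of the per-device allocations reported in the log (all values in MiB).
vulkan0 = 17458.44 + 80.00 + 113.00   # model buffer + KV buffer + compute buffer
vulkan1 = 9088.14 + 40.00 + 139.00

print(f"Vulkan0 total: {vulkan0:.2f} MiB (~{vulkan0 / 1024:.2f} GiB of 20 GB)")
print(f"Vulkan1 total: {vulkan1:.2f} MiB (~{vulkan1 / 1024:.2f} GiB of 11 GB)")

# The allocation that actually fails is tiny by comparison:
failed_bytes = 120422400
print(f"Failed allocation: {failed_bytes / 1024**2:.2f} MiB")  # ~114.84 MiB
```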
A Q5_K_M quant at 22.65 GiB (5.66 BPW)
works perfectly fine until I increase the context to 4096.
This can't possibly be the context, right? With HIP on smaller models I have to push much harder to hit an OOM, so 31 GB of VRAM should be plenty here.
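As a quick sanity check on that, assuming the f16 KV cache grows linearly with context length (and setting aside that the compute buffers grow somewhat too):

```python
# KV self size reported at -c 512 is 120 MiB; scale linearly to -c 4096.
kv_at_512 = 120.0                       # MiB, from the log above
kv_at_4096 = kv_at_512 * (4096 / 512)   # = 960 MiB
print(f"Estimated KV cache at -c 4096: {kv_at_4096:.0f} MiB")
# Even allowing for larger compute buffers on top of that, this is still small
# relative to the remaining VRAM, so context alone doesn't seem to explain the OOM.
```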
Any idea why this happens?