Multi GPU with Vulkan out of memory issue. #5848

Closed
@lastrosade

Description

Running llama.cpp #5832 (9731134)

I'm trying to load a model on two GPUs with Vulkan.

My GPUs have 20 GB and 11 GB of VRAM.

Loading a Q6_K quant of size 26.27 GiB (6.56 BPW) with -ts "20,11" -c 512 (full command sketched after the log) yields:

```
ggml ctx size =    0.62 MiB
offloading 60 repeating layers to GPU
offloading non-repeating layers to GPU
offloaded 61/61 layers to GPU
   Vulkan0 buffer size = 17458.44 MiB
   Vulkan1 buffer size =  9088.14 MiB
       CPU buffer size =   358.90 MiB
Vulkan0 KV buffer size =    80.00 MiB
Vulkan1 KV buffer size =    40.00 MiB
KV self size  =  120.00 MiB, K (f16):   60.00 MiB, V (f16):   60.00 MiB
Vulkan_Host input buffer size   =    16.01 MiB
   Vulkan0 compute buffer size =   113.00 MiB
   Vulkan1 compute buffer size =   139.00 MiB
Vulkan_Host compute buffer size =    14.00 MiB
ggml_vulkan: Device memory allocation of size 120422400 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
```
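
For reference, the invocation is along these lines; the binary name, model path, and -ngl value below are placeholders I'm filling in rather than copied from the log, the relevant parts being the Vulkan build, the tensor split, and the context size:

```sh
# hypothetical reproduction; point -m at the actual 26.27 GiB Q6_K file
./main -m ./models/model-Q6_K.gguf -ngl 99 -ts "20,11" -c 512 -p "test"
```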

The math doesn't seem to add up.
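
Summing the allocations reported above, per device (my arithmetic):

- Vulkan0: 17458.44 + 80.00 + 113.00 = 17651.44 MiB, about 17.2 GiB on a 20 GB card
- Vulkan1: 9088.14 + 40.00 + 139.00 = 9267.14 MiB, about 9.1 GiB on an 11 GB card
- The allocation that fails is 120422400 bytes, about 114.8 MiB

Even counting that failed ~115 MiB block, the totals stay well below each card's capacity.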

A Q5_K_M quant at 22.65 GiB (5.66 BPW) works perfectly fine until I increase the context to 4096.
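
For scale, the f16 KV cache grows linearly with context, so going from -c 512 (120 MiB total above) to -c 4096 should only need about 8x that, roughly 960 MiB split across the two cards, plus somewhat larger compute buffers.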

This can't possibly be the context, right? When using HIP on smaller models, I have to push it much harder to OOM; with 31 GB of VRAM combined I should be fine here.
Any idea why this happens?
