I'm running into "failed to allocate compute pp buffers" with configurations that worked fine on versions 1.99 and below. To run 1.100 I have to drop layers and reduce context size to make everything fit. Comparing the two logs below (same model, same settings), the requested compute buffer has grown from 1153.75 MiB to roughly 1701 MiB, an increase of about 47%, and since I run things pretty tight the failure itself is no surprise. Is that increase expected for some reason, and/or am I being stupid?
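Here's the back-of-the-envelope I'm basing that on, using the sizes straight from the two logs below (1.99.4 reports the compute buffer in MiB, 1.100 reports the failed request in bytes):

```python
# Sizes taken from the logs below:
# 1.99.4: "Vulkan0 compute buffer size =  1153.75 MiB"
# 1.100:  "failed to allocate Vulkan0 buffer of size 1784127488"
old_mib = 1153.75
new_mib = 1_784_127_488 / 1024**2  # bytes -> MiB
print(f"{new_mib:.2f} MiB requested, +{(new_mib / old_mib - 1) * 100:.0f}%")
# -> 1701.48 MiB requested, +47%
```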
Welcome to KoboldCpp - Version 1.100...
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: relocated tensors: 12 of 291
PrefetchVirtualMemory skipped in compatibility mode.
load_tensors: offloading 31 repeating layers to GPU
load_tensors: offloaded 31/33 layers to GPU
load_tensors:      Vulkan0 model buffer size =  3860.00 MiB
load_tensors:   CPU_Mapped model buffer size =  4685.30 MiB
Automatic RoPE Scaling: Using (scale:1.000, base:1776946.1).
llama_init_from_model: model default pooling_type is [0], but [-1] was specified
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 16512
llama_context: n_ctx_per_seq = 16512
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = disabled
llama_context: kv_unified    = true
llama_context: freq_base     = 1776946.1
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (16512) > n_ctx_train (8192) -- possible training context overflow
set_abort_callback: call
llama_context:        CPU  output buffer size =     0.49 MiB
create_memory: n_ctx = 16512 (padded)
llama_kv_cache:    Vulkan0 KV buffer size =  1999.50 MiB
llama_kv_cache:        CPU KV buffer size =    64.50 MiB
llama_kv_cache: size = 2064.00 MiB ( 16512 cells,  32 layers,  1/1 seqs), K (f16): 1032.00 MiB, V (f16): 1032.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 2328
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
ggml_vulkan: Device memory allocation of size 1082130432 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 1784127488
graph_reserve: failed to allocate compute buffers
llama_init_from_model: failed to initialize the context: failed to allocate compute pp buffers
gpttype_load_model: error: failed to load model
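For what it's worth, the KV cache line is byte-for-byte identical in both logs (as are the model buffer sizes), so the whole increase is in the compute (prompt processing) buffer, not the cache. Incidentally, the device allocation that failed (1082130432 bytes) is exactly 1032 MiB. Here's a quick sanity check of the KV numbers; a sketch assuming the usual f16 layout of 2 bytes per element, with the per-cell element count inferred from the log rather than from the model card:

```python
# KV cache from the log: 16512 cells, 32 layers, K and V each 1032.00 MiB (f16)
cells, layers, bytes_per_elem = 16512, 32, 2   # f16 = 2 bytes per element
k_bytes = 1032 * 1024**2                       # 1082130432 bytes
print(k_bytes // (cells * layers * bytes_per_elem))
# -> 1024 elements per cell per layer, and the same again for V
```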
Welcome to KoboldCpp - Version 1.99.4.....
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: relocated tensors: 12 of 291
PrefetchVirtualMemory skipped in compatibility mode.
load_tensors: offloading 31 repeating layers to GPU
load_tensors: offloaded 31/33 layers to GPU
load_tensors:      Vulkan0 model buffer size =  3860.00 MiB
load_tensors:   CPU_Mapped model buffer size =  4685.30 MiB
Automatic RoPE Scaling: Using (scale:1.000, base:1776946.1).
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 16512
llama_context: n_ctx_per_seq = 16512
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = disabled
llama_context: kv_unified    = true
llama_context: freq_base     = 1776946.1
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (16512) > n_ctx_train (8192) -- possible training context overflow
set_abort_callback: call
llama_context:        CPU  output buffer size =     0.49 MiB
create_memory: n_ctx = 16512 (padded)
llama_kv_cache:    Vulkan0 KV buffer size =  1999.50 MiB
llama_kv_cache:        CPU KV buffer size =    64.50 MiB
llama_kv_cache: size = 2064.00 MiB ( 16512 cells,  32 layers,  1/1 seqs), K (f16): 1032.00 MiB, V (f16): 1032.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 2328
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
llama_context:    Vulkan0 compute buffer size =  1153.75 MiB
llama_context: Vulkan_Host compute buffer size =    44.26 MiB
llama_context: graph nodes  = 1126
llama_context: graph splits = 15 (with bs=512), 3 (with bs=1)
Threadpool set to 4 threads and 8 blasthreads...
Power Throttling skipped in compatibility mode.
attach_threadpool: call
Starting model warm up, please wait a moment...
Load Text Model OK: True
Embedded KoboldAI Lite loaded.
Embedded API docs loaded.
EDIT: Version 1.100.1, same thing. Reverting to koboldcpp_vulkan_noavx2.dll from 1.99.4 fixes it, but I'm guessing that kills the image-gen-related changes? (Yep, obviously.) I haven't made any serious attempt at torturing my old hardware with any of that yet. ;)