@Acly commented Oct 26, 2025

Small improvement to graph allocation with multiple buffers/chunks:

When a tensor is allocated and no free block fits it, the current implementation allocates additional memory in the first chunk that can still accommodate the tensor within its maximum size. The last block of a chunk can contain both reusable memory (previously allocated and then freed) and memory that has not been allocated yet. This PR prioritizes chunks whose reusable memory already fits the tensor, which reduces the total allocation size.

See #16759 for an example.
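
Below is a minimal sketch of the chunk-selection idea. The `chunk_t` struct and its `committed`/`reusable`/`max_size` fields are hypothetical illustrations of the description above, not the actual ggml-alloc data structures.

```c
#include <stddef.h>

// Hypothetical per-chunk bookkeeping (not the real ggml-alloc types).
typedef struct {
    size_t committed; // high-water mark: memory the chunk will actually allocate
    size_t reusable;  // tail of `committed` that was freed and can be reused
    size_t max_size;  // hard limit the chunk may grow to
} chunk_t;

// Pick a chunk to place a tensor in when no existing free block fits.
// Chunks whose reusable tail already covers the tensor are preferred,
// since placing the tensor there does not grow the total allocation.
// Returns -1 if no chunk can hold the tensor and a new one is needed.
int pick_chunk(const chunk_t * chunks, int n_chunks, size_t size) {
    int fallback = -1;
    for (int i = 0; i < n_chunks; ++i) {
        size_t offset = chunks[i].committed - chunks[i].reusable;
        if (offset + size > chunks[i].max_size) {
            continue; // tensor cannot fit in this chunk at all
        }
        if (chunks[i].reusable >= size) {
            return i; // fits entirely in reusable memory: no growth needed
        }
        if (fallback < 0) {
            fallback = i; // first chunk that can fit by growing (previous behavior)
        }
    }
    return fallback;
}
```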

Vulkan compute buffer size for `llama-bench --model llama-2-7b.Q4_0.gguf --n-gpu-layers 19 --ubatch-size 512`:

| n-prompt | master | PR |
| --- | --- | --- |
| --n-prompt 12200 | 1003.88 MiB | 1003.88 MiB |
| --n-prompt 12500 | 1711.19 MiB | 1026.94 MiB |
| --n-prompt 13500 | 1844.88 MiB | 1106.38 MiB |
| --n-prompt 14500 | 1188.38 MiB | 1188.38 MiB |
| --n-prompt 15500 | 1267.81 MiB | 1267.81 MiB |

I tested some other models and they show similar behavior around the 1024 MiB threshold.

@Acly requested a review from slaren as a code owner October 26, 2025 18:25
@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Oct 26, 2025
@slaren merged commit 3470a5c into ggml-org:master Oct 26, 2025
72 checks passed
pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request Oct 27, 2025
theo77186 pushed a commit to theo77186/llama.cpp that referenced this pull request Oct 28, 2025

Development

Successfully merging this pull request may close these issues.

Vulkan: Odd compute buffer behaviors at specific context breakpoints version b6568 and above
