@Acly (Collaborator) commented Oct 29, 2025

This moves the buffer/uma/offset/size computations into a function, which greatly reduces the amount of repetitive code in ggml_vk_op_f32. There are more places outside of vk_op_f32 where it could be reused; I didn't adapt those yet.

I replicated the existing logic, but I don't really understand why the branching is needed for the sub-buffer size computation. From what I can see and test, it doesn't change the result. Maybe this can be simplified? Or is there some corner case I can't think of? (See questions in comments.)

    size_t size;
    if (support_incontiguous) {
        size = ggml_nbytes(tensor) + misalign_bytes;
        if (offset + size >= buffer->size) {
@Acly (Collaborator, Author):
This replicates the branch here https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-vulkan/ggml-vulkan.cpp#L8620-L8631

Why is this needed? The case offset+size > buffer->size sounds like it should never happen. In the case where they're equal, the branch computes the same value that's already there.

It might clamp the size to maxStorageBufferRange, but that seems like it would likely produce wrong results; shouldn't it assert/abort instead?

Collaborator:

I think some of this logic is only here for historical reasons at this point. Early on, some shaders didn't have thorough bounds checking logic and relied on robustBufferAccess and these tight bounds to accomplish bounds checking. But robustBufferAccess isn't as reliable as it sounds, and we've improved the bounds checking logic over time.

To my knowledge, the only remaining shaders that intentionally rely on robustBufferAccess bounds checking are the scalar/coopmat1 mul_mm shaders (not sure about mmq, but probably that too). These need to do the A/B loads in a way that allows the compiler to batch them, so manual bounds checking would probably interfere and cause a slowdown.

So, I think all of the current users of ggml_vk_op_f32 (and everything else except these mul_mm shaders) could just use ggml_vk_get_max_buffer_range.

Collaborator:

Yes, I created a lot of that code for a very different version of the backend, and also when I knew a lot less about Vulkan than I do now. It's very possible there are checks and branches that are no longer needed.

@Acly (Collaborator, Author):

Thanks for the feedback. I simplified the size to just ggml_nbytes(tensor) + misalign_bytes: this should be the actual bound, and it's what the other branches ended up computing anyway.

Using ggml_vk_get_max_buffer_range also works, but feels less straightforward, and it's still unclear to me if there's a reason to prefer it.

I've now adapted all ops except for the mul_mat variants. They use a more complex size computation, and I also want to avoid conflicts with #16868.

Collaborator:

ggml_vk_get_max_buffer_range is necessary for some operations where the size of the tensor is greater than 4GB (or greater than maxBufferRange), which would lead to an invalid descriptor range being set. Please try enabling these ifdef'd-out tests:

#if 0
    // >4GB im2col destination. Too slow to run by default.
    // Test cases taken from Wan2.1 T2V 1.3B.
...
#if 0
    // > 4GB A matrix. Too slow to be enabled by default.
...

@Acly (Collaborator, Author):

I thought that's what this line was for:

    } else if (op == GGML_OP_IM2COL || op == GGML_OP_IM2COL_3D) {
        if (ctx->device->shader_int64 && ctx->device->buffer_device_address) {
            // buffer device address path doesn't use dst buffer
            d_sz = 1;
        }

(and mul_mat I didn't change so far)

IMO it would be even better to not have the binding in the shader at all (put it in a #else). Why is it there if it cannot be bound properly and isn't used? Instead of setting d_sz = 1 it could just do a dispatch right there without the dst buffer.

Good hint about having to enable those tests, though; I hadn't done that. Unfortunately they just OOM on my 12GB GPU, so I will find some way to check them.

@Acly (Collaborator, Author):

The tests are passing with sufficient VRAM... sort of. I had to tone them down a bit, since some cases ran out of system RAM (48GB) instead and I didn't want to switch to yet another machine. They still used >>4GB buffers.

Collaborator:

I looked back at my commit message for 2aaf0a2 when I added ggml_vk_get_max_buffer_range. The point was to stop using VK_WHOLE_SIZE since it could compute a size greater than maxStorageBufferRange. I think ggml_nbytes should be OK in the same circumstances, except for cases where ggml_nbytes can be greater than 4GB. Currently, that's just the im2col (which uses BDA) and the mul_mat (which you haven't changed) so I think this should be OK. ggml_vk_get_max_buffer_range is probably cheaper than ggml_nbytes, though.

@Acly (Collaborator, Author):

Yes, that's also what I concluded. The reason I prefer ggml_nbytes is that if it is >4GB for an op that doesn't support it, ggml_vk_get_max_buffer_range would silently clamp and probably produce wrong results that are difficult to debug. I think raising an error there is a good thing.

@github-actions bot added the "Vulkan" (issues specific to the Vulkan backend) and "ggml" (changes relating to the ggml tensor library for machine learning) labels on Oct 29, 2025.
@Acly force-pushed the vulkan-op-subbuffer branch from 09475f8 to edacb52 on October 31, 2025.
@Acly marked this pull request as ready for review on October 31, 2025.
        return vk_subbuffer{buffer, offset, size};
    }

    static vk_subbuffer ggml_vk_tensor_subbuffer_uma(
Collaborator:
IMO putting "uma" in the name is confusing. I'd probably combine these functions and not have the "uma" name.

@Acly (Collaborator, Author) commented Nov 2, 2025:
I made two functions initially because ggml_vk_op_f32 does not do UMA handling for the dst buffer, and I wanted to preserve that. Apart from some CPU cycles, it wouldn't hurt to have a single function, though. I'd also be in favor of combining them if nobody has objections.

    }

    vk_buffer d_D = dst_buf_ctx->dev_buffer;
    vk_subbuffer src0_buf = ggml_vk_tensor_subbuffer(ctx, src0, op_supports_incontiguous);
Collaborator:

op_supports_incontiguous isn't quite the same as allow_misalign, which depends on the shader. But maybe this is similar to what the original code did, at least based on the size handling.

@Acly (Collaborator, Author):

Yes, op_supports_incontiguous marks shaders which support (some, but not always all) permutations and/or views with offsets. I don't think there is anything more fine-grained.

I think it can be removed in favour of checking those things in ggml_backend_vk_device_supports_op, and rejecting ops with misaligned tensor views there if the shader doesn't support them. (But that's perhaps for another PR.)

@0cc4m (Collaborator) left a review:

Thank you, this is a nice change.

@0cc4m merged commit ac76d36 into ggml-org:master on Nov 7, 2025 (69 of 71 checks passed).