Skip to content

Metal: Fixes for dynamic buffers #3

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: matias-uma-pc-pr
Choose a base branch
from

Conversation

stuartcarnie
Copy link

Resolves all issues under Metal.

Note that I was required adding a couple of extra arguments to the uniform_sets_get_dynamic_offsets API, so that Metal can access the related shader and determine the layout of the offsets.

In summary:

  • My final implementation is similar to my original, in that it does not update argument buffers unnecessarily, and only updates the slots containing the dynamic buffers. I use the p_dynamic_offsets that you added, which is captured by the render graph.
  • My implementation removes reduces unnecessary allocations during dynamic buffer updates and when binding uniforms sets, in favour of baking the dynamic offsets when creating the shaders from byte code.
  • Tested tps-demo with argument buffers and slot binding modes, and no more artefacts.

Comment on lines -552 to +556
virtual uint32_t uniform_sets_get_dynamic_offsets(VectorView<UniformSetID> p_uniform_sets, uint32_t p_set_count) const = 0;
virtual uint32_t uniform_sets_get_dynamic_offsets(VectorView<UniformSetID> p_uniform_sets, ShaderID p_shader, uint32_t p_first_set_index, uint32_t p_set_count) const = 0;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is necessary that we know what shader and the starting set index, so that Metal can derive the correct offsets

Comment on lines -2099 to +2104
bool disable_argument_buffers = false;
if (String v = OS::get_singleton()->get_environment(U"GODOT_DISABLE_ARGUMENT_BUFFERS"); v == U"1") {
disable_argument_buffers = true;
}

if (device_properties->features.argument_buffers_tier >= MTLArgumentBuffersTier2 && !disable_argument_buffers) {
if (device_properties->features.max_buffers_tier >= MTLArgumentBuffersTier2) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this logic to metal_device_properties.mm, so that it doesn't check each time a shader is compiled.

render.dirty.set_flag(RenderState::DIRTY_UNIFORMS);
render.uniform_set_mask |= UINT64_MAX; // We don't know which ones are dirty. Potentially all of them.
render.dynamic_offsets_mask = p_dynamic_offsets;
render.dynamic_offsets |= p_dynamic_offsets;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple calls to render_bind_uniform_sets may occur before the draw call, so we need merge the value of p_dynamic_offsets. Each nibble refers to the frame_index captured at bind time, which you added in your recent commit.

}

if (render.uniform_sets[index] != set) {
if (render.uniform_sets[index] != set || layout.get_count(index) > 0) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

layout.get_count(index) returns how many dynamic buffers are used for index (uniform set index). If it is > 0 then we need to mark the uniform set as dirty.

Comment on lines +1424 to +1426
if (bs.is_dynamic()) {
update_dynamic_uniforms(p_shader, p_resource_usage, p_set_index, bs, p_dynamic_offsets, p_frame_idx);
}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation is more efficient, as it only updates the offsets for dynamic buffers.

A single large MTLBuffer is allocated to store the argument tables, that is sized by rendering/rendering_device/vsync/frame_queue_size * argument_table_size. The previous approach would allocate a separate MTLBuffer for every combination of frame indexes. For example, if the rendering/rendering_device/vsync/frame_queue_size is 2 and there are 3 dynamic buffers in the argument table, that would allocate and copy 6 distinct MTLBuffer instances. If the frame_queue_size is 3, it would result in 27 separate buffer allocations and distinct argument tables. This approach limits it to a single buffer and only frame_queue_size * argument_table_size vs (for example) 27 * argument_table_size.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pathological case would be the maximum number of dynamic buffers (8) and max frame_queue_size (3), which would result in up to 38 (6,561) individual allocations and associated bookkeeping.

@darksylinc
Copy link
Owner

Hi Stuart!

Sorry for getting back to you so late.

I looked into your solution, but it reintroduces one of the 2 original bugs that my changes tried to address.

It's summarized by your explanation:

If the frame_queue_size is 3, it would result in 27 separate buffer allocations and distinct argument tables. This approach limits it to a single buffer and only frame_queue_size * argument_table_size vs (for example) 27 * argument_table_size.

That is the problem.

We can't have just 3. We might need more than 3.
Of course worst-case scenario we need up to 27 separate buffers (or 27x the size of a single buffer). Most likely 3 is enough; and we might need 4 or 5. But the Metal code is not handling the case.

@darksylinc
Copy link
Owner

Note that treating the edge case as an edge case is fine, e.g. using a temporary scratch buffer to handle these edge cases is ok.

@stuartcarnie
Copy link
Author

stuartcarnie commented Apr 7, 2025

@darksylinc no worries!

We might need more than 3.

I don't understand why or how we can need more than 3. Given our frame_queue_size is 3, that means at most there will be 3 inflight frames for the GPU to execute. This means there can only be 3 combinations of any number of argument buffers in flight also. Godot stalls if it begins a frame that has a fence:

https://github.com/godotengine/godot/blob/1f9eaae875af253aa0749252f0e7f70510615b1f/servers/rendering/rendering_device.cpp#L6530-L6535

which occurs after executing every frame:

https://github.com/godotengine/godot/blob/1f9eaae875af253aa0749252f0e7f70510615b1f/servers/rendering/rendering_device.cpp#L6513-L6515

Therefore, if it reaches 3, it won't add a 4th or 5th combination in-flight (etc). What am I missing as to how it is possible more than 3 combinations of argument tables are in flight. Once the GPU executes the frame, that argument table buffer is no longer needed.

Note

Metal will still increase the buffer size if we add support for a greater number in-flight frames in the future

@darksylinc
Copy link
Owner

darksylinc commented Apr 7, 2025

I don't understand why or how we can need more than 3. Given our frame_queue_size is 3, that means at most there will be 3 inflight frames for the GPU to execute. This means there can only be 3 combinations of any number of argument buffers in flight also. Godot stalls if it begins a frame that has a fence

Short version: Because we might send different frame_idx values during the same frame.

Example:

  1. I'm going to talk in pseudo code.
  2. I'll assume frame_queue_size = 3
  3. Assume that everything else is set up correctly (e.g. relevant PSO is bound correctly, vertex buffers, etc)
RID buffer;
RID uniform_set;

void init()
{
  buffer = rd->buffer_create( ... ); // persistent
  uniform_set = rd->uniform_set_create( ... ); // contains "buffer" somewhere
}

// Gets called once per frame
void frame_iterate()
{
  if( frame > 0u ) // frame is in range [0; inf)
  {
     rd->draw_list_begin();
     // We're using without uploading anything. i.e. it contains the data from previous frames.
     // This is a code smell and probably a bad idea, but it's not invalid.
     rd->draw_list_bind_uniform_set(draw_list, uniform_set, 0);
     rd->draw_list_draw(draw_list, ..., instance_count);
     rd->draw_list_end();
  }

   uint8_t *data = rd->buffer_persistent_map_advance(buffer);
   memcpy( data, src, size );

   rd->draw_list_begin();
   rd->draw_list_bind_uniform_set(draw_list, uniform_set, 0);
   rd->draw_list_draw(draw_list, ..., instance_count);
   rd->draw_list_end();
}

Now this edge case is quite special:

  1. It's rare, and probably a bad idea. May be a bug in the higher level code. But it's valid usage.
  2. buffer_persistent_map_advance() gets called no more than once per frame (it is invalid to call buffer_persistent_map_advance more than once per frame).
  3. The first draw_list_bind_uniform_set() call sees frame_idx = n, but the second draw_list_bind_uniform_set() call sees frame_idx = n + 1.
    • n is in range [0; frame_queue_size). When I say n + 1 I actually mean (n + 1) % frame_queue_size
  4. Both draw_list_bind_uniform_set() belong to the same frame.

Your Metal code will write frame_idx = n into the argument buffer, but then will overwrite that with frame_idx = n + 1 and submit to GPU. As a result, both draws will use n+1. Which is incorrect.

Sidenote:
Since PR godotengine#99257 Godot now submits twice per frame: The first time to render most stuff; the second time to render to the window. This helps lower bubbles waiting for swapchains to be released (see "Do not wait so long for swapchain" in the PR).

Thus the CPU while preparing for 2nd submit could potentially write to an argument buffer (I'm not just talking about persistent bit / dynamic buffers) while the GPU is also reading from that same buffer for the 1st submit. I don't know how Apple HW would respond to that race condition when the CPU is overwriting bytes with the same value as before (hopefully, nothing happens at all; but at worse some alignment issue kicks in and the write from CPU is not atomic, corrupting the argument buffer the GPU is trying to read).

@stuartcarnie
Copy link
Author

I see, in a single submission, the argument buffer could be mutated multiple times before the command buffer is submitted to the GPU. Note that it won't be mutated whilst the GPU is using it; however, as you said, it will refer to the last dynamic buffer, which may be wrong if your scenario is encountered.

I'll look at some options for that, and perhaps treat it as an edge case, and use a MTLHeap to allocate the argument buffers.

Thanks for the explanation, @darksylinc!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants