Metal: Fixes for dynamic buffers #3

stuartcarnie · 2025-03-30T05:06:11Z

Resolves all issues under Metal.

Note that I was required adding a couple of extra arguments to the uniform_sets_get_dynamic_offsets API, so that Metal can access the related shader and determine the layout of the offsets.

In summary:

My final implementation is similar to my original, in that it does not update argument buffers unnecessarily, and only updates the slots containing the dynamic buffers. I use the p_dynamic_offsets that you added, which is captured by the render graph.
My implementation removes reduces unnecessary allocations during dynamic buffer updates and when binding uniforms sets, in favour of baking the dynamic offsets when creating the shaders from byte code.
Tested tps-demo with argument buffers and slot binding modes, and no more artefacts.

stuartcarnie · 2025-03-30T05:07:57Z

servers/rendering/rendering_device_driver.h

-	virtual uint32_t uniform_sets_get_dynamic_offsets(VectorView<UniformSetID> p_uniform_sets, uint32_t p_set_count) const = 0;
+	virtual uint32_t uniform_sets_get_dynamic_offsets(VectorView<UniformSetID> p_uniform_sets, ShaderID p_shader, uint32_t p_first_set_index, uint32_t p_set_count) const = 0;


It is necessary that we know what shader and the starting set index, so that Metal can derive the correct offsets

stuartcarnie · 2025-03-30T05:09:35Z

drivers/metal/rendering_device_driver_metal.mm

-	bool disable_argument_buffers = false;
-	if (String v = OS::get_singleton()->get_environment(U"GODOT_DISABLE_ARGUMENT_BUFFERS"); v == U"1") {
-		disable_argument_buffers = true;
-	}
-
-	if (device_properties->features.argument_buffers_tier >= MTLArgumentBuffersTier2 && !disable_argument_buffers) {
+	if (device_properties->features.max_buffers_tier >= MTLArgumentBuffersTier2) {


I moved this logic to metal_device_properties.mm, so that it doesn't check each time a shader is compiled.

stuartcarnie · 2025-03-30T05:11:13Z

drivers/metal/metal_objects.mm

-		render.dirty.set_flag(RenderState::DIRTY_UNIFORMS);
-		render.uniform_set_mask |= UINT64_MAX; // We don't know which ones are dirty. Potentially all of them.
-		render.dynamic_offsets_mask = p_dynamic_offsets;
+	render.dynamic_offsets |= p_dynamic_offsets;


Multiple calls to render_bind_uniform_sets may occur before the draw call, so we need merge the value of p_dynamic_offsets. Each nibble refers to the frame_index captured at bind time, which you added in your recent commit.

stuartcarnie · 2025-03-30T05:12:08Z

drivers/metal/metal_objects.mm

-		}
-
-		if (render.uniform_sets[index] != set) {
+		if (render.uniform_sets[index] != set || layout.get_count(index) > 0) {


layout.get_count(index) returns how many dynamic buffers are used for index (uniform set index). If it is > 0 then we need to mark the uniform set as dirty.

stuartcarnie · 2025-03-30T05:23:51Z

drivers/metal/metal_objects.mm

+		if (bs.is_dynamic()) {
+			update_dynamic_uniforms(p_shader, p_resource_usage, p_set_index, bs, p_dynamic_offsets, p_frame_idx);
+		}


This implementation is more efficient, as it only updates the offsets for dynamic buffers.

A single large MTLBuffer is allocated to store the argument tables, that is sized by rendering/rendering_device/vsync/frame_queue_size * argument_table_size. The previous approach would allocate a separate MTLBuffer for every combination of frame indexes. For example, if the rendering/rendering_device/vsync/frame_queue_size is 2 and there are 3 dynamic buffers in the argument table, that would allocate and copy 6 distinct MTLBuffer instances. If the frame_queue_size is 3, it would result in 27 separate buffer allocations and distinct argument tables. This approach limits it to a single buffer and only frame_queue_size * argument_table_size vs (for example) 27 * argument_table_size.

The pathological case would be the maximum number of dynamic buffers (8) and max frame_queue_size (3), which would result in up to 3⁸ (6,561) individual allocations and associated bookkeeping.

darksylinc · 2025-04-07T16:32:21Z

Hi Stuart!

Sorry for getting back to you so late.

I looked into your solution, but it reintroduces one of the 2 original bugs that my changes tried to address.

It's summarized by your explanation:

If the frame_queue_size is 3, it would result in 27 separate buffer allocations and distinct argument tables. This approach limits it to a single buffer and only frame_queue_size * argument_table_size vs (for example) 27 * argument_table_size.

That is the problem.

We can't have just 3. We might need more than 3.
Of course worst-case scenario we need up to 27 separate buffers (or 27x the size of a single buffer). Most likely 3 is enough; and we might need 4 or 5. But the Metal code is not handling the case.

darksylinc · 2025-04-07T18:54:29Z

Note that treating the edge case as an edge case is fine, e.g. using a temporary scratch buffer to handle these edge cases is ok.

stuartcarnie · 2025-04-07T19:29:24Z

@darksylinc no worries!

We might need more than 3.

I don't understand why or how we can need more than 3. Given our frame_queue_size is 3, that means at most there will be 3 inflight frames for the GPU to execute. This means there can only be 3 combinations of any number of argument buffers in flight also. Godot stalls if it begins a frame that has a fence:

https://github.com/godotengine/godot/blob/1f9eaae875af253aa0749252f0e7f70510615b1f/servers/rendering/rendering_device.cpp#L6530-L6535

which occurs after executing every frame:

https://github.com/godotengine/godot/blob/1f9eaae875af253aa0749252f0e7f70510615b1f/servers/rendering/rendering_device.cpp#L6513-L6515

Therefore, if it reaches 3, it won't add a 4th or 5th combination in-flight (etc). What am I missing as to how it is possible more than 3 combinations of argument tables are in flight. Once the GPU executes the frame, that argument table buffer is no longer needed.

Note

Metal will still increase the buffer size if we add support for a greater number in-flight frames in the future

darksylinc · 2025-04-07T22:08:28Z

I don't understand why or how we can need more than 3. Given our frame_queue_size is 3, that means at most there will be 3 inflight frames for the GPU to execute. This means there can only be 3 combinations of any number of argument buffers in flight also. Godot stalls if it begins a frame that has a fence

Short version: Because we might send different frame_idx values during the same frame.

Example:

I'm going to talk in pseudo code.
I'll assume frame_queue_size = 3
Assume that everything else is set up correctly (e.g. relevant PSO is bound correctly, vertex buffers, etc)

RID buffer;
RID uniform_set;

void init()
{
  buffer = rd->buffer_create( ... ); // persistent
  uniform_set = rd->uniform_set_create( ... ); // contains "buffer" somewhere
}

// Gets called once per frame
void frame_iterate()
{
  if( frame > 0u ) // frame is in range [0; inf)
  {
     rd->draw_list_begin();
     // We're using without uploading anything. i.e. it contains the data from previous frames.
     // This is a code smell and probably a bad idea, but it's not invalid.
     rd->draw_list_bind_uniform_set(draw_list, uniform_set, 0);
     rd->draw_list_draw(draw_list, ..., instance_count);
     rd->draw_list_end();
  }

   uint8_t *data = rd->buffer_persistent_map_advance(buffer);
   memcpy( data, src, size );

   rd->draw_list_begin();
   rd->draw_list_bind_uniform_set(draw_list, uniform_set, 0);
   rd->draw_list_draw(draw_list, ..., instance_count);
   rd->draw_list_end();
}

Now this edge case is quite special:

It's rare, and probably a bad idea. May be a bug in the higher level code. But it's valid usage.
buffer_persistent_map_advance() gets called no more than once per frame (it is invalid to call buffer_persistent_map_advance more than once per frame).
The first draw_list_bind_uniform_set() call sees frame_idx = n, but the second draw_list_bind_uniform_set() call sees frame_idx = n + 1.
- n is in range [0; frame_queue_size). When I say n + 1 I actually mean (n + 1) % frame_queue_size
Both draw_list_bind_uniform_set() belong to the same frame.

Your Metal code will write frame_idx = n into the argument buffer, but then will overwrite that with frame_idx = n + 1 and submit to GPU. As a result, both draws will use n+1. Which is incorrect.

Sidenote:
Since PR godotengine#99257 Godot now submits twice per frame: The first time to render most stuff; the second time to render to the window. This helps lower bubbles waiting for swapchains to be released (see "Do not wait so long for swapchain" in the PR).

Thus the CPU while preparing for 2nd submit could potentially write to an argument buffer (I'm not just talking about persistent bit / dynamic buffers) while the GPU is also reading from that same buffer for the 1st submit. I don't know how Apple HW would respond to that race condition when the CPU is overwriting bytes with the same value as before (hopefully, nothing happens at all; but at worse some alignment issue kicks in and the write from CPU is not atomic, corrupting the argument buffer the GPU is trying to read).

stuartcarnie · 2025-04-08T04:21:10Z

I see, in a single submission, the argument buffer could be mutated multiple times before the command buffer is submitted to the GPU. Note that it won't be mutated whilst the GPU is using it; however, as you said, it will refer to the last dynamic buffer, which may be wrong if your scenario is encountered.

I'll look at some options for that, and perhaps treat it as an edge case, and use a MTLHeap to allocate the argument buffers.

Thanks for the explanation, @darksylinc!

Metal: Fixes for dynamic buffers

ed68ff6

stuartcarnie mentioned this pull request Mar 30, 2025

Add Persistent Buffers and Push Constants godotengine/godot#103779

Draft

stuartcarnie commented Mar 30, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Metal: Fixes for dynamic buffers #3

Metal: Fixes for dynamic buffers #3

stuartcarnie commented Mar 30, 2025

stuartcarnie Mar 30, 2025

stuartcarnie Mar 30, 2025

stuartcarnie Mar 30, 2025

stuartcarnie Mar 30, 2025

stuartcarnie Mar 30, 2025

stuartcarnie Mar 30, 2025

darksylinc commented Apr 7, 2025

darksylinc commented Apr 7, 2025

stuartcarnie commented Apr 7, 2025 •

edited

Loading

darksylinc commented Apr 7, 2025 •

edited

Loading

stuartcarnie commented Apr 8, 2025

		virtual uint32_t uniform_sets_get_dynamic_offsets(VectorView<UniformSetID> p_uniform_sets, uint32_t p_set_count) const = 0;
		virtual uint32_t uniform_sets_get_dynamic_offsets(VectorView<UniformSetID> p_uniform_sets, ShaderID p_shader, uint32_t p_first_set_index, uint32_t p_set_count) const = 0;

Metal: Fixes for dynamic buffers #3

Are you sure you want to change the base?

Metal: Fixes for dynamic buffers #3

Conversation

stuartcarnie commented Mar 30, 2025

stuartcarnie Mar 30, 2025

Choose a reason for hiding this comment

stuartcarnie Mar 30, 2025

Choose a reason for hiding this comment

stuartcarnie Mar 30, 2025

Choose a reason for hiding this comment

stuartcarnie Mar 30, 2025

Choose a reason for hiding this comment

stuartcarnie Mar 30, 2025

Choose a reason for hiding this comment

stuartcarnie Mar 30, 2025

Choose a reason for hiding this comment

darksylinc commented Apr 7, 2025

darksylinc commented Apr 7, 2025

stuartcarnie commented Apr 7, 2025 • edited Loading

darksylinc commented Apr 7, 2025 • edited Loading

stuartcarnie commented Apr 8, 2025

stuartcarnie commented Apr 7, 2025 •

edited

Loading

darksylinc commented Apr 7, 2025 •

edited

Loading