Metal: Fixes for dynamic buffers #3
Conversation
-virtual uint32_t uniform_sets_get_dynamic_offsets(VectorView<UniformSetID> p_uniform_sets, uint32_t p_set_count) const = 0;
+virtual uint32_t uniform_sets_get_dynamic_offsets(VectorView<UniformSetID> p_uniform_sets, ShaderID p_shader, uint32_t p_first_set_index, uint32_t p_set_count) const = 0;
We need to know which shader is being used and the starting set index, so that Metal can derive the correct offsets.
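To illustrate what that enables, here is a rough sketch of how the Metal driver might walk the sets using the new parameters (the helper name and the return-value semantics below are assumptions, not code from this PR):

// Hypothetical sketch only: the signature comes from this PR, but
// shader_get_dynamic_offset_count() and the idea that the return value is a
// simple count are assumptions for illustration.
uint32_t RenderingDeviceDriverMetal::uniform_sets_get_dynamic_offsets(VectorView<UniformSetID> p_uniform_sets, ShaderID p_shader, uint32_t p_first_set_index, uint32_t p_set_count) const {
	uint32_t count = 0;
	for (uint32_t i = 0; i < p_set_count; i++) {
		// The shader's reflection data describes which bindings of set
		// (p_first_set_index + i) are dynamic buffers, and in what order.
		count += shader_get_dynamic_offset_count(p_shader, p_first_set_index + i);
	}
	return count;
}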
-bool disable_argument_buffers = false;
-if (String v = OS::get_singleton()->get_environment(U"GODOT_DISABLE_ARGUMENT_BUFFERS"); v == U"1") {
-	disable_argument_buffers = true;
-}
-
-if (device_properties->features.argument_buffers_tier >= MTLArgumentBuffersTier2 && !disable_argument_buffers) {
+if (device_properties->features.max_buffers_tier >= MTLArgumentBuffersTier2) {
I moved this logic to metal_device_properties.mm, so that the check isn't repeated each time a shader is compiled.
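Roughly, the relocated check looks like this (a sketch only; the surrounding method name and exact initialization in metal_device_properties.mm may differ):

// Sketch of the once-per-device check. MTLArgumentBuffersTier1/2 and
// argumentBuffersSupport are real Metal API; the init_features method name and
// the mapping of the env override onto max_buffers_tier are assumptions.
void MetalDeviceProperties::init_features(id<MTLDevice> p_device) {
	features.argument_buffers_tier = p_device.argumentBuffersSupport;

	// GODOT_DISABLE_ARGUMENT_BUFFERS=1 forces the non-argument-buffer path,
	// so shader compilation only ever needs to look at max_buffers_tier.
	if (OS::get_singleton()->get_environment(U"GODOT_DISABLE_ARGUMENT_BUFFERS") == U"1") {
		features.max_buffers_tier = MTLArgumentBuffersTier1;
	} else {
		features.max_buffers_tier = features.argument_buffers_tier;
	}
}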
 render.dirty.set_flag(RenderState::DIRTY_UNIFORMS);
 render.uniform_set_mask |= UINT64_MAX; // We don't know which ones are dirty. Potentially all of them.
-render.dynamic_offsets_mask = p_dynamic_offsets;
+render.dynamic_offsets |= p_dynamic_offsets;
Multiple calls to render_bind_uniform_sets may occur before the draw call, so we need to merge the value of p_dynamic_offsets. Each nibble refers to the frame_index captured at bind time, which you added in your recent commit.
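For reference, this is roughly the nibble layout I mean (field names and widths here are illustrative, not lifted from the patch):

#include <cstdint>

// Illustrative only: one 4-bit frame_index per dynamic-buffer slot (up to 8
// slots), merged with |= because a set may be bound several times before the
// draw. The real field names/types in the driver may differ.
uint32_t pack_dynamic_offset_nibble(uint32_t p_slot, uint32_t p_frame_index) {
	return (p_frame_index & 0xFu) << (p_slot * 4u);
}

// At bind time (possibly called multiple times before the draw):
//   render.dynamic_offsets |= pack_dynamic_offset_nibble(slot, frame_index);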
 }
 
-if (render.uniform_sets[index] != set) {
+if (render.uniform_sets[index] != set || layout.get_count(index) > 0) {
layout.get_count(index) returns how many dynamic buffers are used for index (the uniform set index). If it is > 0, then we need to mark the uniform set as dirty.
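As a rough illustration of the bookkeeping behind layout.get_count(index) (the actual layout type in the driver is different; this is only a sketch):

#include <cstdint>

// Sketch only; the real uniform-set layout type in the Metal driver differs.
static constexpr uint32_t MAX_UNIFORM_SETS = 16; // assumed limit for the sketch

struct SetLayoutSketch {
	uint8_t dynamic_count[MAX_UNIFORM_SETS] = {}; // dynamic buffers per set index

	uint32_t get_count(uint32_t p_set_index) const {
		return dynamic_count[p_set_index];
	}
};

// A set that contains any dynamic buffers must always be re-bound so that
// fresh offsets are applied, hence:
//   if (render.uniform_sets[index] != set || layout.get_count(index) > 0) { ... }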
if (bs.is_dynamic()) {
	update_dynamic_uniforms(p_shader, p_resource_usage, p_set_index, bs, p_dynamic_offsets, p_frame_idx);
}
This implementation is more efficient, as it only updates the offsets for dynamic buffers. A single large MTLBuffer is allocated to store the argument tables, sized as rendering/rendering_device/vsync/frame_queue_size * argument_table_size. The previous approach would allocate a separate MTLBuffer for every combination of frame indexes. For example, if rendering/rendering_device/vsync/frame_queue_size is 2 and there are 3 dynamic buffers in the argument table, that would allocate and copy 8 distinct MTLBuffer instances. If the frame_queue_size is 3, it would result in 27 separate buffer allocations and distinct argument tables. This approach limits it to a single buffer of only frame_queue_size * argument_table_size vs (for example) 27 * argument_table_size.
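In code, the scheme amounts to something like the following (a sketch; the function names and storage mode are illustrative, only the sizing matches the description above):

#import <Metal/Metal.h>

// Sketch only: allocate a single shared buffer that holds one argument table
// per in-flight frame, then index it by frame.
id<MTLBuffer> create_argument_tables(id<MTLDevice> p_device, NSUInteger p_frame_queue_size, NSUInteger p_table_size) {
	return [p_device newBufferWithLength:p_frame_queue_size * p_table_size
	                             options:MTLResourceStorageModeShared];
}

NSUInteger argument_table_offset(NSUInteger p_frame_idx, NSUInteger p_table_size) {
	// Each in-flight frame writes its dynamic offsets into its own slice.
	return p_frame_idx * p_table_size;
}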
The pathological case would be the maximum number of dynamic buffers (8) and max frame_queue_size (3), which would result in up to 3^8 (6,561) individual allocations and associated bookkeeping.
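A quick back-of-the-envelope comparison of the two strategies (illustrative arithmetic only):

#include <cstdio>

int main() {
	const int frame_queue_size = 3;
	const int max_dynamic_buffers = 8;

	// Previous approach: one argument table per combination of frame indexes,
	// i.e. frame_queue_size ^ max_dynamic_buffers.
	long long combinations = 1;
	for (int i = 0; i < max_dynamic_buffers; i++) {
		combinations *= frame_queue_size;
	}

	// Single-buffer approach: one argument table per in-flight frame.
	const int tables = frame_queue_size;

	std::printf("old: %lld tables, new: %d tables\n", combinations, tables); // old: 6561, new: 3
	return 0;
}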
Hi Stuart! Sorry for getting back to you so late. I looked into your solution, but it reintroduces one of the 2 original bugs that my changes tried to address. It's summarized by your explanation:
That is the problem. We can't have just 3. We might need more than 3.
Note that treating the edge case as an edge case is fine, e.g. using a temporary scratch buffer to handle these edge cases is ok.
@darksylinc no worries!
I don't understand why or how we can need more than 3. Given our frame synchronization, which occurs after executing every frame, it won't add a 4th or 5th combination in flight once it reaches 3. What am I missing as to how more than 3 combinations of argument tables could be in flight? Once the GPU has executed the frame, that argument table buffer is no longer needed. Note that Metal will still increase the buffer size if we add support for a greater number of in-flight frames in the future.
Short version: Because we might send different frame_idx values during the same frame. Example:
RID buffer;
RID uniform_set;

void init()
{
	buffer = rd->buffer_create( ... ); // persistent
	uniform_set = rd->uniform_set_create( ... ); // contains "buffer" somewhere
}

// Gets called once per frame.
void frame_iterate()
{
	if( frame > 0u ) // frame is in range [0; inf)
	{
		DrawListID draw_list = rd->draw_list_begin();
		// We're using uniform_set without uploading anything, i.e. it contains the data from previous frames.
		// This is a code smell and probably a bad idea, but it's not invalid.
		rd->draw_list_bind_uniform_set(draw_list, uniform_set, 0);
		rd->draw_list_draw(draw_list, ..., instance_count);
		rd->draw_list_end();
	}

	uint8_t *data = rd->buffer_persistent_map_advance(buffer);
	memcpy( data, src, size );

	DrawListID draw_list = rd->draw_list_begin();
	rd->draw_list_bind_uniform_set(draw_list, uniform_set, 0);
	rd->draw_list_draw(draw_list, ..., instance_count);
	rd->draw_list_end();
}

Now this edge case is quite special:
Your Metal code will write the offsets for the last frame_idx that was bound, so both draws end up reading those values.

Sidenote: thus the CPU, while preparing for the 2nd submit, could potentially write to an argument buffer (I'm not just talking about persistent bit / dynamic buffers) while the GPU is also reading from that same buffer for the 1st submit. I don't know how Apple HW would respond to that race condition when the CPU is overwriting bytes with the same value as before (hopefully nothing happens at all; but at worst some alignment issue kicks in, the write from the CPU is not atomic, and it corrupts the argument buffer the GPU is trying to read).
I see: in a single submission, the argument buffer could be mutated multiple times before the command buffer is submitted to the GPU. Note that it won't be mutated whilst the GPU is using it; however, as you said, it will refer to the last dynamic buffer, which may be wrong if your scenario is encountered. I'll look at some options for that, and perhaps treat it as an edge case and use a temporary scratch buffer, along the lines of the sketch below.

Thanks for the explanation, @darksylinc!
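For what it's worth, the scratch-buffer fallback I have in mind would look roughly like this (purely illustrative; none of these names come from the actual patch):

#import <Metal/Metal.h>
#include <cstring>

// Illustrative only: copy the current argument table into a fresh scratch
// buffer so the table already referenced by an earlier draw is never mutated.
id<MTLBuffer> make_scratch_argument_table(id<MTLDevice> p_device, const void *p_base_table, NSUInteger p_table_size) {
	id<MTLBuffer> scratch = [p_device newBufferWithLength:p_table_size
	                                              options:MTLResourceStorageModeShared];
	memcpy(scratch.contents, p_base_table, p_table_size);
	// The caller would then patch only the dynamic offsets for the new
	// frame_idx and bind this scratch table for the subsequent draw.
	return scratch;
}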
Resolves all issues under Metal.
Note that I needed to add a couple of extra arguments to the uniform_sets_get_dynamic_offsets API, so that Metal can access the related shader and determine the layout of the offsets.

In summary: