Is your feature request related to a problem?
Currently, we update buffers and textures with simple update code in which neither the pipeline barriers nor the cache flushes are really batched. We can reorder the update code and also account for the frames in flight to achieve much better performance.
Description
In a simple picture, we currently update buffers like this:
for each render module
    for each buffer in the current render module
        if (update_is_required)
            destroy_buffer()
            create_buffer(create as MAPPED) // with the new size, maybe even the type changes(?)
            if (memory is HOST_VISIBLE)
                // The allocation ended up in mappable memory and is already mapped!
                // Update mapped memory simply by using memcpy
                memcpy()
                if (memory is not HOST_COHERENT)
                    flush_cache(buffer) // only required if caches are not flushed automatically (=HOST_COHERENT)
                // NOTE: If we supported readback from gpu to cpu, we would need invalidate_mapped_memory_ranges here!
                pipeline_barrier(buffer) // Wait for the copy operation to be finished
            else // not mappable memory
                destroy_staging_buffer() // Every buffer has a staging buffer associated with it
                create_staging_buffer(create as MAPPED)
                // Copy the data into the staging buffer
                memcpy()
                if (staging buffer memory is not HOST_COHERENT)
                    flush_cache(staging_buffer) // only required if caches are not flushed automatically (=HOST_COHERENT)
                pipeline_barrier(staging_buffer) // Wait for the copy operation into the staging buffer to be finished
                // We already have a command buffer in recording state here btw
                vkCmdCopyBuffer(staging_buffer, buffer) // Copy from the staging buffer into the actual buffer
                pipeline_barrier(buffer) // Wait for the copy from the staging buffer into the actual buffer to finish
                // The staging buffer must stay valid until the command buffer has been submitted;
                // it will be destroyed in the next iteration automatically
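To make the per-resource cost concrete, here is a minimal C++ sketch of the HOST_VISIBLE but non-HOST_COHERENT path; the function name and all parameters are assumptions for illustration, not names from our code:

```cpp
#include <cstring>
#include <vulkan/vulkan.h>

void update_mappable_buffer(VkDevice device, VkCommandBuffer cmd_buf, VkBuffer buffer,
                            VkDeviceMemory memory, void *mapped, const void *data, VkDeviceSize size) {
    std::memcpy(mapped, data, static_cast<std::size_t>(size));

    // One flush per buffer: this is exactly what we want to batch
    VkMappedMemoryRange range{};
    range.sType = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE;
    range.memory = memory;
    range.offset = 0;
    range.size = VK_WHOLE_SIZE;
    vkFlushMappedMemoryRanges(device, 1, &range);

    // One pipeline barrier per buffer: this is what we want to batch as well
    VkBufferMemoryBarrier barrier{};
    barrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_HOST_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT; // depends on how the buffer is used
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.buffer = buffer;
    barrier.offset = 0;
    barrier.size = VK_WHOLE_SIZE;
    vkCmdPipelineBarrier(cmd_buf, VK_PIPELINE_STAGE_HOST_BIT, VK_PIPELINE_STAGE_VERTEX_SHADER_BIT, 0,
                         0, nullptr, 1, &barrier, 0, nullptr);
}
```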
A similar update mechanism is used for textures, but the main difference is that they always require a staging buffer and a vkCmdCopyBufferToImage command, together with additional barriers for the image layout transitions.
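For reference, the per-texture path boils down to the following Vulkan calls (a sketch; cmd_buf, staging_buffer, image and the extent are assumed parameters):

```cpp
#include <cstdint>
#include <vulkan/vulkan.h>

void upload_texture(VkCommandBuffer cmd_buf, VkBuffer staging_buffer, VkImage image,
                    std::uint32_t width, std::uint32_t height) {
    // Barrier 1: transition the image to TRANSFER_DST_OPTIMAL before the copy
    VkImageMemoryBarrier to_transfer_dst{};
    to_transfer_dst.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
    to_transfer_dst.srcAccessMask = 0;
    to_transfer_dst.dstAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
    to_transfer_dst.oldLayout = VK_IMAGE_LAYOUT_UNDEFINED;
    to_transfer_dst.newLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
    to_transfer_dst.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    to_transfer_dst.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    to_transfer_dst.image = image;
    to_transfer_dst.subresourceRange = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1};
    vkCmdPipelineBarrier(cmd_buf, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, VK_PIPELINE_STAGE_TRANSFER_BIT, 0,
                         0, nullptr, 0, nullptr, 1, &to_transfer_dst);

    // The actual copy from the staging buffer into the image
    VkBufferImageCopy region{};
    region.imageSubresource = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 0, 1};
    region.imageExtent = {width, height, 1};
    vkCmdCopyBufferToImage(cmd_buf, staging_buffer, image, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, 1, &region);

    // Barrier 2: transition to SHADER_READ_ONLY_OPTIMAL after the copy
    VkImageMemoryBarrier to_shader_read = to_transfer_dst;
    to_shader_read.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
    to_shader_read.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
    to_shader_read.oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
    to_shader_read.newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
    vkCmdPipelineBarrier(cmd_buf, VK_PIPELINE_STAGE_TRANSFER_BIT, VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT, 0,
                         0, nullptr, 0, nullptr, 1, &to_shader_read);
}
```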
How to improve this?
General strategy: We should aim to batch calls to vkCmdPipelineBarrier as much as possible, and we should also batch calls to vkFlushMappedMemoryRanges. Note that we only need to flush mapped memory ranges if we write from cpu to gpu and the memory is not HOST_COHERENT. Furthermore, if we implemented readback from gpu to cpu, we would also need vkInvalidateMappedMemoryRanges! We don't support readback from gpu to cpu currently.
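A minimal sketch of what a batched flush could look like; the DirtyAllocation bookkeeping type and its fields are hypothetical:

```cpp
#include <cstdint>
#include <vector>
#include <vulkan/vulkan.h>

struct DirtyAllocation { // hypothetical bookkeeping entry for a non-coherent host write
    VkDeviceMemory memory;
    VkDeviceSize offset; // must respect VkPhysicalDeviceLimits::nonCoherentAtomSize
    VkDeviceSize size;   // or VK_WHOLE_SIZE
};

void flush_all(VkDevice device, const std::vector<DirtyAllocation> &dirty_allocations) {
    std::vector<VkMappedMemoryRange> ranges;
    ranges.reserve(dirty_allocations.size());
    for (const auto &alloc : dirty_allocations) {
        VkMappedMemoryRange range{};
        range.sType = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE;
        range.memory = alloc.memory;
        range.offset = alloc.offset;
        range.size = alloc.size;
        ranges.push_back(range);
    }
    if (ranges.empty()) {
        return; // nothing to flush this frame
    }
    // One call for all non-coherent writes of this update pass
    vkFlushMappedMemoryRanges(device, static_cast<std::uint32_t>(ranges.size()), ranges.data());
}
```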
Note that the first step of any buffer or texture update involves a memcpy(), either because the (buffer) memory is HOST_VISIBLE and can be updated through memcpy() directly, or because we need to create and fill a staging buffer for the buffer or texture update. This means we can loop through all buffers and textures, create them, perform the memcpy() for each one, and store the data required for the pipeline barrier after the memcpy, along with the data required for vkFlushMappedMemoryRanges (only needed if the memory is not HOST_COHERENT).
After this loop (for both buffers and textures), we can place one batched call to vkCmdPipelineBarrier and one batched call to vkFlushMappedMemoryRanges.
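The collect-then-record pattern could look like this sketch; both function names and the chosen access/stage masks are assumptions (the masks depend on how each resource is actually used):

```cpp
#include <cstdint>
#include <vector>
#include <vulkan/vulkan.h>

// Collect instead of record: called once per updated staging buffer inside the loop
void collect_host_write_barrier(std::vector<VkBufferMemoryBarrier> &batch, VkBuffer staging_buffer) {
    VkBufferMemoryBarrier barrier{};
    barrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_HOST_WRITE_BIT;    // the memcpy into mapped memory
    barrier.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT; // the upcoming vkCmdCopyBuffer
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.buffer = staging_buffer;
    barrier.offset = 0;
    barrier.size = VK_WHOLE_SIZE;
    batch.push_back(barrier);
}

// Called once after the loop: a single pipeline barrier for every collected entry
void submit_batched_barrier(VkCommandBuffer cmd_buf, const std::vector<VkBufferMemoryBarrier> &batch) {
    if (batch.empty()) {
        return;
    }
    vkCmdPipelineBarrier(cmd_buf, VK_PIPELINE_STAGE_HOST_BIT, VK_PIPELINE_STAGE_TRANSFER_BIT, 0,
                         0, nullptr, static_cast<std::uint32_t>(batch.size()), batch.data(), 0, nullptr);
}
```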
For buffers which are HOST_VISIBLE, the update is already finished at that stage. We now need to focus on the buffers and textures which require a copy command. The buffers need one pipeline barrier after the vkCmdCopyBuffer, and the textures need two barriers for the image layout transitions before and after calling vkCmdCopyBufferToImage.
From what I understand, we can batch the buffer memory barriers for the buffers after the vkCmdCopyBuffer together with the image layout transition barriers before vkCmdCopyBufferToImage, but I guess we can't batch all 3 barriers into one call of vkCmdPipelineBarrier, because a barrier only orders the commands recorded before it against the commands recorded after it, so one call can't sit both before and after the copy commands (I might be wrong about this!).
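For what it's worth, the API itself does allow mixing barrier types: a single vkCmdPipelineBarrier call takes separate arrays of buffer and image barriers. So the barriers after the buffer copies and the transitions before the image copies can share one call, roughly like this (the stage masks are assumptions and depend on where the buffers are first read):

```cpp
#include <cstdint>
#include <vector>
#include <vulkan/vulkan.h>

// Recorded after all vkCmdCopyBuffer calls and before all vkCmdCopyBufferToImage calls:
// one call carrying both the post-copy buffer barriers and the TRANSFER_DST transitions
void middle_barrier(VkCommandBuffer cmd_buf,
                    const std::vector<VkBufferMemoryBarrier> &post_copy_buffer_barriers,
                    const std::vector<VkImageMemoryBarrier> &pre_copy_image_barriers) {
    vkCmdPipelineBarrier(cmd_buf,
                         VK_PIPELINE_STAGE_TRANSFER_BIT, // the buffer copies recorded before this call
                         VK_PIPELINE_STAGE_TRANSFER_BIT | VK_PIPELINE_STAGE_VERTEX_SHADER_BIT, // image copies and first buffer reads after it
                         0, 0, nullptr,
                         static_cast<std::uint32_t>(post_copy_buffer_barriers.size()), post_copy_buffer_barriers.data(),
                         static_cast<std::uint32_t>(pre_copy_image_barriers.size()), pre_copy_image_barriers.data());
}
```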
This means we need to place one or two calls to vkCmdPipelineBarrier towards the end. In total, we have batched all calls to vkCmdPipelineBarrier into 3 (or only 2?) calls, and we have batched all calls to vkFlushMappedMemoryRanges into only one call! This should significantly improve performance.
The final code should look something like this:
vector<PipelineBarrier> batch1
vector<MappedMemoryRange> ranges1

for every rendermodule
    for every buffer in the current rendermodule
        if (update_is_required)
            destroy_buffer()
            create_buffer()
            if (memory is HOST_VISIBLE)
                memcpy()
                // NOTE: If we supported readback from gpu to cpu, we would need invalidate_mapped_memory_ranges here!
                ranges1.add(buffer_range)
                batch1.add(buffer_barrier)
            else
                destroy_staging_buffer()
                create_staging_buffer()
                memcpy()
                ranges1.add(staging_buffer_range)
                batch1.add(staging_buffer_barrier)
    for every texture in the current rendermodule
        if (update_is_required)
            destroy_staging_buffer()
            create_staging_buffer()
            memcpy()
            ranges1.add(staging_buffer_range)
            batch1.add(staging_buffer_barrier) // This is really the image layout transition barrier to TRANSFER_DST

// Both are batched for all buffers and textures in all rendermodules, should be performant!
vkFlushMappedMemoryRanges(ranges1)
vkCmdPipelineBarrier(batch1)
vector<PipelineBarrier> batch2

for every rendermodule
    for every buffer in the current rendermodule
        if (update_is_required)
            if (not HOST_VISIBLE) // TODO: We should store the indices of buffers which require an update this way earlier already...
                // This is where we left off: we created the staging buffer
                vkCmdCopyBuffer(staging_buffer, buffer)
                batch2.add(buffer_memory_barrier)
    for every texture in the current rendermodule
        if (update_is_required) // TODO: Remember earlier which textures need an update this way, store indices?
            vkCmdCopyBufferToImage(staging_buffer, image)
            batch2.add(image_memory_barrier) // Image layout transition to shader read optimal

// Another batched call, should be very performant
vkCmdPipelineBarrier(batch2)
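To implement batch1 and batch2, a small helper could collect the Vulkan barrier structs and record them all in one call; BarrierBatch and its interface are hypothetical names, not existing code. The rendergraph would fill one instance during the memcpy loop and a second one during the copy loop:

```cpp
#include <cstdint>
#include <vector>
#include <vulkan/vulkan.h>

// Hypothetical helper backing the batch1/batch2 pseudocode above
struct BarrierBatch {
    std::vector<VkBufferMemoryBarrier> buffer_barriers;
    std::vector<VkImageMemoryBarrier> image_barriers;

    // Record all collected barriers with a single vkCmdPipelineBarrier call
    void submit(VkCommandBuffer cmd_buf, VkPipelineStageFlags src_stages, VkPipelineStageFlags dst_stages) {
        if (buffer_barriers.empty() && image_barriers.empty()) {
            return; // nothing to do this frame
        }
        vkCmdPipelineBarrier(cmd_buf, src_stages, dst_stages, 0, 0, nullptr,
                             static_cast<std::uint32_t>(buffer_barriers.size()), buffer_barriers.data(),
                             static_cast<std::uint32_t>(image_barriers.size()), image_barriers.data());
        buffer_barriers.clear();
        image_barriers.clear();
    }
};
```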
How does this connect to the frames in flight?
- The create_buffer method of the buffer wrapper (and similar code in the texture wrapper) should get the current frame in flight index to access the correct buffer in the array for the current frame in flight. This all happens automatically; not even rendergraph (or external code) should need to worry about it.
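A hypothetical sketch of what that could look like; the class name, MAX_FRAMES_IN_FLIGHT value and accessor are assumptions:

```cpp
#include <array>
#include <cstddef>
#include <vulkan/vulkan.h>

static constexpr std::size_t MAX_FRAMES_IN_FLIGHT = 2; // assumption

// The wrapper owns one VkBuffer per frame in flight, so the cpu can update
// the buffer for frame N while the gpu still reads the one for frame N-1
class BufferResource {
    std::array<VkBuffer, MAX_FRAMES_IN_FLIGHT> m_buffers{};

public:
    // The rendergraph passes the current frame in flight index internally;
    // external code never has to deal with it
    [[nodiscard]] VkBuffer buffer(std::size_t frame_index) const {
        return m_buffers[frame_index % MAX_FRAMES_IN_FLIGHT];
    }
};
```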
Alternatives
If we keep the update mechanism as it is, we place a lot more barriers than needed.
Affected Code
The rendergraph code for buffer and texture management
Operating System
All
Additional Context
None