
Advanced buffer/texture update mechanism #572

@IAmNotHanni

Description


Is your feature request related to a problem?

Currently, we update buffers and textures with simple update code in which pipeline barriers are not really batched, and neither are cache flushes. We can reorder the update code and also account for the frames in flight to achieve much better performance.

Description

In a simple picture, we currently update buffers like this:

for each render module
   for each buffer in the current render module
      if(update_is_required)
         destroy_buffer()
         create_buffer(create as MAPPED) // with the new size, maybe even the type changes(?)
         if(memory is HOST_VISIBLE)
            // The allocation ended up in mappable memory and is already mapped!
            // Update mapped memory simply by using memcpy
            memcpy()
            if(memory is not HOST_COHERENT)
               flush_cache(buffer) // only required if caches are not flushed automatically (=HOST_COHERENT)
            // NOTE: If we supported readback from gpu to cpu, we would need invalidate_mapped_memory_ranges here!
            pipeline_barrier(buffer) // Wait for copy operation to be finished
         else // not mappable memory
            destroy_staging_buffer(); // Every buffer has a staging buffer associated with it
            create_staging_buffer(create as MAPPED);
            // Copy the data into the staging buffer
            memcpy()
            if(staging buffer memory is not HOST_COHERENT)
               flush_cache(staging_buffer); // only required if caches are not flushed automatically (=HOST_COHERENT)
            pipeline_barrier(staging_buffer) // Wait for copy operation into the staging buffer to be finished
            // we already have a command buffer in recording state here btw
            vkCmdCopyBuffer(staging_buffer, buffer) // Copy from staging buffer into the actual buffer
            pipeline_barrier(buffer) // Wait for copy operation from staging buffer into the actual buffer to finish
            // The staging buffer must stay valid until the command buffer has been submitted, it will be destroyed in next iteration automatically

A similar update mechanism is used for textures, but the main difference is that they always require a staging buffer and a vkCmdCopyBufferToImage command, together with additional barriers for image layout transitions.

How to improve this?

General strategy: We should aim for batching calls to vkCmdPipelineBarrier as much as possible, and we should also batch calls to vkFlushMappedMemoryRanges. Note that we only need to flush mapped memory ranges if we write from cpu to gpu and the memory is not HOST_COHERENT. Furthermore, if we implemented readback from gpu to cpu, we would also need a call to vkInvalidateMappedMemoryRanges! We don't support readback from gpu to cpu currently.

Note that the first step of any buffer or texture update involves a memcpy(): either the (buffer) memory is HOST_VISIBLE and can be updated through memcpy() directly, or we need to create and fill a staging buffer for the buffer or texture update. This means we can loop through all buffers and textures, create them, and perform the memcpy() for each one, while storing the data required for a pipeline barrier after the memcpy() along with the data required for vkFlushMappedMemoryRanges (only needed if the memory is not HOST_COHERENT).

After this loop (for both buffers and textures), we can place one batched call to vkCmdPipelineBarrier and one batched call to vkFlushMappedMemoryRanges.

For buffers which are HOST_VISIBLE, the update is already finished at that stage. We now need to focus on the buffers and textures which require a copy command. The buffers need one pipeline barrier after the vkCmdCopyBuffer, and the textures need two for the image layout transitions before and after calling vkCmdCopyBufferToImage.

From what I understand, we can batch the buffer memory barriers for the buffers after vkCmdCopyBuffer together with the image layout transition barriers before vkCmdCopyBufferToImage, since a single call to vkCmdPipelineBarrier accepts both buffer and image memory barriers. However, we can't batch all 3 barriers into one call of vkCmdPipelineBarrier, because the transition after vkCmdCopyBufferToImage must be recorded after that copy command (I might be wrong about this!).

This means we need to place one or two calls to vkCmdPipelineBarrier towards the end here. In total, we have batched all calls to vkCmdPipelineBarrier into 3 (or only 2?) calls, and we batched all calls to vkFlushMappedMemoryRanges into only one call! This should significantly improve performance.

The final code should look something like this:

vector<PipelineBarrier> batch1
vector<MappedMemoryRange> ranges1
for every rendermodule
   for every buffer in the current rendermodule
      if(update_is_required)
         destroy_buffer()
         create_buffer()
         if(memory is HOST_VISIBLE)
            memcpy()
            // NOTE: If we supported readback from gpu to cpu, we would need invalidate_mapped_memory_ranges here!
            ranges1.add(buffer_range)
            batch1.add(buffer_barrier)
         else
            destroy_staging_buffer()
            create_staging_buffer()
            memcpy()
            ranges1.add(staging_buffer_range)
            batch1.add(staging_buffer_barrier)

   for every texture in the current rendermodule
      if(update_is_required)
         destroy_staging_buffer()
         create_staging_buffer()
         memcpy()
         ranges1.add(staging_buffer_range)
         batch1.add(staging_buffer_barrier) // This is really the image layout transition barrier to transfer dst

// Both are batched for all buffers and textures in all rendermodules, so this should be performant!
vkFlushMappedMemoryRanges(ranges1)
vkCmdPipelineBarriers(batch1)

vector<PipelineBarrier> batch2
for every rendermodule
   for every buffer in the current rendermodule
      if(update_is_required)
         if(not HOST_VISIBLE) // TODO: We should store the indices of buffers which require update this way earlier already...
            // This is where we left off, we created the staging buffer
            vkCmdCopyBuffer(staging_buffer, buffer)
            batch2.add(buffer_memory_barrier)
   
   for every texture in the current rendermodule
      if(update_is_required) // TODO: Remember earlier which textures need an update this way, store indices?
         vkCmdCopyBufferToImage(staging_buffer, image)
         batch2.add(image_memory_barrier) // Image layout transition to shader read optimal

// Another batched call, should be very performant
vkCmdPipelineBarriers(batch2)

How does this connect to the frames in flight?

  • The create_buffer method of the buffer wrapper (and similar code in the texture wrapper) should get the current frame in flight index to access the correct buffer in the array for the current frame in flight. This all happens automatically; not even rendergraph (or external code) should need to worry about it.

Alternatives

If we keep the update mechanism as it is, we place a lot more barriers than needed.

Affected Code

The rendergraph code for buffer and texture management

Operating System

All

Additional Context

None
