
Potential performance issues raised by Best Practice and Performance validation layers #1523

@AnyOldName3

Description

To try to discover why vkBeginCommandBuffer is taking much longer than I consider reasonable, I ran vsgviewer with some of the optional extra validation layers enabled. Here are some of the things they reported:

BestPractices-deprecated-extension(WARN / SPEC): msgNum: -628989766 - Validation Warning: [ BestPractices-deprecated-extension ] | MessageID = 0xda8260ba | vkCreateInstance(): Attempting to enable deprecated extension VK_KHR_get_physical_device_properties2, but this extension has been promoted to 1.1.0 (0x00401000).

Because the Vulkan version from the window traits, combined with what vkEnumerateInstanceVersion reports, can be raised to a version in which VK_KHR_get_physical_device_properties2 has been deprecated and replaced by core functionality, WindowTraits::defaults should check that the version is lower than 1.1.0 before adding the extension to the list. I don't see any way this could affect performance with sane drivers, though.
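
A minimal sketch of the check I mean, assuming the traits carry the requested API version and a plain list of instance extension names (the function and parameter names here are made up for illustration, not the real WindowTraits::defaults signature):

```cpp
#include <vulkan/vulkan.h>

#include <vector>

// Only request VK_KHR_get_physical_device_properties2 when the requested
// instance version predates Vulkan 1.1, where the extension was promoted to core.
// (Hypothetical helper for illustration only.)
void addGetPhysicalDeviceProperties2IfNeeded(uint32_t requestedApiVersion,
                                             std::vector<const char*>& instanceExtensionNames)
{
    if (requestedApiVersion < VK_API_VERSION_1_1)
    {
        instanceExtensionNames.push_back(VK_KHR_GET_PHYSICAL_DEVICE_PROPERTIES_2_EXTENSION_NAME);
    }
}
```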

BestPractices-vkCreateCommandPool-command-buffer-reset(WARN / PERF): msgNum: 141128897 - Validation Performance Warning: [ BestPractices-vkCreateCommandPool-command-buffer-reset ] | MessageID = 0x86974c1 | vkCreateCommandPool(): pCreateInfo->flags has VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT set. Consider resetting entire pool instead.

This looks pretty self-explanatory. It seems some drivers are happier with transient command buffers that are allocated from the pool, used once, then deleted, and happier still with command buffers that are allocated from the pool, used once, then reallocated behind the scenes by resetting their pool. Obviously, this can mean one pool might need splitting up if it was used to allocate command buffers protected by separate fences. At first glance, this seems to be pretty likely to be related to the problem that prompted me to investigate in the first place.
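
A rough sketch of the pool-level pattern the warning is nudging towards, assuming each pool's command buffers are all guarded by a single fence (the helper names are just illustrative, not VSG API):

```cpp
#include <vulkan/vulkan.h>

#include <cstdint>

// Omit VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT and recycle all command
// buffers at once with vkResetCommandPool instead of resetting them individually.
VkCommandPool createPerFrameCommandPool(VkDevice device, uint32_t queueFamilyIndex)
{
    VkCommandPoolCreateInfo createInfo{};
    createInfo.sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;
    createInfo.queueFamilyIndex = queueFamilyIndex;
    createInfo.flags = VK_COMMAND_POOL_CREATE_TRANSIENT_BIT; // no per-buffer reset bit

    VkCommandPool pool = VK_NULL_HANDLE;
    vkCreateCommandPool(device, &createInfo, nullptr, &pool);
    return pool;
}

void recycleCommandPool(VkDevice device, VkCommandPool pool, VkFence lastSubmitFence)
{
    // Wait until the GPU has finished with every command buffer from this pool...
    vkWaitForFences(device, 1, &lastSubmitFence, VK_TRUE, UINT64_MAX);
    // ...then reset the whole pool in one go.
    vkResetCommandPool(device, pool, 0);
}
```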

BestPractices-AMD-CreatePipelinesLayout-KeepLayoutSmall(WARN / PERF): msgNum: -2011419003 - Validation Performance Warning: [ BestPractices-AMD-CreatePipelinesLayout-KeepLayoutSmall ] | MessageID = 0x881c2e85 | vkCreatePipelineLayout(): [AMD] pipeline layout size is too large. Prefer smaller pipeline layouts.Descriptor sets cost 1 DWORD each. Dynamic buffers cost 2 DWORDs each when robust buffer access is OFF. Dynamic buffers cost 4 DWORDs each when robust buffer access is ON. Push constants cost 1 DWORD per 4 bytes in the Push constant range. 

The pipeline layout this is logged in relation to only has two descriptor set layouts, but it also has the full 128-byte push constant range used for both the view and projection matrices, so the pipeline layout is 2 + 32 = 34 DWORDs in size. It seems like moving the projection matrix to the view-dependent descriptor set, as has previously been suggested, might be beneficial for performance.
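
For illustration, a sketch of what a trimmed layout could look like if the projection matrix did move into the view-dependent descriptor set, leaving only one mat4 in the push constant range (the names and stage flags here are assumptions, not the actual VSG setup):

```cpp
#include <vulkan/vulkan.h>

// With only the modelview matrix pushed, the range shrinks to one 4x4 matrix
// (64 bytes = 16 DWORDs), taking the layout from 2 + 32 to 2 + 16 DWORDs.
VkPipelineLayout createSmallerPipelineLayout(VkDevice device,
                                             VkDescriptorSetLayout setLayouts[2])
{
    VkPushConstantRange pushConstantRange{};
    pushConstantRange.stageFlags = VK_SHADER_STAGE_VERTEX_BIT;
    pushConstantRange.offset = 0;
    pushConstantRange.size = 64; // one mat4 instead of two

    VkPipelineLayoutCreateInfo layoutInfo{};
    layoutInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO;
    layoutInfo.setLayoutCount = 2;
    layoutInfo.pSetLayouts = setLayouts;
    layoutInfo.pushConstantRangeCount = 1;
    layoutInfo.pPushConstantRanges = &pushConstantRange;

    VkPipelineLayout layout = VK_NULL_HANDLE;
    vkCreatePipelineLayout(device, &layoutInfo, nullptr, &layout);
    return layout;
}
```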

BestPractices-PipelineBarrier-readToReadBarrier(WARN / PERF): msgNum: 49690623 - Validation Performance Warning: [ BestPractices-PipelineBarrier-readToReadBarrier ] Object 0: handle = 0x2a1426c9ba0, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0x2f637ff | vkCmdPipelineBarrier(): [AMD] [NVIDIA] Don't issue read-to-read barriers. Get the resource in the right state the first time you use it.
    Objects: 1
        [0] 0x2a1426c9ba0, type: 6, name: NULL

This is triggered by the second vkCmdPipelineBarrier call in the mipmap-generation loop of vsg::transferImageData(ref_ptr<ImageView> imageView, VkImageLayout targetImageLayout, Data::Properties properties, uint32_t width, uint32_t height, uint32_t depth, uint32_t mipLevels, const Data::MipmapOffsets& mipmapOffsets, ref_ptr<Buffer> stagingBuffer, VkDeviceSize stagingBufferOffset, VkCommandBuffer commandBuffer, vsg::Device* device). I'm a little confused as to why, as the code's pretty similar to the equivalent part of vulkan-tutorial.com's mipmapping tutorial.
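
For context, here's a condensed vulkan-tutorial.com-style version of that loop (not the actual VSG code); as far as I can tell it's the second barrier's TRANSFER_READ-to-SHADER_READ transition that gets reported as read-to-read:

```cpp
#include <vulkan/vulkan.h>

#include <cstdint>

// Condensed illustration of the usual mipmap loop, just to show which of the
// two barriers the layer objects to (the blit itself is omitted).
void generateMipmaps(VkCommandBuffer commandBuffer, VkImage image, uint32_t mipLevels)
{
    VkImageMemoryBarrier barrier{};
    barrier.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
    barrier.image = image;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
    barrier.subresourceRange.baseArrayLayer = 0;
    barrier.subresourceRange.layerCount = 1;
    barrier.subresourceRange.levelCount = 1;

    for (uint32_t i = 1; i < mipLevels; ++i)
    {
        // First barrier: level i-1 goes TRANSFER_DST -> TRANSFER_SRC
        // (write -> read), which the layer is fine with.
        barrier.subresourceRange.baseMipLevel = i - 1;
        barrier.oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
        barrier.newLayout = VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL;
        barrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
        barrier.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;
        vkCmdPipelineBarrier(commandBuffer, VK_PIPELINE_STAGE_TRANSFER_BIT,
                             VK_PIPELINE_STAGE_TRANSFER_BIT, 0,
                             0, nullptr, 0, nullptr, 1, &barrier);

        // ... vkCmdBlitImage from level i-1 into level i goes here ...

        // Second barrier: level i-1 goes TRANSFER_SRC -> SHADER_READ_ONLY.
        // srcAccessMask (TRANSFER_READ) and dstAccessMask (SHADER_READ) are
        // both reads, so best-practices reports it as a read-to-read barrier.
        barrier.oldLayout = VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL;
        barrier.newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
        barrier.srcAccessMask = VK_ACCESS_TRANSFER_READ_BIT;
        barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
        vkCmdPipelineBarrier(commandBuffer, VK_PIPELINE_STAGE_TRANSFER_BIT,
                             VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT, 0,
                             0, nullptr, 0, nullptr, 1, &barrier);
    }
}
```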

BestPractices-SyncObjects-HighNumberOfFences(WARN / PERF): msgNum: -1443561624 - Validation Performance Warning: [ BestPractices-SyncObjects-HighNumberOfFences ] | MessageID = 0xa9f4ff68 | vkCreateFence(): [AMD] [NVIDIA] High number of VkFence objects created. 4 created, but recommended max is 3. Minimize the amount of CPU-GPU synchronization that is used. Each fence has a CPU and GPU overhead cost with it.

Again, this seems fairly self-explanatory.

BestPractices-pipeline-stage-flags-compute(WARN / SPEC): msgNum: -107494158 - Validation Warning: [ BestPractices-pipeline-stage-flags-compute ] Object 0: handle = 0x2a141f75030, type = VK_OBJECT_TYPE_QUEUE; | MessageID = 0xf997c4f2 | vkQueueSubmit(): pSubmits[0].pWaitDstStageMask[0] using VK_PIPELINE_STAGE_ALL_COMMANDS_BIT
    Objects: 1
        [0] 0x2a141f75030, type: 4, name: NULL

I think it's saying we shouldn't make every pipeline stage wait on the semaphore, just the stages that absolutely have to.
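
As a rough illustration of what it's asking for, using the common swapchain-acquire case as an assumed example rather than the exact VSG submit path:

```cpp
#include <vulkan/vulkan.h>

// Wait on the image-available semaphore only at the colour attachment output
// stage rather than VK_PIPELINE_STAGE_ALL_COMMANDS_BIT, so earlier stages can start.
void submitWithNarrowWaitStage(VkQueue queue, VkCommandBuffer commandBuffer,
                               VkSemaphore imageAvailable, VkSemaphore renderFinished,
                               VkFence fence)
{
    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;

    VkSubmitInfo submitInfo{};
    submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submitInfo.waitSemaphoreCount = 1;
    submitInfo.pWaitSemaphores = &imageAvailable;
    submitInfo.pWaitDstStageMask = &waitStage;
    submitInfo.commandBufferCount = 1;
    submitInfo.pCommandBuffers = &commandBuffer;
    submitInfo.signalSemaphoreCount = 1;
    submitInfo.pSignalSemaphores = &renderFinished;

    vkQueueSubmit(queue, 1, &submitInfo, fence);
}
```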

As far as I can tell, all of these are things that will affect any VSG-based application running on drivers that care about these usage patterns.
