
Vulkan GPU/CPU synchronization is wrong #80550

Closed
darksylinc opened this issue Aug 12, 2023 · 6 comments
Comments

@darksylinc
Contributor

Godot version

4.2.x master [16a9356]

System information

Godot v4.2.dev (262d1ea) - Ubuntu 20.04.6 LTS (Focal Fossa) - X11 - Vulkan (Forward+) - dedicated AMD Radeon RX 6800 XT - AMD Ryzen 9 5900X 12-Core Processor (24 Threads)

Issue description

"OOOFFF"

That's the only way I can describe this ticket.

It starts as a simple CPU-to-GPU synchronization bug, but there is also another set of wrong assumptions and bugs which, when combined, make synchronization end up being "more or less" correct.

There are so many things going on I don't know where to start, so I'm going to write this in the order in which I discovered these problems.

At best this is just a waste of VRAM; at worst, many of those DEVICE_LOST errors suddenly start making sense.

A lone validation error that shouldn't have happened

It all started with a random validation error saying vkDestroyRenderPass was called on a render pass still in use, while trying to repro #80286 on AMD. It was an error I could never reproduce again.

So I went looking for vkDestroyRenderPass calls in the codebase, and luckily there weren't many.

I'm still evaluating everything to understand what's going on. But it is clear the code is wrong (there's no way to sugar-coat it).

That's when I noticed the function RenderingDeviceVulkan::swap_buffers does the following:

void RenderingDeviceVulkan::swap_buffers() {
	context->swap_buffers();
	frame = (frame + 1) % frame_count;
	_begin_frame();
}

// And inside _begin_frame we call _free_pending_resources( frame )
void RenderingDeviceVulkan::_free_pending_resources(int p_frame) {
	// Free in dependency usage order, so nothing weird happens.
	// Pipelines.
	while (frames[p_frame].render_pipelines_to_dispose_of.front()) {
		RenderPipeline *pipeline = &frames[p_frame].render_pipelines_to_dispose_of.front()->get();

		vkDestroyPipeline(device, pipeline->pipeline, nullptr);

		frames[p_frame].render_pipelines_to_dispose_of.pop_front();
	}

	// ...and lots of other vkDestroy* calls
}

Where's the wait?

That's wrong. The code should've been something along:

context->swap_buffers();
frame = (frame + 1) % frame_count;
vkWaitForFences(device, 1, &fences[frame], VK_TRUE, UINT64_MAX);
_begin_frame();

I was about to shout 'Eureka!', but I went into swap_buffers() just in case there was already a wait there (although that would require some weird math, because it would have to be vkWaitForFences(device, 1, &fences[frame + 1], VK_TRUE, UINT64_MAX)).

That's when it got worse.

An inverted wait

I couldn't find anything in swap_buffers, so I kept wondering: where is the vkWaitForFences call?

I found it in VulkanContext::prepare_buffers. But sadly it performs the following:

vkWaitForFences(device, 1, &fences[frame_index], VK_TRUE, UINT64_MAX);
vkResetFences(device, 1, &fences[frame_index]);
vkQueueSubmit(graphics_queue, 1, &submit_info, fences[frame_index]);

frame_index += 1;
frame_index %= FRAME_LAG;

This is wrong.

In case someone wonders: for data used to communicate between CPU and GPU, we are supposed to have at least 2 pools of memory:

CPU writes to pool 0 while GPU is reading from pool 1, and CPU writes to pool 1 while GPU is reading from pool 0. If the GPU is still reading from pool 0 and CPU wants now to write to pool 0 (it's done writing to 1), it needs to stall and wait.

This is a standard double buffer scheme. A triple buffer scheme is the same, but there is an extra pool to minimize potential stalls.

Unfortunately with Godot's current code the following sequence is what will happen after running for several frames.

Action                          Remarks
vkWaitForFences on 0
vkResetFences on 0
vkQueueSubmit on 0
frame_index = 1;
Fill GPU buffers on 1
vkWaitForFences on 1
vkResetFences on 1
vkQueueSubmit on 1
frame_index = 0;
Fill GPU buffers on 0           Oops!! We never waited to see if fence[0] is done
vkWaitForFences on 0
vkResetFences on 0
vkQueueSubmit on 0

And thus we're not actually synchronizing anything! The CPU keeps writing to data that may be in use by the GPU.

A simple possible solution is the following:

vkResetFences(device, 1, &fences[frame_index]);
vkQueueSubmit(graphics_queue, 1, &submit_info, fences[frame_index]);

frame_index += 1;
frame_index %= FRAME_LAG;
vkWaitForFences(device, 1, &fences[frame_index], VK_TRUE, UINT64_MAX);

For best performance, it's often suggested to delay the vkWaitForFences() as much as possible (i.e. until we actually need to write to a buffer from the CPU, destroy a resource, etc.). But given the chaotic complexity of Godot's code, this seems to be the only safe solution that won't result in more hazards.

Sadly, the report doesn't end here. There is more: Godot doesn't have 2 pools, it's got 4.

Wrong assumptions on Swapchains

RenderingDeviceVulkan::frame_count is set to the following:

// Always need one extra to ensure it's unused at any time, without having to use a fence for this.
frame_count = p_context->get_swapchain_image_count() + 1;

Note: frame_count is almost always 4

That comment is HUGE. This is a false assumption.

Neither vkAcquireNextImageKHR nor vkQueuePresentKHR have an obligation to synchronize the CPU.

They use VkSemaphores, which synchronize GPU work with other GPU work (e.g. so that a window isn't presented to the screen while the GPU is still working on it, producing a similar effect to when VSync is off).

However, in practice most (all???) drivers do stall the CPU due to limitations in SW or HW architecture. Linux+X11 is notorious for stalling inside vkAcquireNextImageKHR, while other drivers prefer to stall inside vkQueuePresentKHR instead.

Since we are debunking false assumptions, we also don't have the guarantee that vkAcquireNextImageKHR will return swapchain indices in order (e.g. 0, 1, 2, 3, 0, 1, 2, 3). It could return them out of order or even return the same set multiple times (e.g. 2, 0, 2, 0, 1).

We could set the value frame_count to as high as 30, and we still won't have the guarantee that CPU and GPU work will be synchronized. We need a fence.

Swapchains should not dictate the number of pools

As I said earlier, with a double buffer scheme we should have 2 pools of memory (FRAME_LAG = 2).

But because of the wrong assumption about swapchains, Godot keeps 4 pools of memory (because frame_count = 4 on most machines).

Swapchain count should not be used for anything other than tracking swapchain-related data like VkRenderPass.

Combine these 2 bugs and it works

So we have 2 competing implementations and 2 bugs:

  • VulkanContext::prepare_buffers uses vkWaitForFences but the calls are out of order
    • This is what's causing the CPU to wait for the GPU.
    • This should be used to track the 2 pools.
  • VulkanContext uses swapchain count to use 4 pools instead of 2.

I've run simulations in a spreadsheet of what happens when we run our current vkWaitForFences code with just 2 values and frame_count = 4, and it would seem that this just happens to synchronize correctly; a happy accident.

Because Godot will stall when starting the 3rd frame, and it keeps 4 pools of data, the data should be safe for access on the 3rd frame.

However, the complexity of all this is too large, so it's possible that I missed something and we may still get a race condition (that lone vkDestroyRenderPass error that sparked this investigation must've come from somewhere!), just a rare one.

I also noticed that if, for some God-forsaken reason, frame_count = 3 or lower, then we will have race conditions.

I analyzed the surrounding code behind the frame_count setup, and this is possible but improbable.
We set desiredNumOfSwapchainImages = 3, and I don't think there are implementations with surfCapabilities.maxImageCount = 2, though I could be wrong. Also, if the driver returns 2 from vkGetSwapchainImagesKHR despite us asking for at least 3 (which, AFAIK, would be a driver bug), we're still in trouble.

However, it would make sense that those users who can repro DEVICE_LOST with high frequency happen to have frame_count <= 3.

I didn't analyze what happens when separate_present_queue == true. In fact, this path is so rare that I don't think it has been tested at all, and we have no way yet to verify whether the bug reports involved it.

Steps to reproduce

This bug is rare to repro, but it should be much easier to trigger if we force frame_count to a value lower than 4.

Minimal reproduction project

None.

@akien-mga akien-mga added this to the 4.2 milestone Aug 12, 2023
@akien-mga akien-mga changed the title GPU/CPU synchronization is wrong Vulkan GPU/CPU synchronization is wrong Aug 12, 2023
@Calinou
Member

Calinou commented Aug 12, 2023

We should lower the default frame_count in the long run (and possibly expose a project setting to adjust it), as people have reported latency issues in Vulkan that don't occur with OpenGL.

@darksylinc
Contributor Author

darksylinc commented Aug 12, 2023

We should lower the default frame_count in the long run (and possibly expose a project setting to adjust it), as people have reported latency issues in Vulkan that don't occur with OpenGL.

I saw the other thread about latency and forgot to comment on it. The current implementation is too broken for a latency discussion to be meaningful yet.

Godot right now is doing double buffering (FRAME_LAG = 2). But because of the bug, it is actually somewhere between triple and double buffer (because we write to buffer[2] and right afterwards we wait on fence[0]).

Double buffering is the lowest latency that can be reached with reasonable performance. Also frame_count doesn't matter much as long as frame_count > FRAME_LAG.

In fact, in my PR I will rename frame_count because it's wrong and confusing.

Perhaps after the problem is fixed the latency problems are gone. But if not, we will have the proper code to test other values (for instance, FRAME_LAG should not be hardcoded).

Using the current nomenclature, we should always request frame_count = FRAME_LAG + 1. The value that should be user-tweakable is FRAME_LAG.

darksylinc added a commit to darksylinc/godot that referenced this issue Sep 16, 2023
The original code had three bugs:

1. It assumed Swapchain count can be used for proper CPU/GPU
synchronization (it's not)
2. It used a double buffer scheme (FRAME_LAG = 2) but the
vkWaitForFences/vkResetFences calls were in the wrong order and in the
wrong place
3. The swapchain bug forced a quad-buffer scheme with roughly 1 or 2
frames of wait; which hid all the issues

This commit gets rid of RenderingDeviceVulkan::frame_count &
RenderingDeviceVulkan::frame which should've never existed and makes
everything depend on VulkanContext::frame_index &
VulkanContext::FRAME_LAG

Add ability to tweak FRAME_LAG from within Project Settings
Adds new setting display/window/vsync/buffer_count
Adds new setting display/window/vsync/swapchain_count

See godotengine#80550 for details

------------------------------------------------------------------

Fix comment grammar
Use Godot's functions instead of std

------------------------------------------------------------------

Additionally handle inconsistent state when destroying windows:
window.image_acquired_semaphores[i] contains uninitialized values.

This means that if VulkanContext::_update_swap_chain had failed early
(this can happen for valid reasons and due to unexpected failures) then
_clean_up_swap_chain will try to destroy an invalid handle.
This fix always sets the handle to VK_NULL_HANDLE.

Likewise perform the checks in various other resources in
_clean_up_swap_chain in case the resources were partially initialized
and are now in an inconsistent state. The API should guarantee the call
to be valid if the handle was VK_NULL_HANDLE, but I don't trust driver
implementations to handle that case.

Affects godotengine#81670

Co-authored-by: Clay John <claynjohn@gmail.com>
darksylinc added a commit to darksylinc/godot that referenced this issue Sep 23, 2023
@AThousandShips AThousandShips modified the milestones: 4.2, 4.3 Oct 30, 2023
@TestSubject06
Contributor

Is this issue still slated for 4.3, or has it now moved to 4.x? This seems deeply related to the AMD issues we're seeing with frame shuffling when vsync is enabled, especially on 5700 XTs.

@TestSubject06
Contributor

I tested darksylinc's build of 4.2 tagged 'binary for testers' and still saw the stutter and shuffling issues on my Radeon 5700 XT.

The Vulkan renderer also frequently crashes and corrupts the entire display leading to an irrecoverable driver crash that requires a complete system shutdown. Unsure if it's related.

darksylinc added a commit to darksylinc/godot that referenced this issue Jan 14, 2024
@akien-mga
Member

@DarioSamo @darksylinc Can this be considered fixed by #87340, or only partially? If so, what's the current status?

@DarioSamo
Copy link
Contributor

DarioSamo commented Feb 14, 2024

The issue described here is solved, as the new synchronization order in the context rewrite is what Matias described as the correct solution.
