
[RUNTIME][VULKAN] vkBuffer released before memory copy command is sent to GPU #5388

@samwyi


Problem: When the Vulkan runtime uses deferred kernels to do a GPU -> GPU copy, the source vkBuffer is released before the copy command is put into the command buffer. This can cause a segfault in the Vulkan driver.

Details:
The problem happens at:
args = [nd.array(x, ctx=ctx) for x in args] (measure_methods.py: 482).
This line makes an RPC call to CopyDataFromTo() (vulkan.cc: 200) to do a GPU -> GPU copy on the remote device. CopyDataFromTo() schedules the copy commands in deferred mode, so they have not actually been put into the Vulkan command buffer by the time the RPC call returns.

Then the source ndarray is destroyed, because 'args' now points to the dest ndarray. Its destructor makes an RPC call to FreeDataSpace() to destroy the source vkBuffer. As a result, vkDestroyBuffer is issued before the vkCmdCopyBuffer command.

On the next line of measure_methods.py, ctx.sync() makes an RPC call to Synchronize(), which is when the copy commands are actually put into the command buffer. By then the source buffer is already invalid, so the driver segfaults while performing the copy.
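To make the ordering easier to see, here is a minimal toy model of the sequence described above. It is only a sketch: ToyDevice, DeferredStream and reproduce() are hypothetical names invented for illustration, not TVM or Vulkan code, and the model raises a RuntimeError at the point where the real driver would be handed an invalid buffer.

```python
class ToyDevice:
    """Toy stand-in for the Vulkan driver; none of these names are TVM's."""

    def __init__(self):
        self.live = set()  # buffers that currently exist
        self.log = []      # driver calls, in the order they were made

    def create_buffer(self, name):
        self.live.add(name)
        self.log.append(f"create {name}")

    def destroy_buffer(self, name):
        self.live.discard(name)
        self.log.append(f"destroy {name}")

    def cmd_copy_buffer(self, src, dst):
        # A real driver may crash here; the model raises instead.
        if src not in self.live or dst not in self.live:
            raise RuntimeError(f"copy {src} -> {dst} uses a destroyed buffer")
        self.log.append(f"copy {src} -> {dst}")


class DeferredStream:
    """Copies are queued and only reach the device on synchronize()."""

    def __init__(self, device):
        self.device = device
        self.pending = []

    def copy_data_from_to(self, src, dst):
        # Deferred: nothing reaches the device yet.
        self.pending.append((src, dst))

    def free_data_space(self, name):
        # Immediate: the buffer is destroyed right away.
        self.device.destroy_buffer(name)

    def synchronize(self):
        # The queued copies are only issued now.
        pending, self.pending = self.pending, []
        for src, dst in pending:
            self.device.cmd_copy_buffer(src, dst)


def reproduce():
    device = ToyDevice()
    stream = DeferredStream(device)
    device.create_buffer("src")
    device.create_buffer("dst")
    stream.copy_data_from_to("src", "dst")  # nd.array(x, ctx=ctx)
    stream.free_data_space("src")           # old ndarray destroyed
    try:
        stream.synchronize()                # ctx.sync()
    except RuntimeError as err:
        return device.log, str(err)
    return device.log, None
```

In this model, reproduce() returns a log in which "destroy src" appears and no copy was ever recorded, together with the error raised when the deferred copy finally runs.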

Reproduce:
Run the attached debug_tune.py, or any auto-tuning script that uses Vulkan and RPC.

Possible Solution:
Add a counter to the VulkanBuffer struct: increment it on each copy request, and decrement it once the copy command has actually been put into the command buffer. Destroy the buffer only when the counter is 0.
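The counter idea could look roughly like the following toy model. This is a sketch under assumptions, not a patch: CountedBuffer, DeferredStream and run_demo() are hypothetical names, the real change would go into the C++ runtime, and counting the destination buffer as well as the source is my own assumption rather than part of the proposal.

```python
class CountedBuffer:
    """Toy stand-in for VulkanBuffer with the proposed counter."""

    def __init__(self, name):
        self.name = name
        self.pending_copies = 0      # copies requested but not yet recorded
        self.free_requested = False  # free was asked for while copies pending
        self.destroyed = False


class DeferredStream:
    def __init__(self):
        self.pending = []
        self.log = []

    def _destroy(self, buf):
        buf.destroyed = True
        self.log.append(f"destroy {buf.name}")

    def copy(self, src, dst):
        # +1 on a copy request (assumption: both ends are counted).
        for buf in (src, dst):
            buf.pending_copies += 1
        self.pending.append((src, dst))

    def free(self, buf):
        # Destroy only when the counter is 0; otherwise remember the request.
        if buf.pending_copies == 0:
            self._destroy(buf)
        else:
            buf.free_requested = True

    def synchronize(self):
        pending, self.pending = self.pending, []
        for src, dst in pending:
            assert not src.destroyed and not dst.destroyed
            self.log.append(f"copy {src.name} -> {dst.name}")
            # -1 once the copy has been recorded; run any postponed free.
            for buf in (src, dst):
                buf.pending_copies -= 1
                if buf.pending_copies == 0 and buf.free_requested:
                    self._destroy(buf)


def run_demo():
    stream = DeferredStream()
    src, dst = CountedBuffer("src"), CountedBuffer("dst")
    stream.copy(src, dst)
    stream.free(src)      # postponed: one copy is still pending
    stream.synchronize()  # copy is recorded, then src is destroyed
    return stream.log
```

Here the free of the source buffer is postponed until the queued copy has been issued. Note that in real Vulkan a buffer must also stay alive until the submitted command buffer has finished executing, so an actual fix would likely need to delay destruction until after that point rather than right after recording.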

debug_tune.zip
