Add flags to the core queue interface for device-side ring buf/queue descriptor allocation #284
base: amd-staging
Conversation
FYI @benvanik this is still WIP, but let me know if this works for you and if the cacheable flag works/has benefits.
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
force-pushed from ea66d58 to 9971e7b
force-pushed from 76a3db2 to fe3307c
@@ -1272,7 +1272,7 @@ void BlitKernel::PopulateQueue(uint64_t index, uint64_t code_handle, void* args,
   std::atomic_thread_fence(std::memory_order_acquire);
   queue_buffer[index & queue_bitmask_] = packet;
   std::atomic_thread_fence(std::memory_order_release);
-  if (core::Runtime::runtime_singleton_->flag().dev_mem_queue() && !queue_->needsPcieOrdering()) {
+  if (queue_->IsDeviceMem() && !queue_->needsPcieOrdering()) {
@saleelk I'm curious if the logic is correct for the original needsPcieOrdering() method you added. Shouldn't this really be:
queue_->needsPcieOrdering()
Meaning we need to change the logic of that call internally?
force-pushed from fe3307c to 6bbccb0
force-pushed from 6bbccb0 to 436b64a
Looks good to me. Thank you!
force-pushed from 51dc622 to 4533c8d
 * The queue packet buffer and the queue struct should be allocated in
 * the agent's device memory.
 */
HSA_AMD_QUEUE_FLAG_DEVICE_MEM = (1 << 0),
We should explicitly mention whether it's cached or uncached. Like HSA_AMD_QUEUE_HOST = 0, HSA_AMD_QUEUE_DEV_UNCACHED = 1 << 0,
Removed the cache flag for now.
The change looks good except for the flags I mentioned and whether we should expose them, and thanks for fixing the typo I had.
 * Used to indicate if the queue created in device memory should be
 * cacheable.
 */
HSA_AMD_QUEUE_FLAG_CACHEABLE = (1 << 1),
We should probably discuss whether this is even doable at the moment. Flushing L2 is costly, and since we are writing a packet (i.e., 64 bytes), we can probably also have false-sharing issues.
Removed
force-pushed from 4533c8d to 47357bf
This builds on a prior change that allowed a user-mode queue's packet buffer to be allocated in device memory; it now allocates the queue struct in device memory as well. This provides additional latency benefits, particularly in cases where dispatches are performed from the GPU itself. Flags are added to support the various use cases.
force-pushed from 47357bf to b601fbc
Adds a bool to the GPU agent and a public member method to check whether the GPU supports large BAR. This is needed so we can check for large BAR support when a user tries to allocate an AQL queue in device memory on a given GPU agent. Also adds an exception to the AQL queue if device-side AQL queues are requested and the GPU owner of the AQL queue doesn't support large BAR. Otherwise, ROCr would allow device-side queues that cause faults when the user touches their ring buffers, and the user would not know why the faults are occurring. This relies on the fact that the KFD does not expose any links from the CPU to the GPU if large BAR is not enabled (though links from the GPU to the CPU may still be exposed by the KFD).
I was going to test this but I noticed there is no longer any such function as
Add flags to the queue interface to choose whether or not to allocate it in device memory.
Now the queue struct is also allocated in device memory.
Fixes a bug due to missing the executable flag when allocating the ring buffer in device memory.