Add flags to the core queue interface for device-side ring buf/queue descriptor allocation #284
base: amd-staging
Conversation
FYI @benvanik this is still WIP, but let me know if this works for you and if the cacheable flag works/has benefits.
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
force-pushed from ea66d58 to 9971e7b
force-pushed from 76a3db2 to fe3307c
@@ -1272,7 +1272,7 @@ void BlitKernel::PopulateQueue(uint64_t index, uint64_t code_handle, void* args,
   std::atomic_thread_fence(std::memory_order_acquire);
   queue_buffer[index & queue_bitmask_] = packet;
   std::atomic_thread_fence(std::memory_order_release);
-  if (core::Runtime::runtime_singleton_->flag().dev_mem_queue() && !queue_->needsPcieOrdering()) {
+  if (queue_->IsDeviceMem() && !queue_->needsPcieOrdering()) {
@saleelk I'm curious if the logic is correct for the original needsPcieOrdering() method you added. Shouldn't this really be:
queue_->needsPcieOrdering()
Meaning we need to change the logic of that call internally?
force-pushed from fe3307c to 6bbccb0
force-pushed from 6bbccb0 to 436b64a
Looks good to me. Thank you!
force-pushed from 51dc622 to 4533c8d
 * The queue packet buffer and the queue struct should be allocated in
 * the agent's device memory.
 */
HSA_AMD_QUEUE_FLAG_DEVICE_MEM = (1 << 0),
We should explicitly mention whether it's cached or uncached. Like HSA_AMD_QUEUE_HOST = 0, HSA_AMD_QUEUE_DEV_UNCACHED = 1 << 0,
Removed the cache flag for now.
The change looks good except for the flags I mentioned and whether we should expose them, and thanks for fixing the typo I had.
 * Used to indicate if the queue created in device memory should be
 * cacheable.
 */
HSA_AMD_QUEUE_FLAG_CACHEABLE = (1 << 1),
We should probably discuss whether this is even doable at the moment. Flushing L2 is costly, and since we are writing a packet (i.e., 64 bytes), we can probably also have false-sharing issues.
Removed
force-pushed from 4533c8d to 47357bf
This builds on a prior change that allowed a user-mode queue's packet buffer to be allocated in device memory; it now allocates the queue struct in device memory as well. This provides additional latency benefits, particularly in cases where dispatches are performed from the GPU itself. Flags are added to support the various use cases.
force-pushed from 47357bf to b601fbc
Adds a bool to the GPU agent and a public member method to check whether the GPU supports large BAR. This is needed so we can check for large BAR support when a user tries to allocate an AQL queue in device memory on a given GPU agent. Also adds an exception to the AQL queue if device-side AQL queues are requested and the GPU owner of the AQL queue doesn't support large BAR. Otherwise, ROCr would allow device-side queues that cause faults when the user touches their ring buffers, and the user would not know why the faults are occurring. This relies on the fact that the KFD does not expose any links from the CPU to the GPU if large BAR is not enabled (though links from the GPU to the CPU may still be exposed by the KFD).
I was going to test this but I noticed there is no longer any such function as
Add flags to the queue interface to choose whether or not to allocate it in device memory.
Now the queue struct is also allocated in device memory.
Fixes a bug due to missing the executable flag when allocating the ring buffer in device memory.