Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add flags to the core queue interface for device-side ring buf/queue descriptor allocation #284

Open
wants to merge 4 commits into
base: amd-staging
Choose a base branch
from

Conversation

atgutier
Copy link
Contributor

@atgutier atgutier commented Jan 22, 2025

Add flags to the queue interface to choose whether or not to allocate it in device memory.

Now the queue struct is also allocated in device memory.

Fixes a bug due to missing the executable flag when allocating the ring buffer in device memory.

@atgutier
Copy link
Contributor Author

FYI @benvanik this is still WIP, but let me know if this works for you and if the cacheable flag works/has benefits.

@amd-jmacaran
Copy link
Contributor

/AzurePipelines run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@@ -1272,7 +1272,7 @@ void BlitKernel::PopulateQueue(uint64_t index, uint64_t code_handle, void* args,
std::atomic_thread_fence(std::memory_order_acquire);
queue_buffer[index & queue_bitmask_] = packet;
std::atomic_thread_fence(std::memory_order_release);
if (core::Runtime::runtime_singleton_->flag().dev_mem_queue() && !queue_->needsPcieOrdering()) {
if (queue_->IsDeviceMem() && !queue_->needsPcieOrdering()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@saleelk I'm curious if the logic is correct for the original needsPcieOrdering() method you added. Shouldn't this really be:

queue_->needsPcieOrdering()

Meaning we need to change the logic of that call internally?

@atgutier atgutier force-pushed the atgutier/queue-struct-dev-mem branch from fe3307c to 6bbccb0 Compare January 28, 2025 21:21
@atgutier atgutier requested a review from dayatsin-amd January 28, 2025 21:21
@atgutier atgutier force-pushed the atgutier/queue-struct-dev-mem branch from 6bbccb0 to 436b64a Compare January 28, 2025 21:23
Copy link
Collaborator

@dayatsin-amd dayatsin-amd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thank you!

@atgutier atgutier force-pushed the atgutier/queue-struct-dev-mem branch 9 times, most recently from 51dc622 to 4533c8d Compare January 29, 2025 20:25
@atgutier atgutier requested a review from dayatsin-amd January 29, 2025 20:28
* The queue packet buffer and the queue struct should be allocated in
* the agent's device memory.
*/
HSA_AMD_QUEUE_FLAG_DEVICE_MEM = (1 << 0),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should explicitly mention whether its cached on uncached. Like HSA_AMD_QUEUE_HOST = 0, HSA_AMD_QUEUE_DEV_UNCACHED = 1 << 0,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the cache flag for now.

Copy link
Contributor

@saleelk saleelk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the change loosk good except for the flags I mentioned and if we should expose them, and thanks for fixing the typo I had.

* Used to indicate if the queue created in device memory should be
* cacheable.
*/
HSA_AMD_QUEUE_FLAG_CACHEABLE = (1 << 1),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably discuss if this is even doable at the moment, to flush L2 is costly and if we are writing a packet (aka 64bytes), we can probably also have false sharing issues/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

@atgutier atgutier force-pushed the atgutier/queue-struct-dev-mem branch from 4533c8d to 47357bf Compare February 17, 2025 23:51
@atgutier atgutier changed the title Add API to create queue in device memory Add flags to the core queue interface for device-side ring buf/queue descriptor allocation Feb 17, 2025
@atgutier atgutier added the bug label Feb 18, 2025
This builds on a prior change that allowed for allocating
a user-mode queue's packet buffer in device memory to also
allocate the queue struct in device memory. This provides
additional latency benefits particularly for cases where
dispatches are performed from the GPU itself. Flags are
added to support the various use cases.
Adds a bool to the GPU agent and a public member method to
check if the GPU supports large BAR. This is needed so we can
check if large BAR is supported when a user tries to allocate
an AQL queue in device memory on a given GPU agent.

Also adds an exception to the AQL queue if device-side AQL queues
are requested and the GPU owner of the AQL doesn't support large
BAR. Otherwise, ROCr will currently allow device-side queues
that can cause faults when the user tries to touch their ring
buffers and the user will not know why the faults are occuring.

This relies on the fact that the KFD does not exposed any links
from the CPU to the GPU if large BAR is not enabled (though
links from the GPU to the CPU may still be exposed by the KFD).
@misos1
Copy link

misos1 commented Feb 19, 2025

I was going to test this but I noticed there is no longer any such function as hsa_amd_queue_create, only hsa_queue_create which sets just one of the new flags based on ENV.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants