[Issue]: ROCR uses the kfd ipc ioctls on the mainline kernel #270

Open
IMbackK opened this issue Dec 18, 2024 · 3 comments

IMbackK commented Dec 18, 2024

Problem Description

Currently ROCR uses dmabuf instead of the KFD IPC ioctls only when HSA_ENABLE_IPC_MODE_LEGACY is set to 0; see:

enable_ipc_mode_legacy_ = (var == "0") ? false : true; // Legacy mode by default

This happens with no regard for whether the ioctls required for legacy mode are even supported by the running kernel, causing obscure and difficult-to-debug issues like ROCm/rccl#1454.
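To make the default explicit, here is a minimal sketch of what the quoted expression implies. It assumes the variable arrives as a `std::string` that is empty when the environment variable is unset; the function names `LegacyIpcEnabled`, `LegacyIpcEnabledFromEnv` and `CheckLegacyDefaults` are hypothetical and not taken from ROCR.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Same expression as the quoted ROCR line: any value other than "0"
// (including an empty or unset variable) selects legacy mode.
bool LegacyIpcEnabled(const std::string& var) {
  return (var == "0") ? false : true;
}

// Hypothetical helper: reads the variable, treating "unset" as empty.
bool LegacyIpcEnabledFromEnv() {
  const char* raw = std::getenv("HSA_ENABLE_IPC_MODE_LEGACY");
  return LegacyIpcEnabled(raw ? raw : "");
}

// The resulting defaults, written as executable checks.
void CheckLegacyDefaults() {
  assert(LegacyIpcEnabled(""));    // unset or empty: legacy
  assert(LegacyIpcEnabled("1"));   // any other value: legacy
  assert(!LegacyIpcEnabled("0"));  // only "0" selects dmabuf
}
```

Nothing in this decision looks at the running kernel, which is the gap this issue describes.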

ROCR should

  1. use dmabuf by default when running on the mainline kernel, or
  2. at least assert with a reasonable error message when HSA_ENABLE_IPC_MODE_LEGACY=0 is not set but the kernel doesn't support the ioctls required for legacy mode.
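The two options above could look roughly like the following. This is only a sketch under stated assumptions: `IpcMode`, `SelectIpcModeAuto` and `SelectIpcModeStrict` are hypothetical names, and `kernel_has_legacy_ioctls` stands in for a capability check that ROCR would still need some way to perform.

```cpp
#include <stdexcept>
#include <string>

enum class IpcMode { kLegacyKfd, kDmabuf };

// Option 1 (hypothetical): honor an explicit setting, otherwise pick the
// mode the running kernel can actually serve.
// env_value is the raw HSA_ENABLE_IPC_MODE_LEGACY value, or nullptr if unset.
IpcMode SelectIpcModeAuto(const char* env_value, bool kernel_has_legacy_ioctls) {
  if (env_value != nullptr && *env_value != '\0') {
    return (std::string(env_value) == "0") ? IpcMode::kDmabuf
                                           : IpcMode::kLegacyKfd;
  }
  return kernel_has_legacy_ioctls ? IpcMode::kLegacyKfd : IpcMode::kDmabuf;
}

// Option 2 (hypothetical): keep the current default, but fail early with a
// readable message when legacy mode cannot work on this kernel.
IpcMode SelectIpcModeStrict(bool legacy_requested, bool kernel_has_legacy_ioctls) {
  if (legacy_requested && !kernel_has_legacy_ioctls) {
    throw std::runtime_error(
        "legacy IPC mode requires KFD IPC ioctls that this kernel does not "
        "provide; set HSA_ENABLE_IPC_MODE_LEGACY=0 to use dmabuf");
  }
  return legacy_requested ? IpcMode::kLegacyKfd : IpcMode::kDmabuf;
}
```

Both variants depend on knowing whether the kernel supports the legacy ioctls, so neither works without a reliable way to detect that.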

Operating System

any

CPU

any

GPU

any

ROCm Version

ROCm 6.3.0

ROCm Component

ROCR-Runtime

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@dayatsin-amd
Collaborator

Hi @IMbackK,
Using dmabuf for IPC is the long-term goal, but this feature was implemented only recently and we still have some stability issues with that implementation. This is why we have disabled it in the current code. These stability issues are being investigated. Once they are fixed, we will enable the option to allow it.

@IMbackK
Author

IMbackK commented Dec 18, 2024

IMO rocr should then at least print something like "not supported configuration" and abort when run on the mainline kernel and the ipc mechanism is used.

@dayatsin-amd
Collaborator

Yes, we are currently looking for an elegant way to handle this error. The current issue is that ROCr gets the same error code when the IOCTL does not exist and when the IOCTL fails for other reasons.
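Given that ambiguity, one stopgap would be an error message that names both possibilities instead of trying to tell them apart. The sketch below is illustrative only: `DescribeLegacyIpcFailure` is a hypothetical helper, and the ambiguous error code is passed in as a parameter because the thread does not say which code it is.

```cpp
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper: builds a diagnostic for a failed legacy IPC ioctl.
// `err` is the errno value observed; `ambiguous_err` is whichever code is
// returned both for "ioctl not implemented" and for ordinary failures.
std::string DescribeLegacyIpcFailure(int err, int ambiguous_err) {
  std::string msg = "legacy IPC ioctl failed: ";
  msg += std::strerror(err);
  if (err == ambiguous_err) {
    // Cannot tell the two cases apart, so mention the likely cause as a hint.
    msg += " (the running kernel may not implement the KFD IPC ioctls; "
           "if so, HSA_ENABLE_IPC_MODE_LEGACY=0 selects dmabuf instead)";
  }
  return msg;
}
```

For example, `DescribeLegacyIpcFailure(EINVAL, EINVAL)` appends the hint, while `DescribeLegacyIpcFailure(ENOMEM, EINVAL)` reports only the system error text.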
