ROCm blog post #490
Currently, we're conducting our testing and building on a virtual machine environment without physical AMD GPU hardware.
This approach allows us to validate the build process and catch configuration issues, though it means some sanity checks that require actual GPU hardware are necessarily skipped during the build process.
Just as a note, something I noticed while going through the ROCm stack myself:
Most, if not all, of the ROCm stack I've handled so far doesn't actually check for hardware at build time. Some packages have an option for this, but it is disabled by default.
Once we try to enable testing (e.g. `ctest`), actual hardware is required for the tests to work. Depending on the hardware, there's also a high likelihood that some tests fail. I don't know yet how we want to handle this.
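One way to keep build-only environments working once tests are enabled would be to gate the test step on whether a GPU is visible at all. The following is a minimal sketch, assuming that the amdgpu kernel driver exposes the `/dev/kfd` compute device node when a usable GPU is present. The helper name `amd_gpu_visible` is hypothetical and not part of EasyBuild or ROCm.

```python
import os


def amd_gpu_visible(kfd_path="/dev/kfd"):
    """Return True if the ROCm compute device node exists and is accessible.

    Assumption: a readable and writable /dev/kfd is a good enough proxy for
    "tests that need a GPU can be attempted". It says nothing about which
    gfx architecture is installed.
    """
    return os.path.exists(kfd_path) and os.access(kfd_path, os.R_OK | os.W_OK)


if __name__ == "__main__":
    if amd_gpu_visible():
        print("AMD GPU device node found: tests can be attempted")
    else:
        print("no AMD GPU device node: skipping hardware tests")
```

A check like this only decides whether to run tests. It does not address tests that fail on hardware that is present.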
@boegel what are your thoughts on this (how to handle failing tests)?
One option would be to handle it similarly to LLVM: define a threshold in the EasyConfigs for the percentage or flat number of tests we expect to fail.
I'll start enabling tests for my EasyConfigs soonish (once I find some time for it). Then we may get a better sense of whether (and how many) tests are failing.
For officially supported architectures, I expect this number to be quite small. If one tries to build for an architecture that is not officially supported, for example `gfx1152`, I would expect to see quite a few failures.
In EasyBuild, one could simply specify `--ignore-test-failure`. EESSI would probably not include support for these architectures.
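To make the threshold idea concrete, here is a minimal sketch of such a check. It assumes the test step produces the usual `ctest` summary line (for example `95% tests passed, 3 tests failed out of 60`). The function name and the `max_failed` / `max_failed_pct` parameters are hypothetical, and this is not how EasyBuild implements the check for LLVM.

```python
import re

# ctest ends a run with a summary line such as:
#   "95% tests passed, 3 tests failed out of 60"
_SUMMARY = re.compile(r"(\d+)% tests passed, (\d+) tests? failed out of (\d+)")


def failures_within_threshold(ctest_output, max_failed=0, max_failed_pct=0.0):
    """Return True if the failure count is within either allowed limit.

    max_failed     -- flat number of failing tests that is tolerated
    max_failed_pct -- percentage of failing tests that is tolerated
    """
    match = _SUMMARY.search(ctest_output)
    if match is None:
        raise ValueError("no ctest summary line found")
    failed, total = int(match.group(2)), int(match.group(3))
    failed_pct = 100.0 * failed / total if total else 0.0
    return failed <= max_failed or failed_pct <= max_failed_pct
```

Accepting either a flat count or a percentage mirrors the two variants mentioned above. Which one is more practical probably depends on how stable the total number of tests is between ROCm releases.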
Our high-priority next steps focus on validation and completeness.
We need to add proper support for AMD GPU drivers and hardware detection, ensuring our built components can actually communicate with AMD GPUs.
We're also working to finish adding support for remaining core components and AMD's validation suite and examples that demonstrate real-world functionality.
Maybe link to both the ROCmValidationSuite repository and ROCm-examples?
Worth noting that the full ROCm-examples repository basically requires all the components, including the AI stuff, math libraries and so on.
ROCmValidationSuite is a somewhat easy target to reach, and I've verified that my EasyConfigs work on at least `gfx1201` (on Arch Linux with a system ROCm present as well) and `gfx90a` (in the EasyBuild Rocky Linux 9.5 container without ROCm).
Yes, we noticed ROCm-examples is very comprehensive. If we can't manage to add support for all dependencies, we could maybe describe in the docs how people can clone the ROCm-examples repo themselves and build and run the subset of examples that is supported?
Since the ROCm examples are similarly structured to the CUDA examples, we unfortunately cannot disable certain parts easily. I agree though that we can write up some documentation to explain how users can run the examples.
Let's see first if we can build all the required modules though.
Just diving into a subdirectory and only running `cmake` and `make` in there wouldn't work for `ROCm-examples`?
Checking a few of the examples, that might actually work.
We just cannot disable certain examples if we build from the top level (without patching, e.g. patching out the included directories).
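For the documentation, the per-subdirectory approach could be sketched as follows. This assumes the chosen example directory ships its own standalone `CMakeLists.txt`, which seems to hold for the examples checked but has not been verified for all of them. The helper names are hypothetical, and the directory arguments are placeholders for whichever example a user picks.

```python
import subprocess
from pathlib import Path


def example_build_commands(example_dir, build_dir):
    """Return the cmake invocations to configure and build one example.

    Uses `cmake -S <src> -B <build>` (CMake 3.13+) followed by
    `cmake --build <build>`, so no top-level ROCm-examples build is needed.
    """
    src, bld = Path(example_dir), Path(build_dir)
    return [
        ["cmake", "-S", str(src), "-B", str(bld)],
        ["cmake", "--build", str(bld)],
    ]


def build_example(example_dir, build_dir):
    """Run the configure and build steps, stopping at the first failure."""
    for cmd in example_build_commands(example_dir, build_dir):
        subprocess.run(cmd, check=True)
```

Building out of tree in a separate build directory keeps the cloned repository clean, so users can try several examples without leftover artifacts.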
This looks ready to publish.