ROCm blog post #490

Merged
merged 4 commits into from
Jul 1, 2025

Conversation

Zeldhoron
Contributor

No description provided.

@Zeldhoron Zeldhoron mentioned this pull request Jun 24, 2025
@boegel boegel self-assigned this Jun 25, 2025
Comment on lines 53 to 54
Currently, we're conducting our testing and building on a virtual machine environment without physical AMD GPU hardware.
This approach allows us to validate the build process and catch configuration issues, though it means some sanity checks that require actual GPU hardware are necessarily skipped during the build process.

Just as a note, something I noticed while going through the ROCm stack myself:
Most, if not all, of the ROCm packages I've handled so far don't actually check for GPU hardware at build time. Some packages have an option for this, but it is disabled by default.

Once we enable testing (e.g. ctest), actual hardware is required for the tests to work. Depending on the hardware, there's also a high likelihood that some tests fail. I don't know yet how we want to handle this.
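
To get a feel for how many tests fail on a given GPU, one could pull the numbers out of the summary line that ctest prints at the end of a run. This is only a rough sketch: the function name `parse_ctest_summary` is made up for illustration, it is not part of EasyBuild or ROCm, and it assumes the usual "N% tests passed, M tests failed out of T" summary format.

```python
import re

# Summary line printed by ctest at the end of a run, for example:
#   "95% tests passed, 3 tests failed out of 60"
CTEST_SUMMARY = re.compile(
    r"(\d+)% tests passed, (\d+) tests? failed out of (\d+)"
)


def parse_ctest_summary(output):
    """Return (failed, total) from ctest output, or None if no summary line is found.

    Hypothetical helper for illustration only.
    """
    match = CTEST_SUMMARY.search(output)
    if match is None:
        return None
    return int(match.group(2)), int(match.group(3))
```

The returned counts could then be logged per architecture, which would show how stable the test suites are on real hardware.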

Contributor Author

@boegel what are your thoughts on this (how to handle failing tests)?


One option would be to handle it similarly to LLVM: define a threshold in the EasyConfigs for the percentage or absolute number of tests we expect to fail.
I'll start enabling tests for my EasyConfigs soonish (once I find some time for it). Then we should get a better sense of whether tests fail, and how many.

For officially supported architectures, I expect this number to be quite small. If one tries to build for an architecture that is not officially supported, like gfx1152, I would expect to see quite a few failures.
In EasyBuild, one could simply specify --ignore-test-failure. EESSI would probably not include support for these architectures.
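
A threshold check along these lines could look roughly like the sketch below. All names here (`failures_within_threshold`, `max_failed`, `max_failed_pct`) are hypothetical and only illustrate the idea; they are not existing EasyBuild or EasyConfig parameters.

```python
def failures_within_threshold(failed, total, max_failed=None, max_failed_pct=None):
    """Return True if the number of failed tests is acceptable.

    Hypothetical helper: max_failed is an absolute limit, max_failed_pct a
    percentage of all tests. If neither is given, no failures are tolerated.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if max_failed is None and max_failed_pct is None:
        return failed == 0
    if max_failed is not None and failed > max_failed:
        return False
    if max_failed_pct is not None and 100.0 * failed / total > max_failed_pct:
        return False
    return True
```

When both limits are given, both have to hold, so the stricter one decides. A build for an officially supported architecture could then use a small limit, while unsupported architectures would fall back to --ignore-test-failure.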


Our high-priority next steps focus on validation and completeness.
We need to add proper support for AMD GPU drivers and hardware detection, ensuring our built components can actually communicate with AMD GPUs.
We're also working to finish adding support for remaining core components and AMD's validation suite and examples that demonstrate real-world functionality.
@Thyre Thyre Jun 26, 2025

Maybe link to both the ROCmValidationSuite repository and ROCm-examples?
Worth noting that the full ROCm-examples repository basically requires all the components, including the AI stuff, math libraries and so on.

ROCmValidationSuite is a relatively easy target to reach, and I've verified that my EasyConfigs work at least on gfx1201 (on Arch Linux, with a system ROCm present as well) and gfx90a (in the EasyBuild Rocky Linux 9.5 container, without ROCm).

Contributor Author

Yes, we noticed that ROCm-examples is very comprehensive. If we can't manage to add support for all dependencies, we could maybe describe in the docs how people can clone the ROCm-examples repo themselves and build and run the subset of examples that is supported?


Since the ROCm examples are similarly structured to the CUDA examples, we unfortunately cannot disable certain parts easily. I agree though that we can write up some documentation to explain how users can run the examples.

Let's see first if we can build all the required modules though.

Contributor

Wouldn't it work for ROCm-examples to just go into a subdirectory and run only cmake and make in there?

@Thyre Thyre Jun 27, 2025

Checking a few of the examples, that might actually work.
We just cannot disable certain examples if we build from the top level (at least not without patching, e.g. to take the corresponding directories out).
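
For the docs, the per-subdirectory build could be sketched like this. It is untested against the actual repository: the helper name `example_build_commands` and the example path are placeholders, and it assumes the chosen subdirectory has its own CMakeLists.txt that configures standalone.

```python
from pathlib import Path


def example_build_commands(repo_dir, example_subdir, build_dir="build"):
    """Return the cmake commands to configure and build one example directory.

    Hypothetical helper: example_subdir is a placeholder for whichever
    example a user picks, not a path known to exist in ROCm-examples.
    """
    source = Path(repo_dir) / example_subdir
    build = source / build_dir
    return [
        # Configure only the selected example, not the top-level project.
        ["cmake", "-S", str(source), "-B", str(build)],
        # Build it with whatever generator cmake selected (make by default on Linux).
        ["cmake", "--build", str(build)],
    ]
```

Each returned command can be run with `subprocess.run(cmd, check=True)` from inside an environment where the required ROCm modules are loaded.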

@boegel boegel marked this pull request as ready for review July 1, 2025 12:29
Contributor

boegel commented Jul 1, 2025

This looks ready to publish.
I know that @Zeldhoron is afk currently, but we've been in touch on this, so I'll go ahead and merge this PR so it's available via https://eessi.io/docs/blog/

@boegel boegel merged commit ecb53d7 into EESSI:main Jul 1, 2025
1 check passed