Problem Description
I am trying to use the new "AMD Modern Fortran Compiler" described here:
https://github.com/amd/InfinityHub-CI/tree/main/fortran
on my code, which uses "do concurrent" for GPU offload with optional OpenMP target data movement (for GPUs/compilers that do not support unified memory).
The code is "HipFT" located publicly here:
github.com/predsci/hipft
The code works on NVIDIA GPUs with nvfortran and HPE, and on Intel GPUs with ifx.
It also compiles and runs on AMD server GPUs with HPE's CCE compiler (see https://arxiv.org/pdf/2408.07843)
I have compiled HDF5 1.14.3 (with a configure fix) and OpenMPI 5.0.6 with the amdflang and amdclang compilers to link to the code.
When I try to compile with:
-O3 -fopenmp -fdo-concurrent-parallel=device --offload-arch=gfx906
I get:
LLVM ERROR: aborting
make: *** [Makefile:25: hipft.o] Error 1
I am using 'mpif90' to compile the code, which wraps amdflang:
$ mpif90 -show
amdflang -I/opt/psi/amd/ext_deps/deps/openmpi-5.0.6/include -I/opt/psi/amd/ext_deps/deps/openmpi-5.0.6/lib -L/opt/psi/amd/ext_deps/deps/openmpi-5.0.6/lib -Wl,-rpath -Wl,/opt/psi/amd/ext_deps/deps/openmpi-5.0.6/lib -Wl,--enable-new-dtags -lmpi_usempif08 -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi
If I try to compile without any OpenMP or Do Concurrent flags, the code compiles fine and runs correctly on 1 CPU core.
If I try to compile with just OpenMP turned on and "do concurrent" set to host, I get a lot of serialization warnings:
warning: loc("/home/caplanr/hipft/git_amd/src/hipft.f90":7683:7): Some
do concurrent loops are not perfectly-nested. These will be serialzied.
These concern me: if I cannot use DC with index ranges like "2:N-1", then I doubt the code will parallelize at all on either the GPU or the CPU, since a LOT of the loops are like that.
Note I also had to use: -L/opt/amdfort/llvm/lib -lomptarget
in this case; otherwise the linker cannot find the OpenMP target data movement symbols (although they should not be used in this case...).
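To check whether loops with interior index ranges are really what triggers the serialization warning, a reduced test case along the following lines could be tried. This is only a sketch under stated assumptions: the file name dc_test.f90 and the stencil loop are hypothetical and not taken from HipFT, amdflang is assumed to be on the PATH, and "-fdo-concurrent-parallel=host" is assumed to be the host counterpart of the "=device" setting used above.

```shell
#!/bin/sh
# Sketch of a reduced test case (hypothetical; not taken from HipFT).
# Assumptions: amdflang is on PATH, and "=host" is accepted by
# -fdo-concurrent-parallel in the same way as "=device".
SRC="dc_test.f90"
FC="amdflang"
FFLAGS="-O3 -fopenmp -fdo-concurrent-parallel=host"

# Write a small stencil loop that uses interior index ranges (2:n-1).
cat > "$SRC" <<'EOF'
program dc_test
  implicit none
  integer, parameter :: n = 64
  integer :: i, j
  real(8) :: a(n,n), b(n,n)

  a = 1.0d0
  b = 0.0d0

  ! Multi-index do concurrent over the interior points only.
  do concurrent (j = 2:n-1, i = 2:n-1)
    b(i,j) = 0.25d0*(a(i-1,j) + a(i+1,j) + a(i,j-1) + a(i,j+1))
  end do

  print *, sum(b)
end program dc_test
EOF

# Compile only if the compiler is available; otherwise just keep the source.
if command -v "$FC" >/dev/null 2>&1; then
  # FFLAGS is intentionally unquoted so it splits into separate options.
  "$FC" $FFLAGS "$SRC" -o dc_test || echo "compile failed for $SRC"
else
  echo "$FC not found; wrote $SRC only"
fi
```

If this small case also produces the "not perfectly-nested" warning, it would point at the loop form rather than at anything specific to HipFT. Linking may additionally need the -L/opt/amdfort/llvm/lib -lomptarget flags noted above.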
Any help would be appreciated as I plan to present the code at SIAM's CSE meeting in a few months and would really like to have some AMD results.
-- Ron
Operating System
Rocky Linux 9.5 (Blue Onyx)
CPU
Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
GPU
AMD Radeon VII, gfx906, amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-, amdgcn-amd-amdhsa--gfx9-generic:sramecc+:xnack-
ROCm Version
ROCm 6.2.3
ROCm Component
flang
Steps to Reproduce
My ROCm version is actually 6.2.4, but that is not in the list.
My Linux kernel is: edge 6.10.6-1.el9.elrepo.x86_64
To reproduce, install the new AMD flang compiler from:
https://github.com/amd/InfinityHub-CI/tree/main/fortran
Next, clone the repo:
git clone https://github.com/predsci/hipft
Then, copy one of the build scripts from the build_examples folder and edit the top portion to resemble this:
FC="mpif90"
HDF5_INCLUDE_DIR="${PS_EXT_DEPS_HOME}/hdf5/include"
HDF5_LIB_DIR="${PS_EXT_DEPS_HOME}/hdf5/lib"
HDF5_LIB_FLAGS="-lhdf5_fortran -lhdf5hl_fortran -lhdf5 -lhdf5_hl"
FFLAGS="-O3 -fopenmp --offload-arch=gfx906 -fdo-concurrent-parallel=device"
But:
- Replace the HDF5 paths with the paths to an HDF5 library compiled with amdflang.
- The mpif90 wrapper should also belong to an MPI library compiled with amdflang.
- Replace gfx906 with the GPU arch you are using.
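For the GPU arch, one option is to read it from rocminfo instead of typing it by hand. The snippet below is a rough sketch, not part of the HipFT build scripts: the helper name detect_gfx_arch and the GFX_ARCH variable are made up for illustration, and it assumes rocminfo prints an agent line such as "Name: gfx906" as in the output further down. It falls back to gfx906 when rocminfo is not available.

```shell
#!/bin/sh
# Hypothetical helper: extract the first gfx target name (e.g. gfx906)
# from rocminfo-style text on stdin. Names like "gfx9-generic" are
# skipped because at least three hex digits must follow "gfx".
detect_gfx_arch() {
  grep -o 'gfx[0-9a-f]\{3,\}' | head -n 1
}

# Query the installed GPU if rocminfo is on PATH.
GFX_ARCH=""
if command -v rocminfo >/dev/null 2>&1; then
  GFX_ARCH="$(rocminfo 2>/dev/null | detect_gfx_arch)"
fi

# Fall back to the arch used in this report if nothing was detected.
GFX_ARCH="${GFX_ARCH:-gfx906}"

FFLAGS="-O3 -fopenmp --offload-arch=${GFX_ARCH} -fdo-concurrent-parallel=device"
echo "$FFLAGS"
```

The resulting FFLAGS line can then be pasted into the top portion of the build script in place of the hard-coded one.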
Now, try to run the build script in the top level directory of the repo.
You should see:
./build_amd_gpu.sh
=== STARTING HIPFT BUILD ===
==> Entering src directory...
==> Removing old Makefile...
==> Generating Makefile from Makefile.template...
==> Compiling code...
!!> ERROR! hipft executable not found. Build most likely failed.
Contents of src/build.err:
LLVM ERROR: aborting
make: *** [Makefile:25: hipft.o] Error 1
You can go into the src folder and try to edit the Makefile and recompile as needed.
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
$ /opt/rocm/bin/rocminfo --support
ROCk module is loaded
HSA System Attributes
Runtime Version: 1.14
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
Agent 1
Name: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
Uuid: CPU-XX
Marketing Name: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 4000
BDFID: 0
Internal Node ID: 0
Compute Unit: 6
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 32508640(0x1f00ae0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 32508640(0x1f00ae0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 32508640(0x1f00ae0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 32508640(0x1f00ae0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
Agent 2
Name: gfx906
Uuid: GPU-b86490a172da5ee9
Marketing Name: AMD Radeon VII
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 26287(0x66af)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1801
BDFID: 3584
Internal Node ID: 1
Compute Unit: 60
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 472
SDMA engine uCode:: 145
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16760832(0xffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 16760832(0xffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
ISA 2
Name: amdgcn-amd-amdhsa--gfx9-generic:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
Additional Information
Here are my installed amd and rocm packages:
$ sudo dnf list --installed | grep amd
50:amd-smi-lib.x86_64 24.6.3.60204-139.el9 @rocm
51:amdgpu-core.noarch 1:6.2.60204-2070768.el9 @amdgpu
52:amdgpu-dkms.noarch 1:6.8.5.60204-2070768.el9 @amdgpu
53:amdgpu-dkms-firmware.noarch 1:6.8.5.60204-2070768.el9 @amdgpu
54:amdgpu-install.noarch 6.2.60204-2070768.el9 @@commandline
280:dkms.noarch 3.1.0-2.el9 @amdgpu
569:hip-runtime-amd.x86_64 6.2.41134.60204-139.el9 @rocm
592:hsa-amd-aqlprofile.x86_64 1.0.0.60204.60204-139.el9 @rocm
990:libdrm-amdgpu.x86_64 1:2.4.120.60204-2070768.el9 @amdgpu
991:libdrm-amdgpu-common.noarch 1.0.0.60204-2070768.el9 @amdgpu
992:libdrm-amdgpu-devel.x86_64 1:2.4.120.60204-2070768.el9 @amdgpu
1458:mesa-amdgpu-dri-drivers.x86_64 1:24.2.0.60204-2070768.el9 @amdgpu
1459:mesa-amdgpu-filesystem.x86_64 1:24.2.0.60204-2070768.el9 @amdgpu
1460:mesa-amdgpu-libGL.x86_64 1:24.2.0.60204-2070768.el9 @amdgpu
1461:mesa-amdgpu-va-drivers.x86_64 1:24.2.0.60204-2070768.el9 @amdgpu
2230:teamd.x86_64 1.31-16.el9_1 @baseos
$ sudo dnf list --installed | grep rocm
50:amd-smi-lib.x86_64 24.6.3.60204-139.el9 @rocm
178:comgr.x86_64 2.8.0.60204-139.el9 @rocm
179:composablekernel-devel.x86_64 1.1.0.60204-139.el9 @rocm
558:half.x86_64 1.12.0.60204-139.el9 @rocm
567:hip-devel.x86_64 6.2.41134.60204-139.el9 @rocm
568:hip-doc.x86_64 6.2.41134.60204-139.el9 @rocm
569:hip-runtime-amd.x86_64 6.2.41134.60204-139.el9 @rocm
570:hip-samples.x86_64 6.2.41134.60204-139.el9 @rocm
571:hipblas.x86_64 2.2.0.60204-139.el9 @rocm
572:hipblas-devel.x86_64 2.2.0.60204-139.el9 @rocm
573:hipblaslt.x86_64 0.8.0.60204-139.el9 @rocm
574:hipblaslt-devel.x86_64 0.8.0.60204-139.el9 @rocm
575:hipcc.x86_64 1.1.1.60204-139.el9 @rocm
576:hipcub-devel.x86_64 3.2.1.60204-139.el9 @rocm
577:hipfft.x86_64 1.0.16.60204-139.el9 @rocm
578:hipfft-devel.x86_64 1.0.16.60204-139.el9 @rocm
579:hipfort-devel.x86_64 0.4.0.60204-139.el9 @rocm
580:hipify-clang.x86_64 18.0.0.60204-139.el9 @rocm
581:hiprand.x86_64 2.11.1.60204-139.el9 @rocm
582:hiprand-devel.x86_64 2.11.1.60204-139.el9 @rocm
583:hipsolver.x86_64 2.2.0.60204-139.el9 @rocm
584:hipsolver-devel.x86_64 2.2.0.60204-139.el9 @rocm
585:hipsparse.x86_64 3.1.1.60204-139.el9 @rocm
586:hipsparse-devel.x86_64 3.1.1.60204-139.el9 @rocm
587:hipsparselt.x86_64 0.2.1.60204-139.el9 @rocm
588:hipsparselt-devel.x86_64 0.2.1.60204-139.el9 @rocm
589:hiptensor.x86_64 1.3.0.60204-139.el9 @rocm
590:hiptensor-devel.x86_64 1.3.0.60204-139.el9 @rocm
592:hsa-amd-aqlprofile.x86_64 1.0.0.60204.60204-139.el9 @rocm
593:hsa-rocr.x86_64 1.14.0.60204-139.el9 @rocm
594:hsa-rocr-devel.x86_64 1.14.0.60204-139.el9 @rocm
595:hsakmt-roct-devel.x86_64 20240607.5.7.60204-139.el9 @rocm
1477:migraphx.x86_64 2.10.0.60204-139.el9 @rocm
1478:migraphx-devel.x86_64 2.10.0.60204-139.el9 @rocm
1480:miopen-hip.x86_64 3.2.0.60204-139.el9 @rocm
1481:miopen-hip-devel.x86_64 3.2.0.60204-139.el9 @rocm
1482:mivisionx.x86_64 3.0.0.60204-139 @rocm
1483:mivisionx-devel.x86_64 3.0.0.60204-139 @rocm
1574:openmp-extras-devel.x86_64 18.62.0.60204-139.el9 @rocm
1575:openmp-extras-runtime.x86_64 18.62.0.60204-139.el9 @rocm
2048:rccl.x86_64 2.20.5.60204-139.el9 @rocm
2049:rccl-devel.x86_64 2.20.5.60204-139.el9 @rocm
2058:rocalution.x86_64 3.2.1.60204-139.el9 @rocm
2059:rocalution-devel.x86_64 3.2.1.60204-139.el9 @rocm
2060:rocblas.x86_64 4.2.4.60204-139.el9 @rocm
2061:rocblas-devel.x86_64 4.2.4.60204-139.el9 @rocm
2062:rocdecode.x86_64 0.6.0.60204-139 @rocm
2063:rocdecode-devel.x86_64 0.6.0.60204-139 @rocm
2064:rocfft.x86_64 1.0.30.60204-139.el9 @rocm
2065:rocfft-devel.x86_64 1.0.30.60204-139.el9 @rocm
2073:rocm.x86_64 6.2.4.60204-139.el9 @rocm
2074:rocm-cmake.x86_64 0.13.0.60204-139.el9 @rocm
2075:rocm-core.x86_64 6.2.4.60204-139.el9 @rocm
2076:rocm-dbgapi.x86_64 0.76.0.60204-139.el9 @rocm
2077:rocm-debug-agent.x86_64 2.0.3.60204-139.el9 @rocm
2078:rocm-developer-tools.x86_64 6.2.4.60204-139.el9 @rocm
2079:rocm-device-libs.x86_64 1.0.0.60204-139.el9 @rocm
2080:rocm-gdb.x86_64 14.2.60204-139.el9 @rocm
2081:rocm-hip-libraries.x86_64 6.2.4.60204-139.el9 @rocm
2082:rocm-hip-runtime.x86_64 6.2.4.60204-139.el9 @rocm
2083:rocm-hip-runtime-devel.x86_64 6.2.4.60204-139.el9 @rocm
2084:rocm-hip-sdk.x86_64 6.2.4.60204-139.el9 @rocm
2085:rocm-language-runtime.x86_64 6.2.4.60204-139.el9 @rocm
2086:rocm-llvm.x86_64 18.0.0.24392.60204-139.el9 @rocm
2087:rocm-ml-libraries.x86_64 6.2.4.60204-139.el9 @rocm
2088:rocm-ml-sdk.x86_64 6.2.4.60204-139.el9 @rocm
2089:rocm-opencl.x86_64 2.0.0.60204-139.el9 @rocm
2090:rocm-opencl-devel.x86_64 2.0.0.60204-139.el9 @rocm
2091:rocm-opencl-icd-loader.x86_64 1.2.60204-139.el9 @rocm
2092:rocm-opencl-runtime.x86_64 6.2.4.60204-139.el9 @rocm
2093:rocm-opencl-sdk.x86_64 6.2.4.60204-139.el9 @rocm
2094:rocm-openmp-sdk.x86_64 6.2.4.60204-139.el9 @rocm
2095:rocm-smi-lib.x86_64 7.3.0.60204-139.el9 @rocm
2096:rocm-utils.x86_64 6.2.4.60204-139.el9 @rocm
2097:rocminfo.x86_64 1.0.0.60204-139.el9 @rocm
2098:rocprim-devel.x86_64 3.2.2.60204-139.el9 @rocm
2099:rocprofiler.x86_64 2.0.60204.60204-139.el9 @rocm
2100:rocprofiler-devel.x86_64 2.0.60204.60204-139.el9 @rocm
2101:rocprofiler-plugins.x86_64 2.0.60204.60204-139.el9 @rocm
2102:rocprofiler-register.x86_64 0.4.0.60204-139.el9 @rocm
2103:rocprofiler-sdk.x86_64 0.4.0-139.el9 @rocm
2104:rocprofiler-sdk-roctx.x86_64 0.4.0-139.el9 @rocm
2105:rocrand.x86_64 3.1.1.60204-139.el9 @rocm
2106:rocrand-devel.x86_64 3.1.1.60204-139.el9 @rocm
2107:rocsolver.x86_64 3.26.2.60204-139.el9 @rocm
2108:rocsolver-devel.x86_64 3.26.2.60204-139.el9 @rocm
2109:rocsparse.x86_64 3.2.1.60204-139.el9 @rocm
2110:rocsparse-devel.x86_64 3.2.1.60204-139.el9 @rocm
2111:rocthrust-devel.x86_64 3.1.1.60204-139.el9 @rocm
2112:roctracer.x86_64 4.1.60204.60204-139.el9 @rocm
2113:roctracer-devel.x86_64 4.1.60204.60204-139.el9 @rocm
2114:rocwmma-devel.x86_64 1.5.0.60204-139.el9 @rocm
2127:rpp.x86_64 1.8.0.60204-139.el9 @rocm
2128:rpp-devel.x86_64 1.8.0.60204-139.el9 @rocm