[Offload] Offload to NVIDIA GPUs fails with CUDA 13.0 or newer (LLVM 20.1.8) #159088

@Thyre

I'm still trying to figure out what goes wrong exactly, but maybe someone has an idea.

With LLVM 20.1.8, built as a bootstrapped build with EasyBuild, any program built with OpenMP offload, e.g. via -fopenmp --offload-arch=sm_75, fails when run with:

omptarget error: Consult https://openmp.llvm.org/design/Runtimes.html for debugging options.
omptarget error: No images found compatible with the installed hardware.
[1]    14988 segmentation fault (core dumped)  OMP_TARGET_OFFLOAD=mandatory ./zaxpy
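For reference, a minimal sketch of the kind of program that hits this path. The actual zaxpy source is not part of this report, so everything below (the function names `zaxpy` and `run_zaxpy_check`, the problem size, the input values) is a hypothetical stand-in. It assumes only that the program opens a `target data` region, which is what the `__tgt_target_data_begin_mapper` frame in the backtrace suggests.

```c
#include <complex.h>
#include <stddef.h>

/* Hypothetical stand-in for the zaxpy test program (original source not shown).
   Computes y = a * x + y, offloading the loop inside a target data region. */
void zaxpy(size_t n, double complex a, const double complex *x, double complex *y)
{
    /* Entering a target data region is expected to go through
       __tgt_target_data_begin_mapper, the libomptarget entry point
       seen in frame #5 of the backtrace. */
    #pragma omp target data map(to: x[0:n]) map(tofrom: y[0:n])
    {
        #pragma omp target teams distribute parallel for
        for (size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }
}

/* Fills two small arrays, runs zaxpy, and compares against a host-side
   reference. Returns 0 if every element matches, 1 otherwise. */
int run_zaxpy_check(void)
{
    enum { N = 1024 };
    static double complex x[N], y[N];
    const double complex a = 2.0 + 1.0 * I;

    for (size_t i = 0; i < N; ++i) {
        x[i] = (double)i + 1.0 * I;
        y[i] = 1.0;
    }

    zaxpy(N, a, x, y);

    for (size_t i = 0; i < N; ++i) {
        /* Inputs are small integers, so the result is exact in double. */
        const double complex expected = a * ((double)i + 1.0 * I) + 1.0;
        if (y[i] != expected)
            return 1;
    }
    return 0;
}
```

Calling `run_zaxpy_check()` from `main`, building with the flags mentioned above (-fopenmp --offload-arch=sm_75) and running with OMP_TARGET_OFFLOAD=mandatory should exercise the same runtime path. Without -fopenmp the pragmas are ignored and the loop simply runs on the host.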

Looking closer with GDB, the stack trace looks like this:

Using host libthread_db library "/usr/lib/x86_64-linux-gnu/libthread_db.so.1".
omptarget error: Consult https://openmp.llvm.org/design/Runtimes.html for debugging options.
omptarget error: No images found compatible with the installed hardware. 
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff5286ccf in llvm::object::ELFObjectFileBase::getNVPTXCPUName() const () from /opt/EasyBuild/apps/software/LLVM/20.1.8-GCCcore-14.3.0/lib/libLLVM.so.20.1
(gdb) bt
#0  0x00007ffff5286ccf in llvm::object::ELFObjectFileBase::getNVPTXCPUName() const () from /opt/EasyBuild/apps/software/LLVM/20.1.8-GCCcore-14.3.0/lib/libLLVM.so.20.1
#1  0x00007ffff5286c53 in llvm::object::ELFObjectFileBase::tryGetCPUName() const () from /opt/EasyBuild/apps/software/LLVM/20.1.8-GCCcore-14.3.0/lib/libLLVM.so.20.1
#2  0x00007ffff7a9cca1 in handleTargetOutcome(bool, ident_t*) () from /opt/EasyBuild/apps/software/LLVM/20.1.8-GCCcore-14.3.0/lib/x86_64-unknown-linux-gnu/libomptarget.so.20.1
#3  0x00007ffff7a97f43 in checkDevice(long&, ident_t*) () from /opt/EasyBuild/apps/software/LLVM/20.1.8-GCCcore-14.3.0/lib/x86_64-unknown-linux-gnu/libomptarget.so.20.1
#4  0x00007ffff7a984e0 in void targetData<AsyncInfoTy>(ident_t*, long, int, void**, void**, long*, long*, void**, void**, int (*)(ident_t*, DeviceTy&, int, void**, void**, long*, long*, void**, void**, AsyncInfoTy&, bool), char const*, char const*) ()
   from /opt/EasyBuild/apps/software/LLVM/20.1.8-GCCcore-14.3.0/lib/x86_64-unknown-linux-gnu/libomptarget.so.20.1
#5  0x00007ffff7a980c4 in __tgt_target_data_begin_mapper () from /opt/EasyBuild/apps/software/LLVM/20.1.8-GCCcore-14.3.0/lib/x86_64-unknown-linux-gnu/libomptarget.so.20.1
#6  0x000055555555ae7f in main ()

With CUDA 12.9.1 or earlier, everything looks okay. So far, only CUDA 13.0.0 and 13.0.1 seem to be affected.
I haven't tried LLVM 21.1.1 yet, mostly because the only machine I can test this on takes quite long to build LLVM.

It's also worth noting that one can build the application with an older CUDA and then run it with the newer one. The other way around (building with the newer CUDA and running with the older one) also fails. Maybe some change between these major versions causes the issue. There were the announced ELF visibility and linkage changes, but as far as I understand, these only affect nvcc. The driver itself should be recent enough (580.65.06).

LLVM 20.1.8 is particularly interesting because e.g. Numba will support that particular version soon, while CUDA 13 is interesting for its better support of recent GPUs. Mixing LLVM versions would be a noticeable inconvenience.

I'll now try to get a version of LLVM 21 built for cross-checking. Maybe the issue is already resolved and I just haven't found the correct PR for that yet.
