Opened on Aug 10, 2021
Follow-up on #39877, which reported two separate issues.
During heavy JIT compilation, specific NVIDIA ARM devices seem to segfault when our JIT's memory manager invalidates CPU caches:
Thread 1 "julia" received signal SIGSEGV, Segmentation fault.
__aarch64_sync_cache_range (base=0x7fac5faf90, end=0x7fac5fafe0) at /workspace/srcdir/gcc-11.1.0/libgcc/config/aarch64/sync-cache.c:61
61 /workspace/srcdir/gcc-11.1.0/libgcc/config/aarch64/sync-cache.c: No such file or directory.
(gdb) bt
#0 __aarch64_sync_cache_range (base=0x7fac5faf90, end=0x7fac5fafe0) at /workspace/srcdir/gcc-11.1.0/libgcc/config/aarch64/sync-cache.c:61
#1 0x0000007fb487737c in llvm::sys::Memory::InvalidateInstructionCache(void const*, unsigned long) () from /home/tim/julia/usr/bin/../lib/libLLVM-12jl.so
#2 0x0000007fb7918438 in (anonymous namespace)::ROAllocator<false>::finalize (this=0x555562dc70) at /home/tim/julia/src/cgmemmgr.cpp:529
#3 (anonymous namespace)::DualMapAllocator<false>::finalize (this=0x555562dc70) at /home/tim/julia/src/cgmemmgr.cpp:662
#4 0x0000007fb7919e1c in (anonymous namespace)::RTDyldMemoryManagerJL::finalizeMemory (this=0x555562ca60, ErrMsg=<optimized out>) at /home/tim/julia/src/cgmemmgr.cpp:890
#5 0x0000007fb61aaaf4 in llvm::RuntimeDyldImpl::finalizeAsync(std::unique_ptr<llvm::RuntimeDyldImpl, std::default_delete<llvm::RuntimeDyldImpl> >, llvm::unique_function<void (llvm::object::OwningBinary<llvm::object::ObjectFile>, std::unique_ptr<llvm::RuntimeDyld::LoadedObjectInfo, std::default_delete<llvm::RuntimeDyld::LoadedObjectInfo> >, llvm::Error)>, llvm::object::OwningBinary<llvm::object::ObjectFile>, std::unique_ptr<llvm::RuntimeDyld::LoadedObjectInfo, std::default_delete<llvm::RuntimeDyld::LoadedObjectInfo> >)::{lambda(llvm::Expected<std::map<llvm::StringRef, llvm::JITEvaluatedSymbol, std::less<llvm::StringRef>, std::allocator<std::pair<llvm::StringRef const, llvm::JITEvaluatedSymbol> > > >)#1}::operator()(llvm::Expected<std::map<llvm::StringRef, llvm::JITEvaluatedSymbol, std::less<llvm::StringRef>, std::allocator<std::pair<llvm::StringRef const, llvm::JITEvaluatedSymbol> > > >) ()
from /home/tim/julia/usr/bin/../lib/libLLVM-12jl.so
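For anyone who wants to poke at this outside of Julia, here is a minimal sketch of the operation the top two frames perform: write to a freshly mapped region, then ask the compiler runtime to synchronize the caches over that range. With GCC on AArch64, `__builtin___clear_cache` ends up in libgcc's `__aarch64_sync_cache_range`, the function in frame #0. The helper name `write_and_sync` is hypothetical, the region is left read/write rather than remapped executable as a real JIT would do, and I have not confirmed that this reproduces the crash.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper, not Julia's or LLVM's code: map a fresh anonymous
 * region, fill it, and synchronize caches over it.
 * Returns 0 on success, -1 if the mapping could not be created. */
int write_and_sync(size_t len)
{
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    /* Stand-in for a JIT emitting machine code into the buffer. */
    memset(buf, 0, len);

    /* GCC/Clang builtin; on AArch64 this is where the data cache clean
     * and instruction cache invalidate are issued. On x86 it does nothing. */
    __builtin___clear_cache(buf, buf + len);

    munmap(buf, len);
    return 0;
}
```

Calling `write_and_sync(4096)` in a tight loop on the affected device would be one way to check whether the fault depends on Julia's dual-mapped allocator at all.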
This is on an NVIDIA Jetson AGX, which contains an NVIDIA-specific 8-core ARMv8 "carmel" CPU:
$ cat /proc/cpuinfo
processor : 0
model name : ARMv8 Processor rev 0 (v8l)
BogoMIPS : 62.50
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
CPU implementer : 0x4e
CPU architecture: 8
CPU variant : 0x0
CPU part : 0x004
CPU revision : 0
MTS version : 53250041
$ ./julia-f8d01c06bc/bin/julia -e 'using InteractiveUtils; versioninfo(verbose=true)'
Julia Version 1.8.0-DEV.322
Commit f8d01c06bc (2021-08-09 15:02 UTC)
Platform Info:
OS: Linux (aarch64-unknown-linux-gnu)
Ubuntu 18.04.5 LTS
uname: Linux 4.9.253-tegra #1 SMP PREEMPT Mon Jul 26 12:19:28 PDT 2021 aarch64 aarch64
CPU: unknown:
speed user nice sys idle irq
#1 1190 MHz 11032 s 0 s 2492 s 790054 s 855 s
#2 1190 MHz 10412 s 0 s 2044 s 793590 s 163 s
#3 1190 MHz 21233 s 28 s 998 s 784429 s 209 s
#4 1190 MHz 11652 s 0 s 903 s 794254 s 139 s
#5 1190 MHz 19191 s 0 s 1029 s 786214 s 200 s
#6 1190 MHz 13590 s 59 s 930 s 791896 s 157 s
#7 1190 MHz 37045 s 28 s 1013 s 768389 s 331 s
#8 1190 MHz 18772 s 0 s 938 s 786873 s 202 s
Memory: 15.445869445800781 GB (11692.98046875 MB free)
Uptime: 80707.02 sec
Load Avg: 0.08 0.06 0.04
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, carmel)
The segfault happens when executing the dc cvau cache-clean instruction, and the reported fault is of kind SEGV_MAPERR (info->si_code = 1). This may be a processor bug (there are a couple of ARM errata regarding this instruction, although those are for the Cortex-A53), but I haven't really debugged this. FWIW, we used to be able to run the CUDA.jl tests on this device, despite this issue occasionally causing precompilation to segfault. That is no longer the case, and I now run into the segfault much more frequently than before.
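For reference, the si_code value can be read with a SIGSEGV handler installed via `sigaction` with `SA_SIGINFO`. The sketch below is not the handler Julia uses; `segv_code_name` and `install_segv_reporter` are hypothetical names, and the handler only reports the fault kind before exiting.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Map a SIGSEGV si_code to its symbolic name. */
const char *segv_code_name(int si_code)
{
    switch (si_code) {
    case SEGV_MAPERR: return "SEGV_MAPERR"; /* address not mapped to object */
    case SEGV_ACCERR: return "SEGV_ACCERR"; /* invalid permissions for mapped object */
    default:          return "other";
    }
}

/* Handler: report the fault kind using only write(2), then exit. */
static void on_segv(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    const char *name = segv_code_name(info->si_code);
    write(STDERR_FILENO, name, strlen(name));
    write(STDERR_FILENO, "\n", 1);
    _exit(1);
}

/* Install the handler; returns 0 on success, -1 on failure. */
int install_segv_reporter(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO; /* needed to receive the siginfo_t argument */
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

SEGV_MAPERR means the kernel found no mapping at the faulting address, as opposed to SEGV_ACCERR, where a mapping exists but its permissions forbid the access.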