Segfault during JIT memory invalidation #41852

Closed

Description

Follow-up on #39877, which reported two separate issues.

During heavy JIT compilation, certain NVIDIA ARM devices seem to segfault when our JIT's memory manager invalidates CPU caches:

Thread 1 "julia" received signal SIGSEGV, Segmentation fault.
__aarch64_sync_cache_range (base=0x7fac5faf90, end=0x7fac5fafe0) at /workspace/srcdir/gcc-11.1.0/libgcc/config/aarch64/sync-cache.c:61
61      /workspace/srcdir/gcc-11.1.0/libgcc/config/aarch64/sync-cache.c: No such file or directory.
(gdb) bt
#0  __aarch64_sync_cache_range (base=0x7fac5faf90, end=0x7fac5fafe0) at /workspace/srcdir/gcc-11.1.0/libgcc/config/aarch64/sync-cache.c:61
#1  0x0000007fb487737c in llvm::sys::Memory::InvalidateInstructionCache(void const*, unsigned long) () from /home/tim/julia/usr/bin/../lib/libLLVM-12jl.so
#2  0x0000007fb7918438 in (anonymous namespace)::ROAllocator<false>::finalize (this=0x555562dc70) at /home/tim/julia/src/cgmemmgr.cpp:529
#3  (anonymous namespace)::DualMapAllocator<false>::finalize (this=0x555562dc70) at /home/tim/julia/src/cgmemmgr.cpp:662
#4  0x0000007fb7919e1c in (anonymous namespace)::RTDyldMemoryManagerJL::finalizeMemory (this=0x555562ca60, ErrMsg=<optimized out>) at /home/tim/julia/src/cgmemmgr.cpp:890
#5  0x0000007fb61aaaf4 in llvm::RuntimeDyldImpl::finalizeAsync(std::unique_ptr<llvm::RuntimeDyldImpl, std::default_delete<llvm::RuntimeDyldImpl> >, llvm::unique_function<void (llvm::object::OwningBinary<llvm::object::ObjectFile>, std::unique_ptr<llvm::RuntimeDyld::LoadedObjectInfo, std::default_delete<llvm::RuntimeDyld::LoadedObjectInfo> >, llvm::Error)>, llvm::object::OwningBinary<llvm::object::ObjectFile>, std::unique_ptr<llvm::RuntimeDyld::LoadedObjectInfo, std::default_delete<llvm::RuntimeDyld::LoadedObjectInfo> >)::{lambda(llvm::Expected<std::map<llvm::StringRef, llvm::JITEvaluatedSymbol, std::less<llvm::StringRef>, std::allocator<std::pair<llvm::StringRef const, llvm::JITEvaluatedSymbol> > > >)#1}::operator()(llvm::Expected<std::map<llvm::StringRef, llvm::JITEvaluatedSymbol, std::less<llvm::StringRef>, std::allocator<std::pair<llvm::StringRef const, llvm::JITEvaluatedSymbol> > > >) ()
   from /home/tim/julia/usr/bin/../lib/libLLVM-12jl.so
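
For readers unfamiliar with frame #0: a cache-sync routine like this one typically walks the requested range one cache line at a time. The following is a rough sketch of that address arithmetic, not libgcc's actual source. The 64-byte line size and the names `line_floor`, `lines_touched` and `sync_code_range` are assumptions made for illustration; the real routine reads the line sizes from the CPU at run time.

```c
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: 64-byte cache lines, used only for illustration.
   The line size actually used on this CPU has not been checked. */
#define ASSUMED_LINE_SIZE 64u

/* Round an address down to the start of its cache line.
   line_size must be a power of two. */
uint64_t line_floor(uint64_t addr, uint64_t line_size) {
    return addr & ~(line_size - 1);
}

/* Number of cache lines a per-line loop over [base, end) would touch.
   On AArch64 each such line would get one clean operation (dc cvau),
   followed later by one instruction-cache invalidate (ic ivau). */
size_t lines_touched(uint64_t base, uint64_t end, uint64_t line_size) {
    if (end <= base)
        return 0;
    uint64_t first = line_floor(base, line_size);
    uint64_t last = line_floor(end - 1, line_size);
    return (size_t)((last - first) / line_size) + 1;
}

/* Hypothetical wrapper: the portable way to request the same work from
   C is the GCC/Clang builtin, which on AArch64 is expected to end up in
   the compiler runtime's cache-sync routine. On x86 it is a no-op. */
void sync_code_range(void *base, void *end) {
    __builtin___clear_cache((char *)base, (char *)end);
}
```

With the arguments from the backtrace, `lines_touched(0x7fac5faf90, 0x7fac5fafe0, ASSUMED_LINE_SIZE)` gives 2 under the 64-byte assumption, so the faulting call only had a couple of lines to process.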

This is on an NVIDIA Jetson AGX, which contains an NVIDIA-specific 8-core ARMv8 "carmel" CPU:

$ cat /proc/cpuinfo 
processor       : 0
model name      : ARMv8 Processor rev 0 (v8l)
BogoMIPS        : 62.50
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
CPU implementer : 0x4e
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0x004
CPU revision    : 0
MTS version     : 53250041

$ ./julia-f8d01c06bc/bin/julia -e 'using InteractiveUtils; versioninfo(verbose=true)'
Julia Version 1.8.0-DEV.322
Commit f8d01c06bc (2021-08-09 15:02 UTC)
Platform Info:
  OS: Linux (aarch64-unknown-linux-gnu)
      Ubuntu 18.04.5 LTS
  uname: Linux 4.9.253-tegra #1 SMP PREEMPT Mon Jul 26 12:19:28 PDT 2021 aarch64 aarch64
  CPU: unknown: 
              speed         user         nice          sys         idle          irq
       #1  1190 MHz      11032 s          0 s       2492 s     790054 s        855 s
       #2  1190 MHz      10412 s          0 s       2044 s     793590 s        163 s
       #3  1190 MHz      21233 s         28 s        998 s     784429 s        209 s
       #4  1190 MHz      11652 s          0 s        903 s     794254 s        139 s
       #5  1190 MHz      19191 s          0 s       1029 s     786214 s        200 s
       #6  1190 MHz      13590 s         59 s        930 s     791896 s        157 s
       #7  1190 MHz      37045 s         28 s       1013 s     768389 s        331 s
       #8  1190 MHz      18772 s          0 s        938 s     786873 s        202 s
       
  Memory: 15.445869445800781 GB (11692.98046875 MB free)
  Uptime: 80707.02 sec
  Load Avg:  0.08  0.06  0.04
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, carmel)

The segfault happens when executing the dc cvau cache-clean instruction, and the reported fault is of kind SEGV_MAPERR (info->si_code = 1). This may be a processor bug: there are a couple of ARM errata regarding this instruction, though those apply to Cortex-A53 cores, and I haven't really debugged this. FWIW, we used to be able to run the CUDA.jl tests on this device, even though this issue occasionally caused precompilation to segfault. That is no longer the case, and I now run into the segfault much more frequently than before.
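
For anyone who wants to check the fault kind on their own device, here is a minimal sketch of a SIGSEGV handler that reports si_code. It is not the code Julia uses; `segv_code_name` and `install_segv_reporter` are hypothetical names, and the handler simply prints and exits rather than attempting any recovery.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Map the si_code of a SIGSEGV to a readable name.
   SEGV_MAPERR: address not mapped; SEGV_ACCERR: bad permissions. */
const char *segv_code_name(int si_code) {
    switch (si_code) {
    case SEGV_MAPERR: return "SEGV_MAPERR";
    case SEGV_ACCERR: return "SEGV_ACCERR";
    default:          return "other";
    }
}

/* Handler: restricted to async-signal-safe calls (write, _exit). */
static void on_segv(int sig, siginfo_t *info, void *ctx) {
    (void)ctx;
    const char *name = segv_code_name(info->si_code);
    size_t len = 0;
    while (name[len] != '\0')
        len++;
    (void)write(STDERR_FILENO, name, len);
    (void)write(STDERR_FILENO, "\n", 1);
    _exit(128 + sig);
}

/* Install the reporter; returns 0 on success, -1 on failure. */
int install_segv_reporter(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;   /* needed to receive siginfo_t */
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

Calling `install_segv_reporter()` early in a small reproducer would print the fault kind before the process exits. A SEGV_MAPERR here would be surprising for a range the JIT has just written to, which is part of why a hardware issue seems plausible.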
