Opened on Oct 15, 2024
I use Julia on a heterogeneous compute cluster that consists of many nodes with different CPUs but a single shared file system. This has caused problems before: in 1.10, precompilation was triggered every time a project was used on a different node. However, that was easily fixed by using separate projects. In 1.11, this workaround no longer seems to work.
For example, let's say I create two projects, env1 and env2. I load env1 on my workstation (Intel Xeon W-2223), add a package (say, JLD2) and precompile it. Then I load env2 on a compute node (AMD EPYC 7302) and add the same package. Despite the different CPU, no precompilation is triggered. Then, when I try to run some code, Julia crashes on an illegal instruction:
julia> using JLD2
julia> jldsave("test.jld2", a=rand(100))
Invalid instruction at 0x1552538da346: 0x62, 0xf2, 0xfd, 0x28, 0x7c, 0xc0, 0xc4, 0xc1, 0x7e, 0x7f, 0x44, 0x24, 0x10, 0x4d, 0x89
[1482091] signal 4 (2): Illegal instruction
in expression starting at REPL[3]:1
MmapIO at /home/sschult/.julia/packages/JLD2/3zWRM/src/io/mmapio.jl:14 [inlined]
MmapIO at /home/sschult/.julia/packages/JLD2/3zWRM/src/io/mmapio.jl:113
openfile at /home/sschult/.julia/packages/JLD2/3zWRM/src/JLD2.jl:146 [inlined]
openfile at /home/sschult/.julia/packages/JLD2/3zWRM/src/JLD2.jl:151
#jldopen#22 at /home/sschult/.julia/packages/JLD2/3zWRM/src/JLD2.jl:215
jldopen at /home/sschult/.julia/packages/JLD2/3zWRM/src/JLD2.jl:164 [inlined]
#jldopen#23 at /home/sschult/.julia/packages/JLD2/3zWRM/src/JLD2.jl:286 [inlined]
jldopen at /home/sschult/.julia/packages/JLD2/3zWRM/src/JLD2.jl:279 [inlined]
#jldsave#107 at /home/sschult/.julia/packages/JLD2/3zWRM/src/loadsave.jl:286
jldsave at /home/sschult/.julia/packages/JLD2/3zWRM/src/loadsave.jl:283 [inlined]
jldsave at /home/sschult/.julia/packages/JLD2/3zWRM/src/loadsave.jl:283
unknown function (ip: 0x15525cb97c66)
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-master/src/julia.h:2157 [inlined]
do_call at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:126
eval_value at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:223
eval_stmt_value at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:174 [inlined]
eval_body at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:663
jl_interpret_toplevel_thunk at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:821
jl_toplevel_eval_flex at /cache/build/builder-amdci5-1/julialang/julia-master/src/toplevel.c:943
jl_toplevel_eval_flex at /cache/build/builder-amdci5-1/julialang/julia-master/src/toplevel.c:886
ijl_toplevel_eval_in at /cache/build/builder-amdci5-1/julialang/julia-master/src/toplevel.c:994
eval at ./boot.jl:430 [inlined]
eval_user_input at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:226
repl_backend_loop at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:323
#start_repl_backend#59 at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:308
start_repl_backend at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:305
#run_repl#72 at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:464
run_repl at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:450
jfptr_run_repl_10212 at /home/sschult/.julia/juliaup/julia-1.11.0+0.x64.linux.gnu/share/julia/compiled/v1.11/REPL/u0gqU_bFCI4.so (unknown line)
#1138 at ./client.jl:446
jfptr_YY.1138_14881 at /home/sschult/.julia/juliaup/julia-1.11.0+0.x64.linux.gnu/share/julia/compiled/v1.11/REPL/u0gqU_bFCI4.so (unknown line)
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-master/src/julia.h:2157 [inlined]
jl_f__call_latest at /cache/build/builder-amdci5-1/julialang/julia-master/src/builtins.c:875
#invokelatest#2 at ./essentials.jl:1054 [inlined]
invokelatest at ./essentials.jl:1051 [inlined]
run_main_repl at ./client.jl:430
repl_main at ./client.jl:567 [inlined]
_start at ./client.jl:541
jfptr__start_72051.1 at /home/sschult/.julia/juliaup/julia-1.11.0+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-master/src/julia.h:2157 [inlined]
true_main at /cache/build/builder-amdci5-1/julialang/julia-master/src/jlapi.c:900
jl_repl_entrypoint at /cache/build/builder-amdci5-1/julialang/julia-master/src/jlapi.c:1059
main at /cache/build/builder-amdci5-1/julialang/julia-master/cli/loader_exe.c:58
__libc_start_call_main at /lib64/libc.so.6 (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 1547120 (Pool: 1547037; Big: 83); GC: 3
Illegal instruction (core dumped)
The instruction in question appears to be vpbroadcastq from the AVX-512 instruction set, which the Intel Xeon W-2223 indeed supports and the AMD EPYC 7302 does not.
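The identification can be cross-checked by hand-decoding the first bytes from the crash message against the EVEX encoding described in the Intel Software Developer's Manual. The snippet below is a minimal sketch that assumes 64-bit mode and the standard EVEX prefix layout; the variable names are hypothetical and only illustrate the decoding. A leading 0x62 byte marks an EVEX (AVX-512) encoding, and the decoded fields (map 0F38, pp = 66, W1, L'L = 256-bit, opcode 7C) correspond to `EVEX.256.66.0F38.W1 7C /r`, i.e. `vpbroadcastq ymm, r64`, which requires AVX512F and AVX512VL.

```julia
# Leading bytes of the faulting instruction, copied from the crash message above.
bytes = UInt8[0x62, 0xf2, 0xfd, 0x28, 0x7c, 0xc0]

# In 64-bit mode, a leading 0x62 byte is the EVEX prefix, used only by AVX-512 encodings.
is_evex = bytes[1] == 0x62

# P0 (byte 2): the low two bits are EVEX.mm; 0b10 selects the 0F38 opcode map.
opcode_map = bytes[2] & 0x03

# P1 (byte 3): the top bit is EVEX.W (1 = quadword form);
# the low two bits are EVEX.pp (0b01 = implied 66 prefix).
w  = bytes[3] >> 7
pp = bytes[3] & 0x03

# P2 (byte 4): bits 6..5 are EVEX.L'L; 0b01 selects a 256-bit (ymm) vector length.
vector_length = (bytes[4] >> 5) & 0x03

# Opcode byte; the following ModRM byte 0xc0 encodes register operands (ymm0, rax).
opcode = bytes[5]

println("EVEX=", is_evex, " map=", opcode_map, " W=", w, " pp=", pp,
        " L'L=", vector_length, " opcode=0x", string(opcode, base=16))
```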
- This works without errors in 1.10.5.
- This problem is not specific to JLD2. I have obtained similar errors with, for example, CairoMakie.jl, CUDA.jl or Arrow.jl, which crash on other instructions from the same instruction set.
- If I instead use env2 on another node with, e.g., an Intel Xeon E5-2698 v3, which also does not support AVX-512, the behaviour is different: precompilation is triggered and no error is thrown.
- If I delete .julia/compiled/v1.11 and first load env2 on the compute node, everything works fine until I use env1 on my workstation; after that, the same error occurs on the compute node.