Description
A generated executable occasionally fails to launch when built with the rustc options -Ctarget-feature=+avx -Copt-level=3 -Clto.
I tried this code:
fn main(){}
Compiled with the following shell script:
#!/bin/sh
rustc main.rs -Ctarget-feature=+avx -C opt-level=3 -Clto -g
When I ran the generated executable main repeatedly, it stalled 5 out of 100 times: the program neither terminated nor produced any output, and did not even enter the main function.
When I ran the executable under lldb, I could see that EXC_BAD_ACCESS had occurred because it attempted to load a 32-byte block from an unaligned address using vmovdqa (which requires the operand address to be 32-byte aligned).
- thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
frame #0: 0x0000000100000bf6 main`main + 518
main`main:
-> 0x100000bf6 <+518>: vmovdqa (%rax), %ymm0
0x100000bfa <+522>: movl $0x1, %ecx
0x100000bff <+527>: vmovq %rcx, %xmm1
0x100000c04 <+532>: vmovdqa %ymm1, (%rax)
(lldb) register read
General Purpose Registers:
rax = 0x0000000100300470
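As a quick sanity check of the register dump above, the following minimal sketch tests the reported rax value against 16-byte and 32-byte alignment. The helper name is_aligned is hypothetical and only for illustration; the address is the one shown by lldb.

```rust
/// Returns true if `addr` is a multiple of `align` (`align` must be a power of two).
fn is_aligned(addr: u64, align: u64) -> bool {
    debug_assert!(align.is_power_of_two());
    addr & (align - 1) == 0
}

fn main() {
    // Value of rax taken from the lldb register dump.
    let rax: u64 = 0x0000000100300470;
    println!("16-byte aligned: {}", is_aligned(rax, 16));
    println!("32-byte aligned: {}", is_aligned(rax, 32));
}
```

The low five bits of 0x470 are 0b10000, so the address is 16-byte aligned but not 32-byte aligned, which is consistent with vmovdqa faulting on a ymm-sized load.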
Meta
rustc --version --verbose:
rustc 1.21.0-nightly (469a6f9bd 2017-08-22)
binary: rustc
commit-hash: 469a6f9bd9aef394c5cff6b3bc41b8c520f9515b
commit-date: 2017-08-22
host: x86_64-apple-darwin
release: 1.21.0-nightly
LLVM version: 4.0
The output of sample (a tool that comes with macOS) when the program is stalled:
Call graph:
2721 Thread_15178881 DispatchQueue_1: com.apple.main-thread (serial)
2721 start (in libdyld.dylib) + 1 [0x7fffa220d235]
2721 0x0
2721 _sigtramp (in libsystem_platform.dylib) + 26 [0x7fffa241cb3a]
2721 std::sys::imp::stack_overflow::imp::signal_handler (in main) + 125 [0x105c58b7d] mem.rs:609
Analysis
The offending instruction appears to be part of libcore::ptr::swap_nonoverlapping_bytes, which is called during the execution of libstd::thread::local::LocalKey::init, which in turn is called while the runtime is being initialized.
#[inline]
unsafe fn swap_nonoverlapping_bytes(x: *mut u8, y: *mut u8, len: usize) {
    // <snip>
    #[cfg_attr(not(any(target_os = "emscripten", target_os = "redox",
                       target_endian = "big")),
               repr(simd))]
    struct Block(u64, u64, u64, u64);
    // <snip>
    // Swap a block of bytes of x & y, using t as a temporary buffer
    // This should be optimized into efficient SIMD operations where available
    copy_nonoverlapping(x, t, block_size); // <--- HERE
    // <snip>
}
After optimization, this call to the intrinsic function copy_nonoverlapping is translated into the following LLVM instruction:
%t.0.copyload.i.i.i.i.i.i.i.i.i = load <4 x i64>, <4 x i64>* bitcast ({ { { i64, [32 x i8] } }, { { i1 } }, { { i1 } }, [6 x i8] }* @_ZN3std10sys_common11thread_info11THREAD_INFO7__getit5__KEY17h80e4cdc49b84860aE to <4 x i64>*), align 32, !dbg !3742, !noalias !3762
This is translated into the following x86_64 instruction:
vmovdqa (%rax), %ymm0
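To make the expectation explicit, here is a minimal sketch of a block-wise swap through a temporary buffer. It is not the libcore implementation: repr(simd) is unstable, so repr(align(32)) is used as a stand-in (an assumption), and the names swap_one_block and swap_demo are hypothetical. The point is that x and y are plain *mut u8 pointers, so only the temporary is known to be 32-byte aligned; the source of the copy carries no such guarantee.

```rust
use std::mem::{align_of, size_of, MaybeUninit};
use std::ptr;

// Stand-in for the `repr(simd)` Block (assumption: `align(32)` replaces the
// unstable `repr(simd)` to obtain a 32-byte aligned temporary).
#[repr(C, align(32))]
struct Block([u64; 4]);

/// Swaps one Block-sized chunk of bytes between `x` and `y`.
/// Safety: both pointers must be valid for reads and writes of 32 bytes
/// and the two regions must not overlap. No alignment is required of them.
unsafe fn swap_one_block(x: *mut u8, y: *mut u8) {
    let n = size_of::<Block>();
    let mut tmp = MaybeUninit::<Block>::uninit();
    let t = tmp.as_mut_ptr() as *mut u8; // only `t` is known to be 32-byte aligned
    unsafe {
        ptr::copy_nonoverlapping(x, t, n); // corresponds to the line marked HERE
        ptr::copy_nonoverlapping(y, x, n);
        ptr::copy_nonoverlapping(t, y, n);
    }
}

/// Swaps bytes 1..33 of two buffers; the one-byte offset means the
/// pointers passed in need not be 32-byte aligned.
fn swap_demo() -> ([u8; 33], [u8; 33]) {
    let mut a = [1u8; 33];
    let mut b = [2u8; 33];
    unsafe {
        swap_one_block(a.as_mut_ptr().add(1), b.as_mut_ptr().add(1));
    }
    (a, b)
}

fn main() {
    assert_eq!(align_of::<Block>(), 32);
    let (a, b) = swap_demo();
    println!("a[0]={} a[1]={} b[0]={} b[1]={}", a[0], a[1], b[0], b[1]);
}
```

Because the byte pointers promise only 1-byte alignment, a correct lowering of the first copy would use an unaligned load (for example vmovdqu) unless the compiler can prove stronger alignment of the source. The align 32 on the load above indicates that it assumed such alignment for the thread-local key, which the observed rax value contradicts.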