Description
Consider this example:
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
define void @foo(ptr noalias %slice.0, ptr noalias %value) unnamed_addr {
start:
  %0 = load i128, ptr %value, align 8
  store i128 %0, ptr %slice.0, align 16
  ret void
}
Godbolt: https://llc.godbolt.org/z/fMMPbacGx
Expected asm output:
foo: # @foo
        movups  xmm0, xmmword ptr [rsi]
        movaps  xmmword ptr [rdi], xmm0
        ret
Actual asm output:
foo: # @foo
        mov     rax, qword ptr [rsi]
        mov     rcx, qword ptr [rsi + 8]
        mov     qword ptr [rdi + 8], rcx
        mov     qword ptr [rdi], rax
        ret
Replacing the two occurrences of i128 with <2 x i64> gives the expected output instead, so the issue appears to be specific to i128. Alternatively, changing the align 8 on the load to align 16 causes LLVM to correctly output movaps instructions.
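Since raising the load's alignment to 16 fixes the codegen, one possible source-level workaround on the Rust side is to give the element type 16-byte alignment explicitly. The sketch below assumes that rustc then emits align 16 on both the load and the store; whether LLVM actually selects movaps for it has not been verified here. AlignedU128 and fill_aligned are hypothetical names used only for illustration.

```rust
/// Hypothetical wrapper that forces at least 16-byte alignment on a u128.
/// Assumption: loads/stores through &AlignedU128 are emitted with `align 16`,
/// which per the experiment above should let LLVM use aligned vector moves.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(C, align(16))]
pub struct AlignedU128(pub u128);

/// Same shape as the original `foo`, but over the 16-byte-aligned wrapper.
pub fn fill_aligned(slice: &mut [AlignedU128], value: &AlignedU128) {
    // Copy `*value` into every element of `slice`.
    slice.fill(*value)
}
```

The trade-off is ergonomic: callers must wrap and unwrap values (x.0), so this only makes sense if the codegen difference matters in practice.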
Real-world background:
I noticed that filling a slice of u128 in Rust doesn't optimize as well as filling a slice of [u64; 2]:
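// Swapping in `type T = u128;` gives the worse codegen described above.
//type T = u128;
type T = [u64; 2];
pub fn foo(slice: &mut [T], value: &T) {
    // Copy `*value` into every element of `slice`.
    slice.fill(*value)
}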
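To compare both element types side by side (for example with rustc -O --emit asm, or on Godbolt), the two variants can live in one file. This is a minimal sketch: fill_u128, fill_pairs, split and join are hypothetical helper names, and split/join only show that the two representations carry the same 128 bits (low half first by construction), not how either is laid out in memory.

```rust
/// Element type u128: reported to lower to two 64-bit scalar moves per element.
pub fn fill_u128(slice: &mut [u128], value: &u128) {
    slice.fill(*value)
}

/// Element type [u64; 2]: reported to lower to a 128-bit vector move.
pub fn fill_pairs(slice: &mut [[u64; 2]], value: &[u64; 2]) {
    slice.fill(*value)
}

/// Split a u128 into [low 64 bits, high 64 bits].
pub fn split(v: u128) -> [u64; 2] {
    [v as u64, (v >> 64) as u64]
}

/// Inverse of `split`: rebuild the u128 from [low, high].
pub fn join(p: [u64; 2]) -> u128 {
    (p[1] as u128) << 64 | p[0] as u128
}
```

Both functions compute the same result; only the generated machine code is expected to differ.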