-
Notifications
You must be signed in to change notification settings - Fork 14.1k
Description
Adding the follow method as part of a benchmark to library/alloc/benches/vec.rs
#[repr(transparent)]
pub struct Foo(usize);
#[inline(never)]
pub fn vec_cast(input: Vec<Foo>) -> Vec<usize> {
input.into_iter().map(|e| unsafe { std::mem::transmute(e) }).collect()
}which exercises this specialization in Vec
results in the following assembly (extracted with objdump):
0000000000086130 <collectionsbenches::vec::vec_cast>:
86130: 48 8b 0e mov (%rsi),%rcx
86133: 48 89 f8 mov %rdi,%rax
86136: 48 8b 56 08 mov 0x8(%rsi),%rdx
8613a: 48 8b 7e 10 mov 0x10(%rsi),%rdi
8613e: 48 89 ce mov %rcx,%rsi
86141: 48 85 ff test %rdi,%rdi
86144: 74 10 je 86156 <collectionsbenches::vec::vec_cast+0x26>
86146: 48 8d 34 f9 lea (%rcx,%rdi,8),%rsi
8614a: 48 c1 e7 03 shl $0x3,%rdi
8614e: 66 90 xchg %ax,%ax
86150: 48 83 c7 f8 add $0xfffffffffffffff8,%rdi ; <= RDI unused from here onwards
86154: 75 fa jne 86150 <collectionsbenches::vec::vec_cast+0x20>
86156: 48 29 ce sub %rcx,%rsi
86159: 48 89 08 mov %rcx,(%rax)
8615c: 48 89 50 08 mov %rdx,0x8(%rax)
86160: 48 c1 fe 03 sar $0x3,%rsi
86164: 48 89 70 10 mov %rsi,0x10(%rax)
86168: c3 retq The ghidra decompile for the same function (comments are mine):
void collectionsbenches::vec::vec_cast(long *param_1,long *param_2)
{
long lVar1;
long lVar2;
long lVar3;
long lVar4;
lVar1 = *param_2; // pointer
lVar2 = param_2[1]; // capacity
lVar4 = param_2[2]; // len
lVar3 = lVar1;
if (lVar4 != 0) {
lVar3 = lVar1 + lVar4 * 8; // end pointer of vec::IntoIter
lVar4 = lVar4 << 3; // len in bytes
do {
lVar4 = lVar4 + -8;
} while (lVar4 != 0); // <== lVar4 unused from here onwards
}
*param_1 = lVar1; // pointer
param_1[1] = lVar2; // capacity
param_1[2] = lVar3 - lVar1 >> 3; // len from pointer difference
return;
}Note the useless loop.
The number of loop iterations (or rather the pointer increments) is needed to calculate the new length of the output Vec. LLVM already manages to hoist lVar3 = lVar1 + lVar4 * 8; but then it fails to eliminate the now-useless loop.
The issue does not occur if one uses input.into_iter().flat_map(|e| None).collect() instead, which always results in length == 0.
I tried several variations of the loop (e.g. replacing try_fold with a simple while let Some() ...) but it generally results in the same or worse assembly.
Note: The assembly looks somewhat different if I run this on godbolt but the decrementing loop without side-effect is still there. I assume the differences are due to LTO or some other compiler settings.
Tested on commit a1a13b2 2020-11-21 22:46
@rustbot modify labels: +I-slow