I inspected the generated assembly code and benchmarked zeroize for [u8; 32] on x86_64 and found it quite inefficient, storing one byte at a time.
On my Ryzen CPU, it takes ~7.8324 ns, or ~1 cycle per byte. The binary code size is also quite large.
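The byte stores are probably a consequence of volatile semantics rather than a missed optimization: compilers do not merge or widen volatile accesses, so clearing a [u8; 32] element by element yields 32 separate one-byte stores. A minimal sketch of that pattern, assuming zeroize clears each element with a volatile write (`zero_bytewise` is a hypothetical name, not part of the crate):

```rust
use core::ptr;
use core::sync::atomic::{compiler_fence, Ordering};

/// Hypothetical helper: clears the buffer one byte at a time.
/// Each iteration is its own volatile store, which the compiler
/// is not expected to combine into wider stores.
pub fn zero_bytewise(buf: &mut [u8; 32]) {
    for byte in buf.iter_mut() {
        // SAFETY: `byte` is a valid, aligned, exclusive reference to a u8.
        unsafe { ptr::write_volatile(byte, 0) };
    }
    // Keep the compiler from reordering later memory accesses before the stores.
    compiler_fence(Ordering::SeqCst);
}
```

Inspecting the assembly generated for this function in release mode should show the same one-store-per-byte shape described above.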
Using inline assembly (just stabilized in Rust 1.59) and SSE2, zeroing a [u8; 32] takes just 3 instructions and ~492.87 ps (~16 bytes per cycle):
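let mut buf = [0xFFu8; 32];
// SAFETY: `buf` is 32 writable bytes, and `movups` has no alignment requirement.
unsafe {
    core::arch::asm!(
        "xorps {zero}, {zero}",      // zero a scratch XMM register
        "movups {zero}, ({ptr})",    // store bytes 0..16
        "movups {zero}, 16({ptr})",  // store bytes 16..32
        zero = out(xmm_reg) _,
        ptr = in(reg) buf.as_mut_ptr(),
        options(att_syntax, nostack, preserves_flags),
    );
}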
So it might be something worth optimizing/documenting.
If you do not want to use inline assembly, maybe you should encourage using larger element types or SIMD types, e.g., [u64; 4] or [__m128; 2], instead of [u8; 32]. Using write_volatile on *mut __m128 generates code as compact and efficient as the assembly code above.
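To make the suggestion concrete, here is a rough sketch of volatile zeroing over the wider element types. The helper names `zero_words` and `zero_lanes` are hypothetical and not part of zeroize; the SIMD variant is gated to x86_64, where SSE is part of the baseline.

```rust
use core::ptr;
use core::sync::atomic::{compiler_fence, Ordering};

/// Hypothetical helper: four 8-byte volatile stores instead of 32 one-byte stores.
pub fn zero_words(buf: &mut [u64; 4]) {
    for word in buf.iter_mut() {
        // SAFETY: `word` is a valid, aligned, exclusive reference to a u64.
        unsafe { ptr::write_volatile(word, 0) };
    }
    compiler_fence(Ordering::SeqCst);
}

/// Hypothetical helper: two 16-byte volatile stores through `*mut __m128`.
#[cfg(target_arch = "x86_64")]
pub fn zero_lanes(buf: &mut [core::arch::x86_64::__m128; 2]) {
    use core::arch::x86_64::_mm_setzero_ps;
    for lane in buf.iter_mut() {
        // SAFETY: `lane` is a valid, 16-byte-aligned, exclusive reference,
        // and SSE is always available on x86_64.
        unsafe { ptr::write_volatile(lane, _mm_setzero_ps()) };
    }
    compiler_fence(Ordering::SeqCst);
}
```

Note that the storage itself has to be declared with the wide type. Casting a pointer to a [u8; 32] into *mut __m128 is not sound in general, because write_volatile requires an aligned destination and a byte array is only guaranteed 1-byte alignment.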