Hi folks. I chatted a bit on IRC, and folks there didn't think this was an obvious dup, so I'm reporting it here.
I'm using `write_all` to push some bytes into a buffer. If I do this in-line, all goes well performance-wise (memcpy speeds, about 50 GB/s on my machine). If I put it in a method, even with an `#[inline(always)]` attribute, it drops to about 1 GB/s (and the assembly looks like a loop doing something).
The problem goes away if I don't push the leading 24 bytes with `write_all`. That is: if I don't push them at all, great! If I call `push(0u8);` 24 times instead, also great! Something about the presence of the preceding `write_all` seems to tank the performance of the second (large) `write_all`. If I push 32 bytes instead (i.e. use `&[0u8; 32]`), the problem goes away as well (quadword alignment?).
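For side-by-side comparison, here is a sketch of the header variants described above as standalone functions. The function names are hypothetical, and writing the 24 `push` calls as a loop is an assumption about how they were written. The sketch only checks that the variants produce the same bytes; it does not measure or reproduce the speed difference.

```rust
use std::io::Write;

// Reported slow when called as a method: two consecutive `write_all` calls.
#[inline(always)]
fn header_then_body(body: &[u8], out: &mut Vec<u8>) {
    out.write_all(&[0u8; 24]).unwrap();
    out.write_all(body).unwrap();
}

// Reported fast: the 24-byte header pushed one byte at a time.
#[inline(always)]
fn pushes_then_body(body: &[u8], out: &mut Vec<u8>) {
    for _ in 0..24 {
        out.push(0u8);
    }
    out.write_all(body).unwrap();
}

// Reported fast: a 32-byte header instead of a 24-byte one.
#[inline(always)]
fn header32_then_body(body: &[u8], out: &mut Vec<u8>) {
    out.write_all(&[0u8; 32]).unwrap();
    out.write_all(body).unwrap();
}

fn main() {
    let body = vec![1u8; 1 << 20];

    let mut a = Vec::new();
    let mut b = Vec::new();
    header_then_body(&body, &mut a);
    pushes_then_body(&body, &mut b);
    // The first two variants write identical bytes; only their speed reportedly differs.
    assert_eq!(a, b);

    let mut c = Vec::new();
    header32_then_body(&body, &mut c);
    assert_eq!(c.len(), 32 + body.len());
}
```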
But there never seems to be a problem with the manually inlined code; it always goes nice and fast.
```rust
// Uses `time::precise_time_ns` from the `time` 0.1 crate.
extern crate time;

use std::io::Write;

fn main() {
    let dataz = vec![0u8; 1 << 20];
    let mut bytes = Vec::new();
    let rounds = 1_000;
    let start = time::precise_time_ns();
    for _ in 0..rounds {
        bytes.clear();

        // these two: "average time: 81135"
        // bytes.write_all(&[0u8; 24]).unwrap();
        // bytes.write_all(&dataz[..]).unwrap();

        // this one: "average time: 530736"
        test(&dataz, &mut bytes)
    }
    println!("average time: {:?}", (time::precise_time_ns() - start) / rounds);
}

#[inline(always)]
fn test(typed: &Vec<u8>, bytes: &mut Vec<u8>) {
    // comment first line out to go fast!
    // weirdly, to me: if you replace the first line with 24x `bytes.push(0u8)` you get good performance.
    bytes.write_all(&[0u8; 24]).unwrap();
    bytes.write_all(&typed[..]).unwrap();
}
```
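A sketch of the same comparison without the external `time` crate, assuming Rust 1.66 or later for `std::time::Instant` plus `std::hint::black_box` (used here so the optimizer cannot discard the buffer writes). It runs both cases in one binary; `test` takes `&[u8]` rather than `&Vec<u8>`, which is an assumption and not the original signature. Whether the gap still shows up depends on the compiler version, so no particular timings are implied.

```rust
use std::hint::black_box;
use std::io::Write;
use std::time::Instant;

#[inline(always)]
fn test(typed: &[u8], bytes: &mut Vec<u8>) {
    bytes.write_all(&[0u8; 24]).unwrap();
    bytes.write_all(typed).unwrap();
}

fn main() {
    let dataz = vec![0u8; 1 << 20];
    let mut bytes: Vec<u8> = Vec::new();
    let rounds: u32 = 1_000;

    // Case 1: the two `write_all` calls written directly in the loop body.
    let start = Instant::now();
    for _ in 0..rounds {
        bytes.clear();
        bytes.write_all(&[0u8; 24]).unwrap();
        bytes.write_all(&dataz[..]).unwrap();
        black_box(&bytes);
    }
    let inline_ns = start.elapsed().as_nanos() / u128::from(rounds);

    // Case 2: the same calls through the #[inline(always)] helper.
    let start = Instant::now();
    for _ in 0..rounds {
        bytes.clear();
        test(&dataz, &mut bytes);
        black_box(&bytes);
    }
    let method_ns = start.elapsed().as_nanos() / u128::from(rounds);

    println!("in-line: {} ns/round, method: {} ns/round", inline_ns, method_ns);
}
```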
Edit: reproduces on stable, beta, and nightly.