Closed
Context: I recently saw a perf regression caused by the following change:
pub fn pixels_rgba(&self) -> Vec<u8> {
    let mut output = Vec::new();
    for p in &self.pixels {
        output.extend_from_slice(&[p.red(), p.green(), p.blue(), p.alpha()]);
    }
    output
}
->
pub fn pixels_rgba(&self) -> Vec<u8> {
    self.pixels.iter().flat_map(|p| [p.red(), p.green(), p.blue(), p.alpha()]).collect()
}
The change was reasonable, made on the assumption that it would be more idiomatic and would guarantee that the output Vec is preallocated. Unfortunately, it made the function several times slower. The recently merged #87168 helped here slightly, but the generated code is still much slower.
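For anyone who wants to reproduce the comparison outside the original codebase, here is a minimal self-contained sketch. The Pixel and Image types are hypothetical stand-ins (the real types are not shown in this report), and the loop variant uses Vec::with_capacity to make the preallocation explicit. It only checks that the two versions produce the same bytes; it does not measure anything.

```rust
// Hypothetical stand-in for the real pixel type, which is not shown in the report.
#[derive(Clone, Copy)]
struct Pixel([u8; 4]);

impl Pixel {
    fn red(&self) -> u8 { self.0[0] }
    fn green(&self) -> u8 { self.0[1] }
    fn blue(&self) -> u8 { self.0[2] }
    fn alpha(&self) -> u8 { self.0[3] }
}

// Hypothetical container; only the `pixels` field name comes from the report.
struct Image {
    pixels: Vec<Pixel>,
}

impl Image {
    // Explicit loop, with the output buffer preallocated up front.
    fn pixels_rgba_loop(&self) -> Vec<u8> {
        let mut output = Vec::with_capacity(self.pixels.len() * 4);
        for p in &self.pixels {
            output.extend_from_slice(&[p.red(), p.green(), p.blue(), p.alpha()]);
        }
        output
    }

    // Iterator version: each pixel becomes a 4-element array that flat_map flattens.
    fn pixels_rgba_flat_map(&self) -> Vec<u8> {
        self.pixels
            .iter()
            .flat_map(|p| [p.red(), p.green(), p.blue(), p.alpha()])
            .collect()
    }
}

fn main() {
    let image = Image {
        pixels: vec![Pixel([1, 2, 3, 4]), Pixel([5, 6, 7, 8])],
    };
    // Both versions are expected to yield identical output.
    assert_eq!(image.pixels_rgba_loop(), image.pixels_rgba_flat_map());
    println!("{:?}", image.pixels_rgba_flat_map());
}
```

The two methods are functionally equivalent, so any timing difference between them comes from how the iterator chain is compiled, not from a difference in the work requested.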
The same can also be observed in other code using flat_map or flatten, such as simple iteration over the resulting iterator, regardless of whether the flattened type has a known size (array) or not.
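As a rough illustration of the "simple iteration" case, the sketch below sums elements with a nested loop and with flatten, for both a fixed-size inner type (array) and an inner type of unknown size (Vec). The function names are made up for this example and are not taken from the benchmark repo.

```rust
// Nested loop over fixed-size rows.
fn sum_nested_loop(rows: &[[u32; 4]]) -> u32 {
    let mut total = 0;
    for row in rows {
        for &value in row {
            total += value;
        }
    }
    total
}

// Same computation through flatten; the inner type has a known size (array).
fn sum_flatten_arrays(rows: &[[u32; 4]]) -> u32 {
    rows.iter().flatten().sum()
}

// Same computation through flatten; the inner type has no statically known size (Vec).
fn sum_flatten_vecs(rows: &[Vec<u32>]) -> u32 {
    rows.iter().flatten().sum()
}

fn main() {
    let arrays = [[1, 2, 3, 4], [5, 6, 7, 8]];
    let vecs = vec![vec![1, 2, 3, 4], vec![5, 6, 7, 8]];

    // All three variants compute the same total.
    assert_eq!(sum_nested_loop(&arrays), sum_flatten_arrays(&arrays));
    assert_eq!(sum_nested_loop(&arrays), sum_flatten_vecs(&vecs));
    println!("total = {}", sum_nested_loop(&arrays));
}
```

All variants return the same result, which is why the gap between the nested-loop and flatten timings is surprising.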
I made an example benchmark in repo https://github.com/adrian17/flat_map_perf , with the following results on my machine (with today's nightly rustc):
tests::bench_array_4x500000_collect_loop 1,269,560 ns/iter (+/- 146,537)
tests::bench_array_4x500000_collect_loop_with_prealloc 1,255,140 ns/iter (+/- 165,287)
tests::bench_array_4x500000_collect_with_flat_map 2,697,082 ns/iter (+/- 303,411)
tests::bench_array_4x500000_iteration_nested_loop 220,838 ns/iter (+/- 25,307)
tests::bench_array_4x500000_iteration_flat_map 3,029,744 ns/iter (+/- 463,749)
tests::bench_iter_4000x500_collect_loop 243,537 ns/iter (+/- 34,574)
tests::bench_iter_4000x500_collect_loop_with_prealloc 243,246 ns/iter (+/- 34,197)
tests::bench_iter_4000x500_collect_with_flatten 3,521,586 ns/iter (+/- 597,755)
tests::bench_iter_4000x500_iteration_nested_loop 290,939 ns/iter (+/- 34,414)
tests::bench_iter_4000x500_iteration_flatten 2,099,386 ns/iter (+/- 512,732)
tests::bench_iter_4x500000_collect_loop 3,124,601 ns/iter (+/- 444,296)
tests::bench_iter_4x500000_collect_loop_with_prealloc 2,873,051 ns/iter (+/- 576,719)
tests::bench_iter_4x500000_collect_with_flatten 5,579,601 ns/iter (+/- 796,355)
tests::bench_iter_4x500000_iteration_nested_loop 2,118,351 ns/iter (+/- 396,325)
tests::bench_iter_4x500000_iteration_flatten 3,187,518 ns/iter (+/- 443,080)
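To put the numbers in perspective, the slowdown factors can be derived directly from the ns/iter values listed above. The snippet below is just that arithmetic for three of the pairs; the helper name slowdown is made up for this example, and the factors only describe this one machine and run.

```rust
// Ratio of the flat_map/flatten timing to the baseline loop timing.
fn slowdown(baseline_ns: u64, flattened_ns: u64) -> f64 {
    flattened_ns as f64 / baseline_ns as f64
}

fn main() {
    // (label, baseline ns/iter, flat_map or flatten ns/iter), copied from the results above.
    let pairs = [
        ("array_4x500000 iteration", 220_838, 3_029_744),
        ("array_4x500000 collect (vs prealloc loop)", 1_255_140, 2_697_082),
        ("iter_4000x500 collect (vs prealloc loop)", 243_246, 3_521_586),
    ];
    for (label, baseline, flattened) in pairs {
        println!("{label}: {:.1}x slower", slowdown(baseline, flattened));
    }
}
```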