Improve performance of `extend_from_slice` where `T: Copy` #235

zslayton · 2024-02-17T16:21:40Z

When using a `Vec<'_, u8>`, I was surprised to see that `my_vec.extend_from_slice("SUCCESS".as_bytes());` produced assembly that looks like this on x86_64:

(macOS 14.3 Sonoma, Intel processor, rustc v1.76.0)

Notice that it's performing a reserve, error check, and push for each byte in the input slice. I had expected LLVM to reduce this to something like a single reserve, error check, and memcpy for the entire slice.

I'd like to contribute a change analogous to the one recently merged in #229, but for `&[u8]` instead of `&str`. Before I get started, I wanted to check 1) whether you'd be interested in merging such an optimization and 2) what the API should look like (since we don't have specialization to customize `extend_from_slice` for `Vec<'_, u8>`).
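To make the difference concrete, here is a minimal sketch of the two code shapes described above. It uses `std::vec::Vec` rather than bumpalo's `Vec` so that it compiles on its own, and the function names `extend_per_element` and `extend_bulk_copy` are hypothetical, chosen only for illustration. It is not the implementation that was eventually merged.

```rust
use std::ptr;

/// Per-element shape: every `push` performs its own capacity check
/// (and possibly a grow) before writing a single element.
fn extend_per_element<T: Copy>(dst: &mut Vec<T>, src: &[T]) {
    for &item in src {
        dst.push(item);
    }
}

/// Bulk shape: one `reserve` for the whole slice, then a single
/// memcpy-style copy. This is only valid because `T: Copy`.
fn extend_bulk_copy<T: Copy>(dst: &mut Vec<T>, src: &[T]) {
    dst.reserve(src.len());
    let old_len = dst.len();
    // SAFETY:
    // - `reserve` guarantees capacity for `old_len + src.len()` elements.
    // - `src` cannot overlap `dst`'s buffer, since `dst` is borrowed mutably.
    // - `T: Copy`, so a bitwise copy produces valid values.
    unsafe {
        ptr::copy_nonoverlapping(src.as_ptr(), dst.as_mut_ptr().add(old_len), src.len());
        dst.set_len(old_len + src.len());
    }
}
```

Both functions leave the vector with the same contents. The difference is only in how many capacity checks and branches the compiler has to emit or eliminate.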


overlookmotel · 2024-02-19T12:20:08Z

@zslayton Side question: What did you use to produce the visualization of assembly branch structure above? It's really nice!

Please note, we actually saw a very slight performance degradation from #229 in OXC (oxc-project/oxc#2417). I believe this is because OXC only deals with really short strings, where copying byte-by-byte (presumably inlined) is actually faster than a call to `copy_nonoverlapping`. I don't know whether this should be addressed, as it's probably an unusual case (`std` does not address it). I should have made a benchmark for short strings as well as long ones.

zslayton · 2024-02-19T14:38:07Z

> What did you use to produce the visualization of assembly branch structure above?

I used Cutter, an open source reverse engineering tool. It's very helpful!

> Please note, we actually saw a very slight performance degradation from #229 in OXC oxc-project/oxc#2417. I believe this is due to OXC only dealing with really short strings, where copying byte-by-byte (presumably inlined) is actually faster than a call to copy_nonoverlapping. I don't know if this should be addressed or not, as it's probably an unusual case (std does not address it). I should have made a benchmark for short strings as well as long.

That's good food for thought, thank you for raising it. I'll defer to @fitzgen as to whether it needs to be addressed. My use case is a mix of short strings and byte arrays that typically range from empty to several kilobytes. The change in #236 produced a very large performance increase in all of my existing benchmarks, but I didn't closely examine the performance of small slices.

zslayton · 2024-02-19T16:00:54Z

@overlookmotel I ended up extending the benchmark to test a variety of input sizes. You can see the results here.
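For readers who want to try a size sweep of their own, a rough sketch follows. It is not the benchmark referred to above: it uses `std::vec::Vec` and `std::time::Instant` instead of a proper benchmarking harness, the helper names (`time_extend`, `per_byte`, `bulk`, `report`) and the list of sizes are assumptions made for illustration, and each timed iteration includes the cost of allocating a fresh vector.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical helper: times `iters` runs of `extend`, each appending
/// a `size`-byte input to a freshly created vector.
fn time_extend(size: usize, iters: u32, extend: fn(&mut Vec<u8>, &[u8])) -> Duration {
    let input = vec![0xAB_u8; size];
    let start = Instant::now();
    for _ in 0..iters {
        let mut dst = Vec::new();
        // `black_box` discourages the optimizer from removing the work.
        extend(&mut dst, black_box(input.as_slice()));
        black_box(&dst);
    }
    start.elapsed()
}

/// Byte-by-byte append, one `push` per element.
fn per_byte(dst: &mut Vec<u8>, src: &[u8]) {
    for &b in src {
        dst.push(b);
    }
}

/// Bulk append through the standard library's `extend_from_slice`.
fn bulk(dst: &mut Vec<u8>, src: &[u8]) {
    dst.extend_from_slice(src);
}

/// Returns `(size, per-byte time, bulk time)` for a range of input sizes,
/// from empty up to a few kilobytes.
fn report(iters: u32) -> Vec<(usize, Duration, Duration)> {
    [0usize, 4, 16, 64, 1024, 4096]
        .iter()
        .map(|&size| {
            (
                size,
                time_extend(size, iters, per_byte),
                time_extend(size, iters, bulk),
            )
        })
        .collect()
}
```

Wall-clock timings like these are noisy and only meaningful in release builds, so treat any numbers as a rough indication of where the crossover between the two approaches might sit, not as a measurement.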

zslayton · 2024-02-21T18:44:45Z

This was fixed by #236.

zslayton mentioned this issue Feb 18, 2024

Provides implementation of Vec::extend_from_slice optimized for T: Copy #236

Merged

zslayton closed this as completed Feb 21, 2024

zslayton mentioned this issue Feb 21, 2024

Modifies RawVec reserve fn structure to improve inlining #239

Merged
