Improve performance of extend_from_slice where T: Copy #235

Closed
zslayton opened this issue Feb 17, 2024 · 4 comments

Comments

@zslayton
Contributor

When using a `Vec<'_, u8>`, I was surprised to see that `my_vec.extend_from_slice("SUCCESS".as_bytes());` produced assembly that looks like this on x86_64:

[Image: visualization of the branch structure of the generated assembly]

(macOS 14.3 Sonoma, Intel processor, rustc v1.76.0)

Notice that it's performing a reserve, error check, and push for each byte in the input slice. I had expected LLVM to reduce this to something like a single reserve, error check, and memcpy for the entire slice.
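To make the difference concrete, here is a minimal sketch of the two strategies. It uses `std::vec::Vec` as a stand-in for the arena-backed `Vec<'_, u8>`, and the function names `extend_per_element` and `extend_bulk` are illustrative assumptions, not part of any library:

```rust
/// Per-element strategy: one reserve (with its capacity check) and one
/// push for every item in the input slice.
fn extend_per_element<T: Copy>(dst: &mut Vec<T>, src: &[T]) {
    for &item in src {
        dst.reserve(1);
        dst.push(item);
    }
}

/// Bulk strategy: a single reserve, a single memcpy-style copy, and a
/// single length update for the whole slice.
fn extend_bulk<T: Copy>(dst: &mut Vec<T>, src: &[T]) {
    dst.reserve(src.len());
    let old_len = dst.len();
    unsafe {
        // SAFETY: `reserve` guarantees room for `src.len()` more elements,
        // `src` cannot alias `dst` because `dst` is mutably borrowed, and
        // `T: Copy` means a bitwise copy yields valid values.
        std::ptr::copy_nonoverlapping(
            src.as_ptr(),
            dst.as_mut_ptr().add(old_len),
            src.len(),
        );
        dst.set_len(old_len + src.len());
    }
}
```

Both functions leave `dst` with the same contents; only the number of capacity checks and the shape of the copy differ.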

I'd like to contribute a change analogous to the one recently merged in #229, but for `&[u8]` instead of `&str`. Before I get started, I wanted to check 1) whether you'd be interested in merging such an optimization and 2) what the API should look like (since we don't have specialization to customize `extend_from_slice` for `Vec<'_, u8>`).
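One possible shape for the API, sketched under assumptions: without specialization, the `T: Copy` fast path can live in a separately named method next to the general `extend_from_slice`. The trait name `ExtendFromSliceCopy` and the method name `extend_from_slice_copy` below are hypothetical, and `std::vec::Vec` stands in for the arena-backed vector:

```rust
/// Hypothetical extension trait: a distinct method name sidesteps the
/// need to specialize `extend_from_slice` on `T: Copy`.
trait ExtendFromSliceCopy<T: Copy> {
    fn extend_from_slice_copy(&mut self, src: &[T]);
}

impl<T: Copy> ExtendFromSliceCopy<T> for Vec<T> {
    fn extend_from_slice_copy(&mut self, src: &[T]) {
        // One capacity check for the whole slice.
        self.reserve(src.len());
        let old_len = self.len();
        unsafe {
            // SAFETY: capacity was reserved above, `src` cannot overlap
            // the mutably borrowed `self`, and `T: Copy` permits a
            // bitwise copy.
            std::ptr::copy_nonoverlapping(
                src.as_ptr(),
                self.as_mut_ptr().add(old_len),
                src.len(),
            );
            self.set_len(old_len + src.len());
        }
    }
}
```

Callers with a `Copy` element type opt in explicitly, for example `my_vec.extend_from_slice_copy("SUCCESS".as_bytes())`, while `extend_from_slice` keeps working for any `T: Clone`.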

@overlookmotel
Contributor

overlookmotel commented Feb 19, 2024

@zslayton Side question: What did you use to produce the visualization of assembly branch structure above? It's really nice!

Please note, we actually saw a very slight performance degradation from #229 in OXC (oxc-project/oxc#2417). I believe this is due to OXC only dealing with really short strings, where copying byte-by-byte (presumably inlined) is actually faster than a call to `copy_nonoverlapping`. I don't know if this should be addressed or not, as it's probably an unusual case (std does not address it). I should have made a benchmark for short strings as well as long ones.

@zslayton
Contributor Author

> What did you use to produce the visualization of assembly branch structure above?

I used Cutter, an open source reverse engineering tool. It's very helpful!

> Please note, we actually saw a very slight performance degradation from #229 in OXC (oxc-project/oxc#2417). I believe this is due to OXC only dealing with really short strings, where copying byte-by-byte (presumably inlined) is actually faster than a call to `copy_nonoverlapping`. I don't know if this should be addressed or not, as it's probably an unusual case (std does not address it). I should have made a benchmark for short strings as well as long ones.

That's good food for thought, thank you for raising it--I'll defer to @fitzgen as to whether it needs to be addressed. My use case is a mix of short strings and byte arrays that are anywhere from empty to several kilobytes in typical scenarios. The change in #236 was a very large performance increase in all of my existing benchmarks, but I didn't closely examine the performance of small slices.

@zslayton
Contributor Author

@overlookmotel I ended up extending the benchmark to test a variety of input sizes. You can see the results here.
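For readers who want to try a similar size sweep, here is a rough sketch using only the standard library. It is not the benchmark from this thread: it times `std::vec::Vec::extend_from_slice`, the helper names `time_extend` and `run_size_sweep` are assumptions, and a real measurement would be better served by a proper benchmarking harness.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Times `iterations` appends of a `size`-byte slice into a fresh Vec.
/// Note: the measured time includes the Vec's allocation.
fn time_extend(size: usize, iterations: u32) -> Duration {
    let input = vec![0xABu8; size];
    let start = Instant::now();
    for _ in 0..iterations {
        let mut out: Vec<u8> = Vec::new();
        // black_box keeps the optimizer from eliding the copy.
        out.extend_from_slice(black_box(input.as_slice()));
        black_box(&out);
    }
    start.elapsed()
}

/// Runs the timing for each input size and pairs it with the result.
fn run_size_sweep(sizes: &[usize], iterations: u32) -> Vec<(usize, Duration)> {
    sizes
        .iter()
        .map(|&size| (size, time_extend(size, iterations)))
        .collect()
}
```

Calling `run_size_sweep(&[0, 7, 64, 4096], 100_000)` would cover the empty, short, and multi-kilobyte cases mentioned earlier in the thread.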

@zslayton
Contributor Author

This was fixed by #236.
