Optimize for size in release builds #5909
Conversation
Is `lto = "thin"` an option, by the way? Where does it land w.r.t. time and size?
Didn't test it, only now realized that it is an option.
CodSpeed Performance Report

Merging #5909 will improve performance by 15.12%.

Benchmarks breakdown
@bluss Didn't fully test the speed of using `lto = "thin"` yet, but the binary size is 31MB.
I would be somewhat skeptical of using `opt-level = "s"`. IMO, if we're looking to decrease binary size, we should be chasing other avenues first, like changes to the code itself that decrease bloat. We have barely done any of that, and I expect there are probably a lot of low-hanging fruits.
A quick look at https://doc.rust-lang.org/cargo/reference/profiles.html#opt-level does mention that it will also disable loop vectorization. I will try it out.

I am not too familiar with uv's code, but if there is anywhere you think I should be looking in particular, let me know and I will explore in that direction!
I would look for uses of generics and either remove them, limit the scope of their monomorphization or switch to `dyn` trait objects.
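A common way to limit the scope of monomorphization is the "generic shim, non-generic inner function" pattern. A minimal sketch of the idea (the function names here are hypothetical, not code from uv):

```rust
use std::path::{Path, PathBuf};

// Generic shim: this tiny wrapper is monomorphized once per caller type,
// but it immediately delegates to a non-generic inner function, so the
// real body is compiled exactly once regardless of how many `P` types
// callers pass in.
fn file_name_of<P: AsRef<Path>>(path: P) -> String {
    fn inner(path: &Path) -> String {
        path.file_name()
            .map(|n| n.to_string_lossy().into_owned())
            .unwrap_or_default()
    }
    inner(path.as_ref())
}

fn main() {
    // Both call sites share one compiled copy of `inner`.
    assert_eq!(file_name_of("dist/uv-0.2.34.whl"), "uv-0.2.34.whl");
    assert_eq!(file_name_of(PathBuf::from("/tmp/report.txt")), "report.txt");
    println!("ok");
}
```

The `dyn` alternative is similar in spirit: taking `&dyn Trait` instead of `impl Trait` trades a vtable call for a single compiled copy of the function body.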
The other dimension to look at is dependencies. We are not especially conservative about bringing in new dependencies. I wouldn't be surprised if there were some we could pluck out, either with a little extra code on our part or just different feature configurations.
It seems to be mostly on par with using it. Thanks @BurntSushi for all the helpful information, will look into it!
Ok, really confused right now, but benchmarks now report a performance increase? Huh? That's unexpected 🤔
Did some reading on the flags, and it seems like optimizing for speed (levels 2 and 3) only really does more aggressive inlining and vectorization, as well as loop unrolling. Otherwise, optimizing for size applies the same optimizations as the speed levels (except for the ones outlined before). So I guess it's a good thing to enable by default, and it makes sense that the benchmarks do not report any slowdowns (the speedups are interesting tho). I think this is fine to set as ready for merge!

Reference: https://docs.rust-embedded.org/book/unsorted/speed-vs-size.html#optimize-for-speed

On another note, I found the point raised by BurntSushi regarding build times to be quite important, so I might have a look tomorrow and open another PR/issue if I find anything interesting to share :)
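For anyone wanting to compare the two settings side by side locally, Cargo's custom profiles make this easy to express without touching the stock release profile. A minimal sketch (the profile name `release-size` is my own choice, not something used in this PR):

```toml
# Hypothetical extra profile for size-vs-speed comparisons.
[profile.release-size]
inherits = "release"
opt-level = "s"   # optimize for size; "z" is smaller still but disables more

# Build with: cargo build --profile release-size
# Stock build for comparison: cargo build --release
```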
Our CI benchmarks don't have the best coverage at the moment. We'll want to run some of the hyperfine benchmarks locally to test the effects of this. My prior here is that we should have our release profile favor speed.
Would be happy to do so. The benchmarks I attached to the description of the issue seem to indicate no drastic slowdowns, but if you point me at which benchmarks to run, I'll run them and post my findings here.
Vectorization is also something I'll keep an eye on. Will report back!
Would like to add this information to the discussion: it seems that in rustc 1.18.0+, using `opt-level = "s"` is supported. From what I can gather, optimizing for space with "s" outputs an executable that is almost 2x smaller than the previous ones, while still staying mostly as performant as before. I would be interested in running proper benchmarks to actually see what the performance cost would be going from O3 + LTO to O2 + size + LTO (and playing around with flags to see what gives the best results).
https://github.com/johnthagen/min-sized-rust is also an often-recommended reference, but I agree with @BurntSushi that there might be other areas to look at first.
I'm going to close this for now since we're not actively exploring it. |
Summary
This is a bit of a follow-up to #5904, which I did without properly testing this idea, as I expected it to provide far worse results than it actually did.
Apart from LTO, I also wanted to test setting `opt-level` to optimize for size (`"s"`). It wasn't the first test I did because I expected LTO to be more of an overall gain, and optimizing for size not so much. It has still yielded good results in the limited speed testing I have done (which I was afraid it would not), and I wanted to share these here.

The main idea behind these PRs is to slow down the need for PyPI size increases (pypi/support#4260) and overall just be more efficient where possible :)
Marking it as a draft for now to enable discussion (as it seems like this is a hot topic from what I have seen in the ruff repo and #5904).
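For context, the kind of `Cargo.toml` profile change being discussed looks roughly like this (a sketch only; the exact values in the diff may differ):

```toml
[profile.release]
lto = true        # fat LTO, from #5904
opt-level = "s"   # optimize for size instead of the default level 3
```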
Test Plan
Git commit for each of the executables:

- uv-main: acbd367
- uv-lto: cac3c4d
- uv-size: 50efcde
- uv-size-no-lto: cde0aa1 with cac3c4d reverted

System info:
Binary Size
Command run: `cargo build --release`, then getting the binaries from `target/release/uv`.

Conclusion
Drastic improvement in size by enabling both LTO and optimizing for size!
Wheel Size
Command run: `uvx maturin build --release`, then getting the wheels from `target/wheels/uv-0.2.34-py3-none-linux_x86_64.whl`.

Conclusion
Drastic improvement in wheel size too, by enabling both LTO and optimizing for size!
Executable Speed
uv-lto:
uv-main:
uv-size-no-lto:
uv-size:
Result
I am not entirely sure why `uv-lto` is slightly slower than `uv-main` even though the benchmarks say otherwise (I re-ran the benchmark multiple times), but I will leave it to CodSpeed to determine that by properly running the benchmarks. What I was aiming to show with these benchmarks is that optimizing for size doesn't actually slow down uv that much; the difference is what I would call "acceptable", considering how fast uv already is.

Build times
Massive thanks to @T-256 for bringing up astral-sh/ruff#9224. After reading through it and its sub-links, this comment by zanieb really made me realize that keeping build times from regressing too drastically is worth ensuring, so as not to make it harder for developers to get into the nitty gritty of optimizing (as I have also found out, lol).
Numbers are taken directly from the "Finished" message of `cargo build --release`. They are not real benchmarks (single runs only) and must be taken with a grain of salt, but they give a good indication of what is going on.
Cold run (after `cargo clean`)

Conclusions: Slight improvement in build times compared to LTO builds, but still half a minute slower than before enabling LTO.
Warm run

Didn't really have any good ideas on what a warm run would look like, so I haven't run these. If anybody knows a good way to do this, please do let me know and I will run them.