-
Notifications
You must be signed in to change notification settings - Fork 12.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable -Zshare-generics for inline(never) functions #123244
base: master
Are you sure you want to change the base?
Enable -Zshare-generics for inline(never) functions #123244
Conversation
@bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
…enerics, r=<try> Enable -Zshare-generics for inline(never) functions This avoids inlining cross-crate generic items when possible that are already marked inline(never), implying that the author is not intending for the function to be inlined by callers. As such, having a local copy may make it easier for LLVM to optimize but mostly just adds to binary bloat and codegen time (in theory, TBD on in practice). It might also make sense it expand this with other heuristics (e.g., #[cold]). FWIW, I expect that doing cleanup in where we make the decision what should/shouldn't be shared is also a good idea. Way too much code needed to be tweaked to check this. r? `@Mark-Simulacrum` for perf at first
This comment has been minimized.
This comment has been minimized.
☀️ Try build successful - checks-actions |
This comment has been minimized.
This comment has been minimized.
Finished benchmarking commit (1f2a5ec): comparison URL. Overall result: ❌✅ regressions and improvements - ACTION NEEDEDBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 667.994s -> 656.525s (-1.72%) |
5702d83
to
65a0301
Compare
It looks like the majority of the additional cost comes from additional indirection in calls to standard library functions, e.g., a diff like this: - f61064: e8 73 f4 49 03 call 44004dc <_RINvNtCsavh3npScQaX_5alloc7raw_vec11finish_growNtNtB4_5alloc6GlobalECs1UKmS5rlRwk_21rustc_trait_selection>
+ ffe57a: ff 15 a0 24 03 03 call *0x30324a0(%rip) # 4030a20 <_ZN5alloc7raw_vec11finish_grow17h78aea5cebcfaa28aE@Base> This means more work, particularly for short-lived compilations, since the symbol needs to get resolved at runtime now. That cost should be eliminated with #122362, which might take some time to land but is making progress now. Most downstream programs also don't pay it since they're not linking std dynamically. A little of the extra cost from there seems to be due to these non-inlined functions now being codegen'd with frame pointers (due to #122646). That doesn't seem like something worth worrying about, we accepted some regression from those changes already. My sense is that something like this is probably still a good idea despite the regressions. We do see good improvements in binary sizes (including >1MB of librustc_driver.so), and bootstrap times are significantly reduced. That suggests that this is a pretty good win for the larger crates while being a slight loss for smaller crates (instruction count timings are down for some of the larger primary crates as well, e.g., ripgrep and cranelift). That seems consistent with the loss due to additional indirection due to librustc_driver dynamically linking with the standard library. Going to mark as ready for review as such. r? compiler |
This comment was marked as resolved.
This comment was marked as resolved.
r? compiler |
cc #14527 |
Poking @wesleywiser as it's been ~3 weeks here. |
It's been another 3 weeks, and this looks really interesting. r? saethlin |
// This is generic, but it's only instantiated with a u32 argument and that instantiation is present | ||
// in the local crate (see F above). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this testing what we want? I was expecting to see a cross-crate call to a #[inline(never)]
generic function in a test, because I think the point of this PR is to change the behavior for such calls, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is indeed checking that behavior, though it's rather obscure. As per the comment just above ("These should not contribute...") we are implicitly checking that these functions are absent in the mono-items of the downstream crate (tests/codegen-units/partitioning/extern-generic.rs) since they're not listed as MONO-ITEM
declarations in that file. They are referenced though via the foo
function above which has to get codegen'd in the downstream crate since it is generic and never called in this one.
I suppose we could try to write a more direct test case? But this also tests that the property holds transitively (i.e., we don't codegen the called function twice with inline(never) even if it's not directly being called).
IIRC, before my changes, this would fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see! This is very interconnected, and I should really be used to that by now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I checked and yes the test does fail without the rest of this PR.
This comment has been minimized.
This comment has been minimized.
…enerics, r=<try> Enable -Zshare-generics for inline(never) functions This avoids inlining cross-crate generic items when possible that are already marked inline(never), implying that the author is not intending for the function to be inlined by callers. As such, having a local copy may make it easier for LLVM to optimize but mostly just adds to binary bloat and codegen time. In practice our benchmarks indicate this is indeed a win for larger compilations, where the extra cost in dynamic linking to these symbols is diminished compared to the advantages in fewer copies that need optimizing in each binary. It might also make sense it expand this with other heuristics (e.g., `#[cold]`) in the future, but this seems like a good starting point. FWIW, I expect that doing cleanup in where we make the decision what should/shouldn't be shared is also a good idea. Way too much code needed to be tweaked to check this. But I'm hoping to leave that for a follow-up PR rather than blocking this on it.
☀️ Try build successful - checks-actions |
This comment has been minimized.
This comment has been minimized.
Finished benchmarking commit (38d8997): comparison URL. Overall result: ❌✅ regressions and improvements - ACTION NEEDEDBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)Results (primary 1.5%, secondary 3.3%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResults (primary 2.3%, secondary 4.2%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeResults (primary -0.3%, secondary -2.1%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: missing data |
Co-authored-by: scottmcm <scottmcm@users.noreply.github.com>
@bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
…enerics, r=<try> Enable -Zshare-generics for inline(never) functions This avoids inlining cross-crate generic items when possible that are already marked inline(never), implying that the author is not intending for the function to be inlined by callers. As such, having a local copy may make it easier for LLVM to optimize but mostly just adds to binary bloat and codegen time. In practice our benchmarks indicate this is indeed a win for larger compilations, where the extra cost in dynamic linking to these symbols is diminished compared to the advantages in fewer copies that need optimizing in each binary. It might also make sense it expand this with other heuristics (e.g., `#[cold]`) in the future, but this seems like a good starting point. FWIW, I expect that doing cleanup in where we make the decision what should/shouldn't be shared is also a good idea. Way too much code needed to be tweaked to check this. But I'm hoping to leave that for a follow-up PR rather than blocking this on it.
☀️ Try build successful - checks-actions |
This comment has been minimized.
This comment has been minimized.
Finished benchmarking commit (29a8b2d): comparison URL. Overall result: ❌✅ regressions and improvements - ACTION NEEDEDBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)Results (primary -1.6%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResults (primary 2.9%, secondary 5.2%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeResults (primary -0.3%, secondary -2.1%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: missing data |
This list of affected benchmarks in terms of cycles is now way shorter, which is nice. I'm still in favor of this PR because I think it makes the behavior of The list of commits looks like this is still in a WIP state. I'm happy to approve this if you fix up the commits. |
Whoops I never changed the labels back to author. Did that now. |
You could do another perf run now that #122362 has landed. |
@bors try @rust-timer queue |
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf |
🔒 Merge conflict This pull request and the master branch diverged in a way that cannot be automatically merged. Please rebase on top of the latest master branch, and let the reviewer approve again. How do I rebase?Assuming
You may also read Git Rebasing to Resolve Conflicts by Drew Blessing for a short tutorial. Please avoid the "Resolve conflicts" button on GitHub. It uses Sometimes step 4 will complete without asking for resolution. This is usually due to difference between how Error message
|
@Mark-Simulacrum |
This avoids inlining cross-crate generic items when possible that are
already marked inline(never), implying that the author is not intending
for the function to be inlined by callers. As such, having a local copy
may make it easier for LLVM to optimize but mostly just adds to binary
bloat and codegen time. In practice our benchmarks indicate this is
indeed a win for larger compilations, where the extra cost in dynamic
linking to these symbols is diminished compared to the advantages in
fewer copies that need optimizing in each binary.
It might also make sense it expand this with other heuristics (e.g.,
#[cold]
) in the future, but this seems like a good starting point.FWIW, I expect that doing cleanup in where we make the decision
what should/shouldn't be shared is also a good idea. Way too
much code needed to be tweaked to check this. But I'm hoping
to leave that for a follow-up PR rather than blocking this on it.