-
Notifications
You must be signed in to change notification settings - Fork 13.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stop peeling the last iteration of the loop in Vec::resize_with
#104818
Conversation
@bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
⌛ Trying commit a8954f1 with merge 8575d87094db626f2ff3e7a4b8d1af877b9a3095... |
☀️ Try build successful - checks-actions |
This comment has been minimized.
This comment has been minimized.
Finished benchmarking commit (8575d87094db626f2ff3e7a4b8d1af877b9a3095): comparison URL. Overall result: ❌✅ regressions and improvements - ACTION NEEDEDBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
|
Those perf changes look quite tolerable to me. LLVM can't optimize out the split loop here today (https://rust.godbolt.org/z/b5KEc6r8G), leaving two callsites for the closure in the loop: So even if LLVM takes a bit longer to optimize now sometimes, that seems fine to me because it's plausible that it's from it being willing to inline more, for example, because more callsites (as it was before this PR) suppresses inlining. r? @the8472 |
Hey! It looks like you've submitted a new PR for the library teams! If this PR contains changes to any Examples of
|
Have you checked the new assembly? It switches from external to internal iteration which then goes through Take::try_fold of which I don't know how well it optimizes.
I'm a bit more concerned about Other than that it looks fine. |
Hmm, good call. I'd checked that it has the one callsite with a tight loop https://rust.godbolt.org/z/747xco7dq, but it looks like it's slightly more stack traffic that way -- something doesn't get fully register-promoted. I'll take a look and see if I can improve @rustbot author |
ptr = ptr.add(1); | ||
// Since the loop executes user code which can panic we have to bump the pointer | ||
// after each step. | ||
ptr::write(ptr.add(local_len.current_len()), element); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pointless, but seems like it reduces register pressure since we have to maintain the local_len
for panic safety anyway.
Before:
.LBB4_3:
mov qword ptr [rbp - 16], rax
mov qword ptr [rbp - 24], rcx ; Note Spill
call make_thing
mov rcx, qword ptr [rbp - 24]
mov dword ptr [rcx], eax
add rcx, 4
mov rax, qword ptr [rbp - 16]
dec rax
cmp rsi, rax
jne .LBB4_3
After:
.LBB6_3:
mov qword ptr [rbp - 16], rax
call make_thing
mov rdx, qword ptr [rbp - 16]
lea rcx, [rdx + 1]
mov dword ptr [rbx + 4*rdx], eax ; Note addressing mode
mov rax, rcx
dec rsi
jne .LBB6_3
(And LLVM understands that kind of loop very well too, since it's a for i in A..B { v[i] = foo(); }
loop that's common all over the place.)
@bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
⌛ Trying commit 9d68a1a with merge 9fa29ccfda5f8e07d329c703730ac4f591d738c4... |
☀️ Try build successful - checks-actions |
This comment has been minimized.
This comment has been minimized.
Finished benchmarking commit (9fa29ccfda5f8e07d329c703730ac4f591d738c4): comparison URL. Overall result: ❌✅ regressions and improvements - ACTION NEEDEDBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
|
Ah, there we go, that's much better. Net win on instructions for both primary and secondary, the scary 90% wall time hit to a secondary is gone, and the opt-full regression for @rustbot ready |
fn check<'a, Item>( | ||
mut action: impl FnMut(Item) + 'a, | ||
) -> impl FnMut(usize, Item) -> Option<usize> + 'a { | ||
move |more, x| { | ||
action(x); | ||
more.checked_sub(1) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A pity that the duplication-by-unused-generics issues still haven't been fixed 😮💨
@bors r+ rollup=never |
⌛ Testing commit 9d68a1a with merge b181d0623de0e9446e16accca83ee9d2eb736dd5... |
☀️ Test successful - checks-actions |
Finished benchmarking commit (faf1891): comparison URL. Overall result: ❌✅ regressions and improvements - ACTION NEEDEDNext Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
|
Perf changes are very small, and wins slightly outweigh losses. @rustbot label: +perf-regression-triaged |
resize_with
uses theExtendWith
code that peels the last iteration:rust/library/alloc/src/vec/mod.rs
Lines 2525 to 2529 in 341d8b8
But that's kinda weird for
ExtendFunc
because it does the same thing on the last iteration anyway:rust/library/alloc/src/vec/mod.rs
Lines 2494 to 2502 in 341d8b8
So this just has it use the normal
extend
-from-TrustedLen
code instead.r? @ghost