Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize IntRange::from_pat, then shrink ParamEnv #77257

Merged
merged 2 commits into from
Sep 29, 2020

Conversation

ecstatic-morse
Copy link
Contributor

@ecstatic-morse ecstatic-morse commented Sep 27, 2020

Resolves #77058.

r? @Mark-Simulacrum
cc @vandenheuvel

Looking at the output of perf report for #76244, the hot instructions seemed to be around the call to pat_constructor in IntRange::from_pat. I carried out an obvious optimization, but it actually made the instruction count higher (see #77075). However, it seems to have mitigated whatever was causing the pipeline stalls, so when combined with #76244, it's a net win.

As you can see below, the regression in #76244 seems to have originated from something measured by stalled-cycles-backend. I'll try to collect some finer-grained stats to see if I can isolate it. I wish I had a better idea of what was going on here. I'd like to prevent the regression from reappearing in the future due to small changes in unrelated code.

Current `master`:
 Performance counter stats for 'cargo +baseline-stage1 check':

          2,275.67 msec task-clock:u              #    0.998 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            49,826      page-faults:u             #    0.022 M/sec                  
     5,117,221,678      cycles:u                  #    2.249 GHz                    
       299,655,943      stalled-cycles-frontend:u #    5.86% frontend cycles idle   
     2,284,213,395      stalled-cycles-backend:u  #   44.64% backend cycles idle    
     8,051,871,959      instructions:u            #    1.57  insn per cycle         
                                                  #    0.28  stalled cycles per insn
     1,359,589,402      branches:u                #  597.447 M/sec                  
         7,359,347      branch-misses:u           #    0.54% of all branches        

       2.281030026 seconds time elapsed

       2.108197000 seconds user
       0.164183000 seconds sys
Shrink `ParamEnv` without changing `IntRange::from_pat`:
 Performance counter stats for 'cargo +perf-stage1 check':

          2,751.79 msec task-clock:u              #    0.996 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            50,103      page-faults:u             #    0.018 M/sec                  
     6,260,590,019      cycles:u                  #    2.275 GHz                    
       317,355,920      stalled-cycles-frontend:u #    5.07% frontend cycles idle   
     3,397,743,582      stalled-cycles-backend:u  #   54.27% backend cycles idle    
     8,276,224,367      instructions:u            #    1.32  insn per cycle         
                                                  #    0.41  stalled cycles per insn
     1,370,453,386      branches:u                #  498.023 M/sec                  
         7,281,031      branch-misses:u           #    0.53% of all branches        

       2.763265838 seconds time elapsed

       2.544578000 seconds user
       0.204548000 seconds sys
Shrink `ParamEnv` and change `IntRange::from_pat`:
 Performance counter stats for 'cargo +perf-stage1 check':

          2,295.57 msec task-clock:u              #    0.996 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            49,959      page-faults:u             #    0.022 M/sec                  
     5,151,407,066      cycles:u                  #    2.244 GHz                    
       324,517,829      stalled-cycles-frontend:u #    6.30% frontend cycles idle   
     2,301,671,001      stalled-cycles-backend:u  #   44.68% backend cycles idle    
     8,130,868,329      instructions:u            #    1.58  insn per cycle         
                                                  #    0.28  stalled cycles per insn
     1,356,618,512      branches:u                #  590.972 M/sec                  
         7,323,800      branch-misses:u           #    0.54% of all branches        

       2.304509653 seconds time elapsed

       2.128090000 seconds user
       0.163909000 seconds sys

Previously, this method called the more general `pat_constructor`
function, which can return other pattern variants besides `IntRange`.
Then it throws away any non-`IntRange` variants. Specialize it so work
is only done when it could result in an `IntRange`.
@ecstatic-morse
Copy link
Contributor Author

@bors try
@rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion

@bors
Copy link
Contributor

bors commented Sep 27, 2020

⌛ Trying commit c4d8089 with merge baa01e8429de2c041f0095c01d4e8ca99c0b11a7...

@bors
Copy link
Contributor

bors commented Sep 27, 2020

☀️ Try build successful - checks-actions, checks-azure
Build commit: baa01e8429de2c041f0095c01d4e8ca99c0b11a7 (baa01e8429de2c041f0095c01d4e8ca99c0b11a7)

@rust-timer
Copy link
Collaborator

Queued baa01e8429de2c041f0095c01d4e8ca99c0b11a7 with parent 1ec980d, future comparison URL.

@rust-timer
Copy link
Collaborator

Finished benchmarking try commit (baa01e8429de2c041f0095c01d4e8ca99c0b11a7): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never

@ecstatic-morse
Copy link
Contributor Author

ecstatic-morse commented Sep 27, 2020

(moved to summary)

Copy link
Member

@jackh726 jackh726 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM :) Super weird that this was the cause of the slowdown. (A large one at that!)

@vandenheuvel
Copy link
Contributor

Very interesting. Please let me know if I'm understanding your comment correctly: combining these two issues results in this PR having a net gain, but this doesn't address the issue underlying #76913? So this would not close #77058?

@ecstatic-morse
Copy link
Contributor Author

ecstatic-morse commented Sep 28, 2020

I'm not quite sure what you mean @vandenheuvel. I think I've gone as far as I care to. If you'd like to investigate further, feel free to keep #77058 open. Perhaps you can do even better.

Here's the "slow" assembly (ParamEnv shrunk with no other changes) in rustc_mir_build::thir::pattern::_match::PatStack::specialize_constructor() annotated via perf with the sum of StoreQueueRsrcStall and LoadQueueRsrcStall events (I'm on a Zen 2 Ryzen processor).

Excerpt from IntRange::from_pat (inlined)

  0.00        → callq  rustc_mir_build::thir::pattern::_match::pat_constructor
  0.00          mov    0xb0(%rsp),%cl
  0.02          mov    0xc0(%rsp),%r9
  0.20          mov    0xb8(%rsp),%rbp
  0.01          mov    0xd0(%rsp),%r13
  0.01          mov    0xc8(%rsp),%r8
  0.01          mov    0xd8(%rsp),%al
  0.01          mov    0xd9(%rsp),%edx
 32.37          mov    %edx,0x40(%rsp)
  0.00          mov    0xdc(%rsp),%edx
  0.00          mov    %edx,0x43(%rsp)
                cmp    $0x7,%cl 
jne    5a1      

Excerpt from (I think) IntRange::intersection (inlined)

         5a1:   mov    0x40(%rsp),%edx
 20.80          mov    0x43(%rsp),%esi
  0.03          mov    %esi,0x183(%rsp)
                mov    %edx,0x180(%rsp)
                mov    $0x1,%dl 
                cmp    $0x3,%cl 
jne    5d9      
                mov    0x180(%rsp),%ecx
  0.01          mov    0x183(%rsp),%edx
 23.64          mov    %edx,0x23(%rsp)
  0.12          mov    %ecx,0x20(%rsp)
                cmp    $0x2,%al 
                sete   %dl      
  0.07   5d9:   mov    0x20(%rsp),%eax
  0.01          mov    0x23(%rsp),%ecx
 18.20          mov    %ecx,0xb3(%rsp)
  0.01          mov    %eax,0xb0(%rsp)
                test   %dl,%dl  
  0.00jne    1546     
                mov    %r14,0x18(%rsp)
  0.00          mov    0x218(%rsp),%rax
  0.00          mov    (%rax),%rdi
  0.01          mov    0x30(%rbx),%rax
  0.01          mov    0x8(%rbx),%r14
  0.14          mov    0x10(%rbx),%rsi
                mov    0x20(%rbx),%rdx
                mov    0x18(%rbx),%r10
                movzbl (%rax),%ecx
                cmp    $0x2,%rcx
je     629      

We're repeatedly copying two words into local stack variables, then branching, then reading them again. This sequence has a lot of data hazards, which is clearly slowing down the pipeline significantly. I can't say why LLVM chose to generate code this way, or what causes the data hazards to disappear with my changes to IntRange::from_pat. You'd need someone with more LLVM expertise than I have. Perhaps building with debug-info-level=2 and stepping through it with a debugger would be helpful?

@simonvandel
Copy link
Contributor

I think the tag was meant for @vandenheuvel

@ecstatic-morse
Copy link
Contributor Author

ecstatic-morse commented Sep 28, 2020

I think the tag was meant for @vandenheuvel

Sorry!

@Mark-Simulacrum Mark-Simulacrum added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Sep 28, 2020
@Mark-Simulacrum
Copy link
Member

r=me unless you're waiting on something further, not quite sure.

@ecstatic-morse
Copy link
Contributor Author

@bors r=Mark-Simulacrum

@bors
Copy link
Contributor

bors commented Sep 28, 2020

📌 Commit c4d8089 has been approved by Mark-Simulacrum

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Sep 28, 2020
@bors
Copy link
Contributor

bors commented Sep 29, 2020

⌛ Testing commit c4d8089 with merge 48cab67...

@bors
Copy link
Contributor

bors commented Sep 29, 2020

☀️ Test successful - checks-actions, checks-azure
Approved by: Mark-Simulacrum
Pushing 48cab67 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Sep 29, 2020
@bors bors merged commit 48cab67 into rust-lang:master Sep 29, 2020
@rustbot rustbot added this to the 1.48.0 milestone Sep 29, 2020
@vandenheuvel
Copy link
Contributor

vandenheuvel commented Sep 30, 2020

@ecstatic-morse I don't have the expertise for further investigation either, unfortunately.

And why I asked whether #77058 would remain open: it seems that the root cause has not yet been found.

I wish I had a better idea of what was going on here. I'd like to prevent the regression from reappearing in the future due to small changes in unrelated code.

So to document that, it seems useful to keep #77058 open, perhaps after renaming it and adding another comment.

@ecstatic-morse
Copy link
Contributor Author

ecstatic-morse commented Sep 30, 2020

I don't think there's anything actionable left in #77058. If you disagree, you're free to reopen it (you should have the requisite permissions).

@ecstatic-morse ecstatic-morse deleted the optimize-int-range-from-pat branch October 6, 2020 01:42
@ecstatic-morse
Copy link
Contributor Author

Final perf results are in. Looks to be an improvement everywhere except unicode-normaliaztion and the match stress test, as expected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Code generation issue blocks the removal of an unused struct field
8 participants