Skip to content

(Optimization:Regexec) Replaced recppush/regcppop with memcpy #19814

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

AnFunctionArray
Copy link
Contributor

Improved performance of complex recursive regular expressions with a lot of capture groups.

Determined bottleneck with Intel VTune and fixed by using faster built-in memcpy instead of a loop.

Currently in my test cases performance seems to be 2x.

@AnFunctionArray
Copy link
Contributor Author

AnFunctionArray commented Jun 3, 2022

@khwilliamson Does that mean the tests failing are for author only - I thought there were alignment issues in 64 bit.

That's why I closed this by myself. Anyway I'll be happy if this actually gets accepted.

Copy link
Contributor

@hvds hvds left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the general idea here, but more care is needed in the calculations - I think you need to retain the original calculation of paren_elems_to_push, and convert that to units of PL_savestack elements in a separate variable.

@hvds
Copy link
Contributor

hvds commented Jun 9, 2022

I think it would also be useful to reorder things slightly for clarity: in regcppush have the SSGROW() immediately next to the memcpy and SSPUSHxxx calls (and correspondingly for regcppop), and have the complex DEBUG_BUFFERS_r() chunk either before or after those.

@AnFunctionArray
Copy link
Contributor Author

AnFunctionArray commented Jun 9, 2022

Can I get tests run please - I think nothing brakes with this (and I hope the debug prints are alright although I believe there aren't tests for that)?

@AnFunctionArray AnFunctionArray requested a review from hvds June 9, 2022 17:24
@AnFunctionArray AnFunctionArray requested a review from hvds June 16, 2022 12:57
Copy link
Contributor

@hvds hvds left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you've addressed the latest comments, it would probably be a good time to squash the commits and rebase.

regexec.c Outdated
Comment on lines 238 to 240
/* static assert that offs struc size is not less than stack elem size */

sizeof(char[sizeof(*rex->offs) >= sizeof(*PL_savestack) ? 1 : -1]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See STATIC_ASSERT_DECL and STATIC_ASSERT_STMT in perl.h

This should ideally be used at the point we rely on it for the calculation in regcppop, something like:

     /* calculate number of offs/capture groups stored */
+    STATIC_ASSERT_STMT(sizeof(*rex->offs) >= sizeof(*PL_savestack));
     i = (i * sizeof(*PL_savestack)) / sizeof(*rex->offs);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still needs to move to reccppop

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I marked this unresolved last time, but it still needs to move.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a static assert I can move it if you like.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hvds Ok moved it as requested.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curious, can it be made into a compile time assert somehow?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curious, can it be made into a compile time assert somehow?

Yes, per my comment above: "See STATIC_ASSERT_DECL and STATIC_ASSERT_STMT in perl.h".

@AnFunctionArray AnFunctionArray requested a review from hvds June 17, 2022 17:01
@AnFunctionArray AnFunctionArray force-pushed the regpuhpopopt branch 2 times, most recently from f995eb3 to 1250ae4 Compare June 20, 2022 13:51
Copy link
Contributor

@hvds hvds left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remain nervous about the reliance on relative structure sizes, but I can live with this if someone else agrees that they're happy with it.

To recap: in regcppush this change packs the capture groups to save (sizeof(*rex->offs) each, currently equal to 3 * sizeof(SSize_t)) into a calculated number of savestack slots (sizeof(*PL_savestack) each, currently sizeof(ANY)). In regcppop we then discover how many capture groups should be restored by doing the reverse calculation - which can get the wrong result if there was enough spare space in the final savestack slot to fit another one. I.e. it can go wrong if we ever have sizeof(*PL_savestack) > sizeof(*rex->offs).

That risk is now managed by a static assert:

    STATIC_ASSERT_STMT(sizeof(*rex->offs) >= sizeof(*PL_savestack));

My remaining concern is that if the assert ever fails on some platform, we have no good way to fix it for them. @tonycoz, @iabyn, @leonerd, anyone want to weigh in for or against?

@khwilliamson khwilliamson requested a review from iabyn June 21, 2022 02:35
@tonycoz
Copy link
Contributor

tonycoz commented Jun 21, 2022

I don't see why the static assert is needed, if we assume for argument that Size_t is 1 byte (making sizeof(regexp_paren_pair) == 3) and sizeof(*PL_savestack) is 8, then the calculation of paren_elems_to_push is still the correct 1.

I don't think the assertion can fail currently, though that would change if we added NV to union any (__float128 on 32-bit systems)

@hvds
Copy link
Contributor

hvds commented Jun 21, 2022

I don't see why the static assert is needed, if we assume for argument that Size_t is 1 byte (making sizeof(regexp_paren_pair) == 3) and sizeof(*PL_savestack) is 8, then the calculation of paren_elems_to_push is still the correct 1.

I don't think the assertion can fail currently, though that would change if we added NV to union any (__float128 on 32-bit systems)

@tonycoz The issue is with the reverse calculation in recppop (which is why I was at pains to have the static assert moved to the correct place):

    /* calculate number of offs/capture groups stored */
    i = (i * sizeof(*PL_savestack)) / sizeof(*rex->offs);

With your example sizes, this would calculate 2 as the number of paren groups to restore, regardless of whether 1 or 2 paren groups had been saved in regcppush.

@AnFunctionArray
Copy link
Contributor Author

If I need to do anything - please tell me - I would like this to be pushed forward personally.

@hvds
Copy link
Contributor

hvds commented Jul 28, 2022

If I need to do anything - please tell me - I would like this to be pushed forward personally.

I'm hoping that at some point @tonycoz will get a chance to look at my most recent response. If @iabyn gets a chance to look at the PR as well, that would be super.

@tonycoz
Copy link
Contributor

tonycoz commented Jul 28, 2022

If I need to do anything - please tell me - I would like this to be pushed forward personally.

I'm hoping that at some point @tonycoz will get a chance to look at my most recent response. If @iabyn gets a chance to look at the PR as well, that would be super.

Sorry, you were so obviously correct I didn't think I needed to respond.

My comment about whether the assertion can fail still applies - only if we add NV or another similarly large type to union any. Note that perl.h has provision to allow an external definition of union any, though I doubt it has seen any recent use.

@iabyn
Copy link
Contributor

iabyn commented Aug 2, 2022 via email

Copy link
Collaborator

@demerphq demerphq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think at least this could use some more detailed explanation in the comments why it is correct. I had to stare at a couple of parts to understand what it is doing.

Basically tested with Intel VTune to be increasing the performance of a perl regex matching program with multiple capture groups and recursive patterns
@tonycoz
Copy link
Contributor

tonycoz commented Oct 30, 2022

It looks reasonable to me at this point.

I did a performance test using the code from the reproduction of #20411 (with a smaller source file).

It made a huge difference in the number of instructions, from 9.8e9 to 2.6e9 instructions, from 5e6 branch misses to 3.6e6 branch misses.

raw perf output
blead:

Performance counter stats for '../../perl5/perl -I/home/tony/dev/perl/git/perl5/lib parse.pl ../shorter.pp':

          1,334.39 msec task-clock                #    0.998 CPUs utilized          
                 5      context-switches          #    0.004 K/sec                  
                 1      cpu-migrations            #    0.001 K/sec                  
           122,041      page-faults               #    0.091 M/sec                  
     6,244,474,923      cycles                    #    4.680 GHz                    
     9,844,500,190      instructions              #    1.58  insn per cycle         
       847,094,914      branches                  #  634.816 M/sec                  
         5,295,250      branch-misses             #    0.63% of all branches        

       1.336585169 seconds time elapsed

       1.203539000 seconds user
       0.131949000 seconds sys


Performance counter stats for '../../perl5/perl -I/home/tony/dev/perl/git/perl5/lib parse.pl ../shorter.pp':

          1,352.82 msec task-clock                #    1.000 CPUs utilized          
                 4      context-switches          #    0.003 K/sec                  
                 1      cpu-migrations            #    0.001 K/sec                  
           122,047      page-faults               #    0.090 M/sec                  
     6,269,318,583      cycles                    #    4.634 GHz                    
     9,847,125,497      instructions              #    1.57  insn per cycle         
       847,748,958      branches                  #  626.652 M/sec                  
         5,119,632      branch-misses             #    0.60% of all branches        

       1.353153777 seconds time elapsed

       1.208993000 seconds user
       0.144118000 seconds sys
 Performance counter stats for '../../perl5/perl -I/home/tony/dev/perl/git/perl5/lib parse.pl ../shorter.pp':

          1,331.40 msec task-clock                #    1.000 CPUs utilized          
                 3      context-switches          #    0.002 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
           122,048      page-faults               #    0.092 M/sec                  
     6,218,556,628      cycles                    #    4.671 GHz                    
     9,844,846,499      instructions              #    1.58  insn per cycle         
       847,218,356      branches                  #  636.337 M/sec                  
         5,008,442      branch-misses             #    0.59% of all branches        

       1.331663734 seconds time elapsed

       1.211690000 seconds user
       0.119969000 seconds sys

19814:

 Performance counter stats for '../../perl5/perl -I/home/tony/dev/perl/git/perl5/lib parse.pl ../shorter.pp':

            449.84 msec task-clock                #    0.917 CPUs utilized          
                 2      context-switches          #    0.004 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
           122,019      page-faults               #    0.271 M/sec                  
     2,070,819,713      cycles                    #    4.603 GHz                    
     2,567,054,451      instructions              #    1.24  insn per cycle         
       510,284,522      branches                  # 1134.359 M/sec                  
         3,573,565      branch-misses             #    0.70% of all branches        

       0.490348496 seconds time elapsed

       0.333654000 seconds user
       0.116578000 seconds sys

Performance counter stats for '../../perl5/perl -I/home/tony/dev/perl/git/perl5/lib parse.pl ../shorter.pp':

            436.59 msec task-clock                #    0.999 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
           122,017      page-faults               #    0.279 M/sec                  
     2,043,129,622      cycles                    #    4.680 GHz                    
     2,567,131,711      instructions              #    1.26  insn per cycle         
       510,323,315      branches                  # 1168.880 M/sec                  
         3,630,996      branch-misses             #    0.71% of all branches        

       0.436871340 seconds time elapsed

       0.344714000 seconds user
       0.092191000 seconds sys
Performance counter stats for '../../perl5/perl -I/home/tony/dev/perl/git/perl5/lib parse.pl ../shorter.pp':

            438.77 msec task-clock                #    0.999 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
           122,020      page-faults               #    0.278 M/sec                  
     2,054,471,588      cycles                    #    4.682 GHz                    
     2,565,659,212      instructions              #    1.25  insn per cycle         
       509,973,832      branches                  # 1162.293 M/sec                  
         3,678,948      branch-misses             #    0.72% of all branches        

       0.439053740 seconds time elapsed

       0.314207000 seconds user
       0.124877000 seconds sys

@demerphq
Copy link
Collaborator

demerphq commented Nov 3, 2022

@khwilliamson @tonycoz @demerphq - what do we need to get this merged finally?

@demerphq
Copy link
Collaborator

demerphq commented Nov 3, 2022

I see we have ack already, and @tonycoz seems happy, ill do a further review and assuming im happy with it ill merge it.

@demerphq
Copy link
Collaborator

demerphq commented Nov 3, 2022

This was merged as 66964e1

I made two whitespace changes to the commit, one to the commit text and one to remove a spurious whitespace change, and since there had been so much chatter on this before I decided not to ask @AnFunctionArray to do it himself. I then manually pushed to blead.

Closing this as merged.

@demerphq demerphq closed this Nov 3, 2022
@demerphq
Copy link
Collaborator

demerphq commented Nov 3, 2022

BTW, thanks for the patch!

@AnFunctionArray
Copy link
Contributor Author

BTW, thanks for the patch!

Well no problem - I've definitely had fun doing this after my crazy shenanigans - trying to add a way to preserve capture groups after recursion of a pattern (by adding new syntax) - but this and added multithreading plus some micro optimisations (like removing (??{})) - which are by nature slow (because of the pattern recompiling each time) - kinda did the same thing for my project and I'm happy :).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants