Skip to content

Replaced SSGROW with SSCHECK inside regcppush (Win32 slowdown, recursive patterns) #20530

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Feb 7, 2023

Conversation

AnFunctionArray
Copy link
Contributor

Which drastically improves performance on large matches.

donuts.zip

For the following file for example:

(On AnFunctionArray/cperllexer@5980b21)

With the following env vars:

MINQUEUECOUNT=50
SILENT=1

Z:\cllvmbackend\lexer>powershell -Command "Measure-Command {perl ./parse.pl ./donuts.pp}"
joinning 30, 834632 for real


Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 471
Ticks             : 4713919
TotalDays         : 5.45592476851852E-06
TotalHours        : 0.000130942194444444
TotalMinutes      : 0.00785653166666667
TotalSeconds      : 0.4713919
TotalMilliseconds : 471.3919




Z:\cllvmbackend\lexer>powershell -Command "Measure-Command {Z:\perl5orig\perl.exe ./parse.pl ./donuts.pp}"
joinning 30, 834632 for real


Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 25
Milliseconds      : 489
Ticks             : 254896534
TotalDays         : 0.000295019136574074
TotalHours        : 0.00708045927777778
TotalMinutes      : 0.424827556666667
TotalSeconds      : 25.4896534
TotalMilliseconds : 25489.6534




Z:\cllvmbackend\lexer>

Where the second run is without the patch.

@jkeenan
Copy link
Contributor

jkeenan commented Nov 19, 2022

The intent of this p.r. appears to be performance improvement. @iabyn, @demerphq can you take a look? Thanks.

@richardleach
Copy link
Contributor

For anyone else reading this who isn't familiar with SSGROW and SSCHECK, the difference boils down to growing the save stack from a size of PL_savestack_ix to a size of:
SSGROW: PL_savestack_ix + need
SSCHECK: PL_savestack * 3/2

So if there is a need to push a lot of things to the save stack, SSGROW will trigger a lot of reallocations for tiny size increases, but (in theory) not have the stack allocating a lot more memory than is needed. In comparison, by resizing by much more at a time. SSCHECK will need to do fewer reallocations but could result in the save stack being allocated a good deal more memory (in the worst case) than it actually needs.

I'm not familiar enough with the save stack to know what the best balance is.

Both functions would seem to benefit from using malloc_usable_size etc. where the malloc() implementation supports it. For other cases, perhaps SSGROW could be somehow tuned to better appreciate the actual allocation sizes it's going to be given. (IIRC modern languages (e.g. Go) and browsers (certainly Firefox) often go for power-of-two allocations in general and find that works well for them.)

@AnFunctionArray
Copy link
Contributor Author

On a second thought it should probably not grow so much thus we are probably leaving some stack somewhere I could only assumr.

@AnFunctionArray
Copy link
Contributor Author

Maybe if we add a check on regex start end if the stack pointer is the same.

@AnFunctionArray
Copy link
Contributor Author

Ah no it's fine - it's my code is ultra recursive:

(?<globaloutter>(?<globalinner>(?&fasterdecls))*+((?&globalinner)(?&globaloutter))?+)

Not perl issue - nevertheless in that specific scenario it's faster with this change.

AnFunctionArray added a commit to AnFunctionArray/cperllexer that referenced this pull request Nov 20, 2022
Perl/perl5#20530 is recommended though
@demerphq
Copy link
Collaborator

This patch makes sense to me. Speculatively overallocating seems pretty reasonable to me. The growth factor of 1.5 seems a touch high, id expect something closer to 1.25 or something like that. What i don't understand is why the patch doesn't change SSGROW and instead introduces a new macro. If pre-overallocating is a good idea (i think it is) then why not do it always?

@AnFunctionArray
Copy link
Contributor Author

@demerphq I have not introduced a new macro - the SSCHECK is an existing one.

@demerphq
Copy link
Collaborator

@AnFunctionArray sorry, misunderstood. I see. FWIW, and totally as an aside, both of these macros should probably change or be replaced. The api doesnt allow us to check for overflow and wraparound as far as I can tell. Consider what happens when total_elems + REGCP_FRAME_ELEMS wraps to 0.

@tonycoz
Copy link
Contributor

tonycoz commented Nov 21, 2022

I'd tend to go towards making both entry points use exponential growth.

I suspect this is is more an issue on Windows than on Linux - on Linux malloc() tends to allocate large allocations with mmap() and then use mremap() to extend them for realloc(), but I don't see any API support for that in the Windows API.

@AnFunctionArray
Copy link
Contributor Author

AnFunctionArray commented Nov 21, 2022

@tonycoz That oughta be correct - I mainly tested on windows with this one - on mac for example no such issues were had.

@demerphq
Copy link
Collaborator

I'd tend to go towards making both entry points use exponential growth.

I think exponential growth needs to be bounded in a case like this. But otherwise I agree.

Yves

@tonycoz
Copy link
Contributor

tonycoz commented Nov 21, 2022

I think exponential growth needs to be bounded in a case like this. But otherwise I agree.

Bounded how?

It's already bounded by the maximum size of the scope stack (32GB for 64-bit).

If it switches from exponential growth to linear growth, then the cost goes from $O(N \log N)$ to $O(N^2)$.

We already use exponential growth for PVs (in Perl_sv_grow) with a default of 1.25.

@demerphq
Copy link
Collaborator

demerphq commented Nov 22, 2022 via email

@iabyn
Copy link
Contributor

iabyn commented Nov 22, 2022 via email

@AnFunctionArray
Copy link
Contributor Author

AnFunctionArray commented Nov 23, 2022

@iabyn

In my defence I'll say there weren't many instances of SSGROW in the source tree (IRC the only other instance was in scope.c).

Also what do you mean by "With the addition of chains of regmatch_state blocks and removal of
recursion, it no longer makes sense to use the save stack."?

@AnFunctionArray
Copy link
Contributor Author

I was about to otherwise suggest querying the available memory - specifically for Win32 API (where the problem occurs).

There are several APIs available in Windows (like https://learn.microsoft.com/en-us/windows/win32/api/sysinfoapi/nf-sysinfoapi-globalmemorystatusex).

@demerphq
Copy link
Collaborator

demerphq commented Nov 24, 2022

Also what do you mean by "With the addition of chains of regmatch_state blocks and removal of
recursion, it no longer makes sense to use the save stack."?

I am not sure if I fully understand what @iabyn means (I get it generally but I'm not sure I quite follow how this can work with EVAL or scenarios where we want to save "all the buffers"), but at a high level, the concept is that the information we put on the save stack can be stored in the regmatch_state blocks, and restored appropriately as those state blocks are unwound. Essentially the state stack unwinding is the thing that would trigger the restore operations in the first place, and if those save states contain the require data then there is no need to have a save stack.

The general idea is that when you enter a new state you record the things that will change in the state record. So for instance the way pattern recursion is implemented is by having a "stop at close paren" (close_paren) variable, which when zero means "not in recursion", and when non-zero means, "when execution reaches the close paren with this number 'return' from the sub". When we enter recursion we store the old value of the variable in the state. So the initial recursion would store 0, and then change the global state var to be X. If further recursion occurs, we push a new state frame, and it would store X, and we would set the var to Y. Eventually we hit CLOSE #Y, which causes us to unwind the state stack, so extract the old value, so for instance restore the var back to X, and continue matching at the appropriate regop from the "caller". Later when we it the CLOSE #X, we again unwind the state stack, set the close to 0, and continue with the top level pattern. (NOTE: I reviewed the code after i wrote the above, and while what I said above about a global variable (in the execution engine) /was/ true for the open/close paren, and is /generally/ true of the state operations, it turns out that this particular case was optimized and the variable just lives in the regexp state stack, there is no global var for this one anymore, but for most cases what is described here is correct. close_paren is exceptional however and we do not use an explicit global var for it and instead just use the most recent stack frame, but most cases are as I described. Search for EVAL_CLOSE_PAREN in regexp.c)

Every regex operation that can backtrack creates or updates the "current regexp state frame", and it is this data which allows backtracking to occur. For instance the TRIE code will find all the possible matches in a single "pass", create a list of them, and then store that list in its framedata so that when "the pattern to the right" fails and it needs to backtrack to one of the other matching options it can step through the list, when the list is exhausted it "fails" and unwinds the stack to backtrack even further. Things that can't backtrack dont add new frames, but may modify the topmost in some cases.

I am glossing over a number of important details here, such as that the stack generally isn't unwound on success (GOSUB and EVAL are notable exceptions), only on failure, but at the same time there is a "success unwind" and a "failure unwind" which do different things. But if you think a bit about how regexps are defined to behave and spend some time with use re DEBUG=>'ALL', which IIRC should show the stack operations during matching, it will come clear. Here is an abbreviated example of what you might see.

$ perl -Mre=Debug,EXECUTE,STACK,STATE -e'"aaba"=~/(?<a>a)(?<b>(?&a)b(?&a))/'
Matching REx "(?<a>a)(?<b>(?&a)b(?&a))" against "aaba"
Intuit: trying to determine minimum start position...
  doing 'check' fbm scan, [0..4] gave 0
  Found anchored substr "aaba" at offset 0 (rx_origin now 0)...
  (multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
                             |   0| Setting an EVAL scope, savestack=3,
   0 <> <aaba>               |   0| 1:OPEN1 'a'(3)
   0 <> <aaba>               |   0| 3:EXACT <a>(5)
   1 <a> <aba>               |   0| 5:CLOSE1 'a'(7)
   1 <a> <aba>               |   0| 7:OPEN2 'b'(9)
   1 <a> <aba>               |   0| 9:GOSUB1[-8:1] 'a'(12)
                             |   0| entering GOSUB, prev_recurse_locinput=0 recurse_locinput[1]=564b45c3feb1
                             |   0| Setting an EVAL scope, savestack=15,
                             |   0| push #0   EVAL_postponed_AB yes
   1 <a> <aba>               |   1|  1:OPEN1 'a'(3)
   1 <a> <aba>               |   1|  3:EXACT <a>(5)
   2 <aa> <ba>               |   1|  5:CLOSE1 'a'(7)
                             |   1|  FAKE-END[before] GOSUB1 ce=564b45c28d20 recurse_locinput=0
                             |   1|  Setting an EVAL scope, savestack=25,
                             |   1|  END: EVAL trying tail ... (cur_eval=0)
                             |   1|  push #1   EVAL_postponed_AB yes
                             |   1|       #0   EVAL_postponed_AB yes
   2 <aa> <ba>               |   2|   12:EXACT <b>(14)
   3 <aab> <a>               |   2|   14:GOSUB1[-13:1] 'a'(17)
                             |   2|   entering GOSUB, prev_recurse_locinput=0 recurse_locinput[1]=564b45c3feb3
                             |   2|   Setting an EVAL scope, savestack=35,
                             |   2|   push #2   EVAL_postponed_AB yes
                             |   2|        #1   EVAL_postponed_AB yes
                             |   2|        #0   EVAL_postponed_AB yes
   3 <aab> <a>               |   3|    1:OPEN1 'a'(3)
   3 <aab> <a>               |   3|    3:EXACT <a>(5)
   4 <aaba> <>               |   3|    5:CLOSE1 'a'(7)
                             |   3|    FAKE-END[before] GOSUB1 ce=564b45c28e40 recurse_locinput=0
                             |   3|    Setting an EVAL scope, savestack=45,
                             |   3|    END: EVAL trying tail ... (cur_eval=0)
                             |   3|    push #3   EVAL_postponed_AB yes
                             |   3|         #2   EVAL_postponed_AB yes
                             |   3|         #1   EVAL_postponed_AB yes
   4 <aaba> <>               |   4|     17:CLOSE2 'b'(19)
   4 <aaba> <>               |   4|     19:END(0)
   4 <aaba> <>               |   4|     pop (yes) EVAL_postponed_AB[Y]
                             |   3|    EVAL_AB cur_eval=0 prev_eval=564b45c28e40
                             |   3|    EVAL_AB[after] GOSUB1 ce=564b45c28e40 recurse_locinput=564b45c3feb3
   4 <aaba> <>               |   3|    pop (yes) EVAL_postponed_AB[Y]
                             |   2|   EVAL_AB cur_eval=564b45c28e40 prev_eval=0
                             |   2|   EVAL_AB[before] GOSUB1 ce=564b45c28e40 recurse_locinput=0
   4 <aaba> <>               |   2|   pop (yes) EVAL_postponed_AB[Y]
                             |   1|  EVAL_AB cur_eval=0 prev_eval=564b45c28d20
                             |   1|  EVAL_AB[after] GOSUB1 ce=564b45c28d20 recurse_locinput=564b45c3feb1
   4 <aaba> <>               |   1|  pop (yes) EVAL_postponed_AB[Y]
                             |   0| EVAL_AB cur_eval=564b45c28d20 prev_eval=0
                             |   0| EVAL_AB[before] GOSUB1 ce=564b45c28d20 recurse_locinput=0
Match successful!

Anyway, the idea that @iabyn has is that when we enter an open, or perform an operation that might affect the capture buffers we do the bookkeeping required to unwind it in the regexp state vector in the same way I describe above. If the frame doesnt get unwound we dont need to restore the old value, etc. etc. @iabyn hopefully can expand on some of the finer points himself.

@iabyn
Copy link
Contributor

iabyn commented Nov 24, 2022 via email

@demerphq
Copy link
Collaborator

@iabyn the thing i dont get is how you would handle a vector of parens? Eg, if we recurse into a sub pattern, we want to save the paren's it touches, not all the parens in the parent (iiuir). Would you keep using the save stack for eval/gosub and just use your alternative for things where we can always preserve the old values safely? Or am i mistaken in thinking they are different cases and I am missing something?

@iabyn
Copy link
Contributor

iabyn commented Nov 24, 2022 via email

@demerphq
Copy link
Collaborator

The idea is that the code will save as many parens as need saving by allocating a suitable number of regmatch_state blocks

I see, thank @iabyn it makes sense to me now. That was the bit that eluded me.

@demerphq
Copy link
Collaborator

demerphq commented Dec 6, 2022

@AnFunctionArray were you going to push a fix for the commit message?

@AnFunctionArray
Copy link
Contributor Author

AnFunctionArray commented Dec 6, 2022

@AnFunctionArray were you going to push a fix for the commit message?

I might.

Done.

Observed is huge slow-down on Win32 machines
with many leveled nested recursive patterns.
Bottleneck identified to be SSGROW reallocations
inside regcppush.

Replacing with SSCHECK yields huge performance gains
(to the extened for it to be called a "fix") on said platform.
@AnFunctionArray AnFunctionArray changed the title Replaced SSGROW with SSCHECK Replaced SSGROW with SSCHECK inside regcppush (Win32 slowdown, recursive patterns) Dec 6, 2022
@demerphq demerphq merged commit 92373de into Perl:blead Feb 7, 2023
@iabyn
Copy link
Contributor

iabyn commented Feb 19, 2023 via email

@demerphq
Copy link
Collaborator

@iabyn ack. im looking at it now.

demerphq added a commit that referenced this pull request Feb 19, 2023
Prior to this patch SSCHECK() took a "needs" parameter, but did not
actually guarantee that the stack would be sufficiently large to
accomodate that many elements after the call. This was quite misleading.
Especially as SSGROW() would not do geometric preallocation, but
SSCHECK() would, so much of the time SSCHECK() would appear to be a
better choice, but not always.

This patch makes it so SSCHECK() is an alias for SSGROW(), and it makes
it so that SSGROW() also geometrically overallocates. The underlying
function that used to implement SSCHECK() savestack_grow() now calls the
savestack_grow_cnt() which has always implemented SSGROW(). Anything
in the internals that used to call SSCHECK() now calls SSGROW() instead.

At the same time the preallocation has been made a little bit more
aggressive, ensuring that we always allocate at least SS_MAXPUSH
elements on top of what was requested as part of the "known" size of the
stack, and additional SS_MAXPUSH elements which are not part of the
"known" size of the stack. This "hidden extra" is used to simply some of
the other macros which are used a lot internally. (I have beefed up the
comment explaining this as well.)

See #20530 where Dave M points out the
problem with this logic. Currently I don't think we have a good way to
test for the error that he showed so unfortunately this patch lacks a
test.
@demerphq
Copy link
Collaborator

@iabyn hopefully fixed in #20824

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants