Replaced SSGROW with SSCHECK inside regcppush (Win32 slowdown, recursive patterns) #20530
Conversation
For anyone else reading this who isn't familiar with [...]: if there is a need to push a lot of things to the save stack, I'm not familiar enough with the save stack to know what the best balance is. Both functions would seem to benefit from using [...]
On second thought, it should probably not grow so much; we are probably leaving some stack unused somewhere, I can only assume.
Maybe we could add a check, at regex start and end, that the stack pointer is the same.
Ah no, it's fine - it's my code that is ultra recursive.
Not a perl issue - nevertheless, in that specific scenario it's faster with this change.
Perl/perl5#20530 is recommended though
This patch makes sense to me. Speculatively overallocating seems pretty reasonable. The growth factor of 1.5 seems a touch high; I'd expect something closer to 1.25. What I don't understand is why the patch doesn't change SSGROW and instead introduces a new macro. If pre-overallocating is a good idea (I think it is), then why not do it always?
@demerphq I have not introduced a new macro - SSCHECK is an existing one.
@AnFunctionArray sorry, I misunderstood. I see. FWIW, and totally as an aside, both of these macros should probably change or be replaced. The API doesn't allow us to check for overflow and wraparound as far as I can tell. Consider what happens when total_elems + REGCP_FRAME_ELEMS wraps to 0.
I'd tend to go towards making both entry points use exponential growth. I suspect this is more of an issue on Windows than on Linux - on Linux, malloc() tends to allocate large allocations with mmap() and then use mremap() to extend them for realloc(), but I don't see any API support for that in the Windows API.
@tonycoz That oughta be correct - I mainly tested on Windows with this one - on Mac, for example, no such issues occurred.
I think exponential growth needs to be bounded in a case like this. But otherwise I agree. Yves
Bounded how? It's already bounded by the maximum size of the scope stack (32GB for 64-bit). If it switches from exponential growth to linear growth, then the cost goes from O(N log N) to O(N^2). We already use exponential growth for PVs (in Perl_sv_grow) with a default of 1.25.
On Mon, 21 Nov 2022 at 23:19, Tony Cook ***@***.***> wrote:
I think exponential growth needs to be bounded in a case like this. But
otherwise I agree.
Bounded how?
It's already bounded by the maximum size of the scope stack (32GB for
64-bit).
If it switches from exponential growth to linear growth, then the cost
goes from $O(N \log N)$ to $O(N^2)$.
Yeah, I know the math. Keep in mind that is an approximation of the most
extreme case as it grows to infinity, whereas in practice the operations
are always bounded (we have finite memory available), and the actual
scenario is much less costly than the worst case.
So for instance if we made it grow exponentially on a sliding scale,
eventually bottoming out at some (large) fixed chunk size then yes, the
very few programs out there that build 4GB strings using naive
concatenation might see something that feels like N^2 growth, but most
people would never notice. Eg, consider if we used an exponent of 4 until
we hit 1MB, then used an exponent of 2 until we hit 32MB, and then an exponent
of 1.25 until we hit 100MB, and then proceeded in chunks of 100MB (or
multiples thereof if the stack needs to grow by that amount). How many
*real* programs would be affected? And how much memory would be saved on
average? Would it be faster for most programs?
I'd argue that choosing an optimization that is potentially extremely
wasteful for many users to benefit the extreme few who actually possess
alternative ways to solve the problem (as far as string allocation goes) is
not a good trade off, whatever the raw math says underneath. For instance
those people who might actually notice that building 4GB strings by naive
concatenation could always preallocate the string and then fill it in
afterwards. We could even expose an API to make that easier.
As for the case of sizing the stack, exponential growth represents a
probabilistic analysis that for any size we reach we are likely to reach
some exponentially larger size still. This is obviously not true. The stack
doesn't grow unbounded; it will eventually level off, and if we were to use
a sizing strategy like the above, when it goes linear it would be growing in
very large chunks, and I doubt that anyone would notice that it is growing
linearly in the extreme case.
We already use exponential growth for PVs (in Perl_sv_grow) with a default
of 1.25.
Yeah, I know. That exponent is IMO too small when the string is very small,
and too large when the string is very large. That is my point. Using a
fixed exponent doesn't make as much sense in practice with bounded memory
as it does in the rawest of big O analysis.
In the worst case the exponent means we would fail to allocate the
requested memory, even though if we allocated less we would have enough
space to concatenate in the desired string.
Anyway, my point is that a naive single exponent is actually a poor choice.
Using a larger exponent when the size is small, and a smaller exponent when
it is large, and a fixed size when it is very large seems like a better
balance for our users, with few (if any) who would notice the change. I
also think that it is likely that the exponents should be different for
different purposes, and could take into account data like "how much more is
being requested". For instance it might be better in some cases for the
stack to grow by a large exponent on the requested size, as compared to a
small exponent on the total size. Eg, consider if the stack is 100 elements
large, and we request 1000, we would end up at 1100 * 1.25 = 1375. But maybe
it would be better to say "1.25 of the total, or twice the requested size",
in which case we would end up at 2200.
Anyway, this is a complex discussion. Using a fixed exponent is not
*wrong*, in the theoretical sense anyway, but it is not necessarily the
best balance in the practical sense. In practice there is a probabilistic
question of "when will growth stop" and "how large will we grow by each
step". We don't know the answers to these questions, and they will vary by
workload. We might end up making this toy program run faster
use integer;
my $s= "";
my $i=0;
$s.="x" while ++$i; # loop till integer rollover.
while burning memory such that our developers can only fit half the real
data in memory that they could if we weren't so wastefully over-allocating.
cheers,
Yves
…--
perl -Mre=debug -e "/just|another|perl|hacker/"
First a quick observation about the commit message. Please write messages
whose first line describes what part of perl is being changed. When
looking through a series of commits, seeing a commit like
Replaced SSGROW with SSCHECK
sounds like this has been done in all the perl core and maybe all the
bundled distributions. Something like
regcppush(): for more regex speed, use SSCHECK not SSGROW
tells the reader what's affected and why, without having to read the whole
commit.
Now, some general observations about SSGROW / SSCHECK in the regex engine.
regcppush() is used to save the current state of captures when pushing
a new regex state, and is only used for a small number of operations:
CURLYX, EVAL, GOSUB (IIRC).
It's a vestigial part of when the engine used to recurse (and thus saved its
state on the C stack). Some variable-sized stuff was saved on the save
stack instead.
With the addition of chains of regmatch_state blocks and removal of
recursion, it no longer makes sense to use the save stack.
I had a vague plan that such state saves could be done using a couple of
new struct regmatch_state union members. For simple patterns with a small
number of captures, all the capture info could be saved in a single block
that has space for the various vars like maxopenparen and a few capture
states; for larger numbers, additional blocks containing just (start,
start_tmp, end) triplets would be used in addition. The blocks would be
skipped / processed as appropriate during regex state unwinding.
…--
31 Dec 1661: "I have newly taken a solemne oath about abstaining from plays".
1 Jan 1662: "And after ... we went by coach to the play".
-- The Diary of Samuel Pepys
In my defence I'll say there weren't many instances of SSGROW in the source tree (IIRC the only other instance was in scope.c). Also, what do you mean by "With the addition of chains of regmatch_state blocks and removal of recursion, it no longer makes sense to use the save stack"?
I was otherwise about to suggest querying the available memory - specifically via the Win32 API (where the problem occurs). There are several APIs available on Windows (like https://learn.microsoft.com/en-us/windows/win32/api/sysinfoapi/nf-sysinfoapi-globalmemorystatusex).
I am not sure I fully understand what @iabyn means (I get it generally, but I'm not sure I quite follow how this can work with EVAL or scenarios where we want to save "all the buffers"), but at a high level the concept is that the information we put on the save stack can be stored in the regmatch_state blocks, and restored appropriately as those state blocks are unwound. Essentially, the state stack unwinding is the thing that would trigger the restore operations in the first place, and if those saved states contain the required data then there is no need to have a save stack.
The general idea is that when you enter a new state you record the things that will change in the state record. For instance, the way pattern recursion is implemented is by having a "stop at close paren" (close_paren) variable, which when zero means "not in recursion", and when non-zero means "when execution reaches the close paren with this number, 'return' from the sub". When we enter recursion we store the old value of the variable in the state. So the initial recursion would store 0, and then change the global state var to be X. If further recursion occurs, we push a new state frame, and it would store X, and we would set the var to Y. Eventually we hit [...]
Every regex operation that can backtrack creates or updates the "current regexp state frame", and it is this data which allows backtracking to occur. For instance, the TRIE code will find all the possible matches in a single "pass", create a list of them, and then store that list in its frame data, so that when "the pattern to the right" fails and it needs to backtrack to one of the other matching options it can step through the list; when the list is exhausted it "fails" and unwinds the stack to backtrack even further. Things that can't backtrack don't add new frames, but may modify the topmost in some cases.
I am glossing over a number of important details here, such as that the stack generally isn't unwound on success (GOSUB and EVAL are notable exceptions), only on failure, but at the same time there is a "success unwind" and a "failure unwind" which do different things. But if you think a bit about how regexps are defined to behave and spend some time with [...]
Anyway, the idea that @iabyn has is that when we enter an open paren, or perform an operation that might affect the capture buffers, we do the bookkeeping required to unwind it in the regexp state vector in the same way I describe above. If the frame doesn't get unwound we don't need to restore the old value, etc. etc. @iabyn hopefully can expand on some of the finer points himself.
On Wed, Nov 23, 2022 at 09:12:33AM -0800, AnFunctionArray wrote:
Also what do you mean by "With the addition of chains of regmatch_state blocks and removal of
recursion, it no longer makes sense to use the save stack.".
Perl's regexp engine is implemented using backtracking. In something like
"aaaaab" =~ /^a*aab/,
then (ignoring optimisations), the a* will initially match 5 a's, then the
engine will fail to match "aab". It will backtrack, this time matching 4
a's and still failing on "aab"; then backtrack again, and this time, match
against 3 a's and 'aab'.
To do backtracking, the regex engine has to save its current state at any
decision point, so that it can be restored when backtracking.
Before my work for perl 5.10.0 (released in 2007) the regex engine
used to keep most of its state in local variables and recurse when trying
the next step. Backtracking involved returning from some levels of
S_regmatch() recursion, with the local variables being automatically
restored due to the fact that multiple copies of the variables were stored
on the CPU stack, one set per depth of recursion.
One problem with this was that for something like:
/(?: ...(.)...(.)...(.)...)*..../
The state to be stored for each depth of '*' recursion had to include
the current matching state for each of $1,$2,$3. Since patterns have a
variable (and not capped) number of captures, this state couldn't be
stored in a fixed number of local variables.
So instead, the regex engine (mis)used the save stack as a sort of
temporary storage to store a variable-sized chunk of data for each new
level of recursion. This was only needed for the few regex op types, like
CURLYX/WHILEM, which needed to save and possibly restore all the current
capture info.
In perl-5.10.0, I changed the regex engine from recursive to iterative. So
rather than storing the current state on the CPU stack (in the form of
function local variables), it now saves state in malloc()ed chunks of
memory. This has an important advantage in that, for long strings, the
old regex engine could recurse deeply and blow the CPU stack (especially
in threaded environments, where each thread's max stack size was small by
default).
In particular, I created a structure called regmatch_state (which is
mainly a union) to save the current state for each type of regex op.
The regex engine would allocate a block of memory which could hold an
array of regmatch_state's. When that became full, a new block would be
malloc()ed and the blocks chained together. (These blocks are struct
regmatch_slab).
However, I left the capture state saving alone, so it is being done on the
save stack, even though I knew it wasn't ideal. My intention was to
revisit at some point, and instead of saving capture state on the save
stack, save it by allocating one or more extra regmatch_state structs to
hold the capture indices at the same time as pushing a WHILEM-type
regmatch_state.
I'm not saying that you need to do this; I was just mentioning it as a
general observation in case you or anyone else intended to do any further
work in this area.
…--
"You may not work around any technical limitations in the software"
-- Windows Vista license
@iabyn the thing I don't get is how you would handle a vector of parens. Eg, if we recurse into a sub pattern, we want to save the parens it touches, not all the parens in the parent (iiuir). Would you keep using the save stack for eval/gosub and just use your alternative for things where we can always preserve the old values safely? Or am I mistaken in thinking they are different cases and I am missing something?
On Thu, Nov 24, 2022 at 07:24:34AM -0800, Yves Orton wrote:
@iabyn the thing i dont get is how you would handle a vector of parens?
Eg, if we recurse into a sub pattern, we want to save the paren's it
touches, not all the parens in the parent (iiuir). Would you keep using
the save stack for eval/gosub and just use your alternative for things
where we can always preserve the old values safely? Or am i mistaken in
thinking they are different cases and I am missing something?
The idea is that the code will save as many parens as need saving by
allocating a suitable number of regmatch_state blocks. For a small number
of parens a single regmatch_state would be used; for larger, use multiple
structs. There would be two new types of union within the regmatch_state
struct: one that holds maxopenparen, lastcloseparen etc plus a small
array of (start, start_tmp, end) triplets. The number would be chosen to
make this element of the union no larger than the existing largest union
member. Then there would be a second new union type, not containing
maxopenparen etc., but holding a bigger fixed set of triplets. S_regcppush()
or its equivalent would push at least one new regmatch_state block, and if
there were more than a few parens needing saving, additional blocks as
required.
My intention is that the save stack would not be used for *anything* in
the regex engine, and functions like S_regcppush() (no matter whether
called from EVAL or GOSUB or WHILEM), would save capture state using
regmatch_state blocks rather than the save stack.
…--
Counsellor Troi states something other than the blindingly obvious.
-- Things That Never Happen in "Star Trek" #16
I see, thanks @iabyn, it makes sense to me now. That was the bit that eluded me.
@AnFunctionArray were you going to push a fix for the commit message? |
I might. Done. |
Observed is a huge slow-down on Win32 machines with many-leveled nested recursive patterns. The bottleneck was identified to be SSGROW reallocations inside regcppush. Replacing it with SSCHECK yields huge performance gains (to the extent that it could be called a "fix") on said platform.
On Tue, Feb 07, 2023 at 05:23:04AM -0800, Yves Orton wrote:
Merged #20530 into blead.
This patch is flawed.
By replacing SSGROW(n) - which grows the savestack by n - with
SSCHECK(n) - which grows the savestack by 50% and ignores n -
it's possible to overrun the end of the savestack when the stack is small
and the number of captures to save is large.
For example, the following code fails under valgrind:
$_ = "x" x 1000;
my $pat = '(.)' x 200;
$pat = qr/($pat)+/;
m/$pat/;
…--
Never work with children, animals, or actors.
@iabyn ack. im looking at it now. |
Prior to this patch SSCHECK() took a "needs" parameter, but did not actually guarantee that the stack would be sufficiently large to accommodate that many elements after the call. This was quite misleading, especially as SSGROW() would not do geometric preallocation but SSCHECK() would, so much of the time SSCHECK() would appear to be a better choice, but not always.
This patch makes SSCHECK() an alias for SSGROW(), and it makes SSGROW() also geometrically overallocate. The underlying function that used to implement SSCHECK(), savestack_grow(), now calls savestack_grow_cnt(), which has always implemented SSGROW(). Anything in the internals that used to call SSCHECK() now calls SSGROW() instead.
At the same time the preallocation has been made a little more aggressive, ensuring that we always allocate at least SS_MAXPUSH elements on top of what was requested as part of the "known" size of the stack, and additional SS_MAXPUSH elements which are not part of the "known" size of the stack. This "hidden extra" is used to simplify some of the other macros which are used a lot internally. (I have beefed up the comment explaining this as well.)
See #20530 where Dave M points out the problem with this logic. Currently I don't think we have a good way to test for the error that he showed, so unfortunately this patch lacks a test.
Which drastically improves performance on large matches.
donuts.zip
For the following file for example:
(On AnFunctionArray/cperllexer@5980b21)
With the following env vars:
MINQUEUECOUNT=50
SILENT=1
Where the second run is without the patch.