Skip to content

Conversation

@STRd6
Copy link
Contributor

@STRd6 STRd6 commented Oct 15, 2022

#5429 (comment)

I updated the SIMPLENUM regex in src/nodes.coffee and also the decimal parts of the number regex in lexer.coffee.

Also added a fix for numeric separators in decimal big int literals while I was in there.

@STRd6 STRd6 force-pushed the number-regexp-improve branch from 7278b11 to 41afdba Compare October 15, 2022 15:38
Copy link
Collaborator

@GeoffreyBooth GeoffreyBooth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, @edemaine and @Inve1951 ?

@edemaine
Copy link
Contributor

Looks good, and nice catch on the BigInt decimals.

I still prefer my regex \d+(?:_\d+)* from #5429 (comment) but this one is OK.

I wrote a small benchmark that confirms my intuition that my regex is a bit more performant, especially for larger strings.

@GeoffreyBooth
Copy link
Collaborator

Does \d+(?:_\d+)* pass all the tests? If so, is there any reason not to use it instead?

@STRd6
Copy link
Contributor Author

STRd6 commented Oct 17, 2022

@GeoffreyBooth that regex works for most places but causes catastrophic backtracking when added to BigInt. I think since the n comes after it keeps trying to match on it with every different combination of number lengths. The version without the + inside the * avoids this.

@edemaine https://jsbench.me/j7l9c2s0zl/1 it seems to only be faster for the SIMPLENUM test so maybe I'll update again with just that.

@edemaine
Copy link
Contributor

edemaine commented Oct 17, 2022

Fascinating! That n at the end makes a huge difference, relative to the regexes that don't have a trailing n. But I'm not sure it's the +s that are causing backtracking... it's the n itself.

I benchmarked among /^\d(?:_?\d)*n/, /^\d+(?:_\d+)*n/, and /^\d+(?:_?\d)*n/ and found that /^\d+(?:_\d+)*n/ (my original proposal) is fastest. Or did I miss something? Sadly, it's about 3x slower than /^\d+n/ that this PR replaces, but that's the price of supporting underscores.

(I did double-check the numeric separator spec to confirm that double underscore isn't allowed, and these are indeed the correct regexes.)

@Inve1951
Copy link
Contributor

From my experience catastrophic backtracking can be avoided with a negative lookahead (?!n). It causes the (dumb) regex engine to return early.
But I'm having a hard time following what the exact issue is here.

@STRd6
Copy link
Contributor Author

STRd6 commented Oct 24, 2022

From my testing changing the regex here: https://github.com/jashkenas/coffeescript/pull/5430/files#diff-b1724eb7967924ebd01486bf963f8281f5fff1acdf6f9354124c9ee50f0addcdR1313 caused the tests to never finish (I gave up after a few minutes when usually they only take six seconds). Changing the regexps on lines 1314 and 1315 seemed fine though. Please test it out on your local and let me know what you find in case I may be gravely mistaken.

@Inve1951
Copy link
Contributor

Oh I see. It should probably be non-greedy:

- \d(?:_?\d)*n
+ \d(?:_?\d)*?n

The difference here is that the first one starts testing from the end of the input, the second one from the start.
This could probably be changed in more places to gain some performance. (At the cost of readability)

@STRd6
Copy link
Contributor Author

STRd6 commented Oct 24, 2022

I tried some more testing but non-greedy didn't seem to help with the BigInt backtracking

changing to \d+(?:_?\d+)*?n still took forever on my local machine.

I'm happy with this as it is currently stands since it is close to the minimal amount of change and the performance seems in line with the existing compiler.

@GeoffreyBooth

@edemaine
Copy link
Contributor

@STRd6 I confirmed that changing that line to use \d+(?:_?\d+)*n makes cake test very slow (I terminated after a minute; my guess is it's quadratic time).

But my suggestion was \d+(?:_\d+)*n — note lack of ? after _. Then the benchmark runs slightly faster for me (~18 seconds instead of ~19 seconds, on an old CPU running Node 18.2). \d+(?:_\d+)*?n seemed to perform about the same as with greedy *.

@STRd6
Copy link
Contributor Author

STRd6 commented Oct 24, 2022

@edemaine Thanks for clarifying, I've updated the regexes to match and I think this is good for now.

@GeoffreyBooth
Copy link
Collaborator

Okay, so can everyone on this thread please confirm that the regex in this PR is the one we want to go with? 👍 or 👎

@GeoffreyBooth GeoffreyBooth merged commit 98a2331 into jashkenas:main Oct 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants