Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster matching of "empty-matching" patterns (preview) #287

Closed
genivia-inc opened this issue Aug 28, 2023 · 1 comment
Closed

Faster matching of "empty-matching" patterns (preview) #287

genivia-inc opened this issue Aug 28, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

@genivia-inc
Copy link
Member

This is perhaps an interesting case, but not a common one, to optimize.

An "empty-matching regex pattern" permits zero-length input to be matched. For example, a?b?c? and a*b*c* match a, b and c in various forms in the input, but also matches "nothingness" (empty space between characters).

In fact, I bet that most grep users won't use these patterns much, if ever, because every line in the input is returned as a match anyway. With option -o it has some use to find specific matches though, so there is a use case to optimize.

Ugrep has options to control the matching behavior when empty-matching patterns are used. Option -Y permits empty matches, like GNU grep does. So when ugrep is used as a grep replacement (hard/soft link or alias), then option -Y is enabled to emulate GNU grep. This is not efficient at all yet (at this time of writing) with ugrep to match such patterns.

Actually, GNU grep does kind-of strange things with empty-matching patterns. Again, it is not something that most people would worry about, but it's interesting nevertheless:

ggrep 'j?a?v?' Hello.java
<nothing>

but

ggrep 'j*a*v*' Hello.java
<everything>

And with option -o:

ggrep -o 'j?a?v?' tests/Hello.java
<nothing>

but

ggrep -o 'j*a*v*' Hello.java
jav
a
a
a
a
a
v
a
a

For option -o, GNU grep behaves like ugrep (without -Y) to return matches, not everything.

To optimize the default ugrep behavior to reject empty matches for e.g. a*b*c* to get actual matches, such as a, b, c, bbccc, aab and so on, the pattern matching engine can be modified as if it is matching a pattern of length 1, not 0. This will capture the pattern when an a, b, or c is found in the input while skipping over everything else.

The modification is possible in ugrep without too much effort and should improve matching speed for these kind of "empty-matching patterns". I will post a timing comparison using my dev version to demonstrate what we're talking about.

@genivia-inc genivia-inc added the enhancement New feature or request label Aug 28, 2023
@genivia-inc genivia-inc pinned this issue Aug 28, 2023
@genivia-inc genivia-inc unpinned this issue Aug 28, 2023
@genivia-inc
Copy link
Member Author

The difference in performance is significant as expected. A simple example without optimizations: 0.98 seconds

time ugrep -c 'j?a?v?a?' benchmarks/corpi/enwik8
692119
0.970u 0.017s 0:00.98 100.0%	0+0k 0+0io 0pf+0w

The same example with two proposed optimizations (hacks) applied: 0.08 seconds

time ugrep -c 'j?a?v?a?' benchmarks/corpi/enwik8
692119
0.074u 0.011s 0:00.08 100.0%	0+0k 0+0io 0pf+0w

Note: GNU grep counts all lines with -c in this case, as if matching the pattern '' as was explained above. Matching everything is quick, but not useful. I hope the use of this technique will improve usefulness. When ugrep is aliased to grep then option -Y will match everything like GNU grep. Not so useful, but compatible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant