Faster matching of "empty-matching" patterns (preview) #287

genivia-inc · 2023-08-28T01:52:10Z

This is perhaps an interesting case to optimize, though not a common one.

An "empty-matching regex pattern" permits zero-length input to be matched. For example, a?b?c? and a*b*c* match a, b and c in various combinations in the input, but they also match "nothingness" (the empty space between characters).
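
To make the definition concrete, here is a small sketch using Python's re module (not ugrep's engine; the function names are made up for illustration). It checks whether a pattern accepts zero-length input and shows that such a pattern "matches" any line at all, which is why every line is returned.

```python
import re

def matches_empty(pattern):
    """True if the pattern accepts zero-length input."""
    return re.fullmatch(pattern, "") is not None

def line_matches(pattern, line):
    """True if the pattern matches anywhere in the line (grep-style search)."""
    return re.search(pattern, line) is not None

def all_matches(pattern, text):
    """All matches found by scanning left to right, empty ones included."""
    return [m.group() for m in re.finditer(pattern, text)]

if __name__ == "__main__":
    for pattern in (r"a?b?c?", r"a*b*c*"):
        print(pattern, "accepts empty input:", matches_empty(pattern))
        # A line without any a, b or c still matches, via an empty match.
        print(pattern, "matches 'xyz':", line_matches(pattern, "xyz"))
        # Only the non-empty matches are interesting with -o style output.
        print([m for m in all_matches(pattern, "xabx") if m])
```

Filtering out the empty strings at the end corresponds to what -o reports: only the matches that actually consume input.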

In fact, I bet that most grep users won't use these patterns much, if ever, because every line in the input is returned as a match anyway. With option -o they are of some use for finding specific matches, though, so there is a use case to optimize.

Ugrep has options to control the matching behavior when empty-matching patterns are used. Option -Y permits empty matches, like GNU grep does. So when ugrep is used as a grep replacement (via a hard/soft link or an alias), option -Y is enabled to emulate GNU grep. At the time of writing, ugrep does not yet match such patterns efficiently.

Actually, GNU grep does kind-of strange things with empty-matching patterns. Again, it is not something that most people would worry about, but it's interesting nevertheless:

ggrep 'j?a?v?' Hello.java
<nothing>

but

ggrep 'j*a*v*' Hello.java
<everything>

And with option -o:

ggrep -o 'j?a?v?' tests/Hello.java
<nothing>

but

ggrep -o 'j*a*v*' Hello.java
jav
a
a
a
a
a
v
a
a

With option -o, GNU grep behaves like ugrep (without -Y): it returns the actual matches, not everything.

Note that the j?a?v? results are most likely explained by GNU grep's default basic regex (BRE) syntax, in which an unescaped ? is a literal character rather than an operator, whereas * is an operator in both BRE and ERE.

By default, ugrep rejects empty matches, so a pattern such as a*b*c* returns only actual matches, such as a, b, c, bbccc, aab and so on. To optimize this default behavior, the pattern matching engine can be modified to treat the pattern as if its minimum match length were 1, not 0. The engine then captures a match when an a, b, or c is found in the input, while skipping over everything else.
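
The idea can be sketched in a few lines of Python. This is only an illustration with Python's re module, not ugrep's actual implementation: the function nonempty_matches and its first_chars parameter are hypothetical, and the set of characters that can start a non-empty match is supplied by hand rather than derived from the pattern.

```python
import re

def nonempty_matches(pattern, first_chars, line):
    """Return the non-empty matches of an empty-matching pattern.

    first_chars is an assumption of this sketch: the characters that can
    begin a non-empty match, e.g. "abc" for a*b*c*.
    """
    regex = re.compile(pattern)
    starts = set(first_chars)
    found = []
    pos = 0
    while pos < len(line):
        # Skip over everything that cannot start a match of length >= 1.
        while pos < len(line) and line[pos] not in starts:
            pos += 1
        if pos == len(line):
            break
        m = regex.match(line, pos)
        if m and m.end() > pos:
            found.append(m.group())
            pos = m.end()
        else:
            # Defensive: never loop forever on an unexpected empty match.
            pos += 1
    return found

def count_matching_lines(pattern, first_chars, lines):
    """Count lines with at least one non-empty match, like -c without -Y."""
    return sum(1 for line in lines if nonempty_matches(pattern, first_chars, line))

if __name__ == "__main__":
    print(nonempty_matches(r"a*b*c*", "abc", "xaabyzbbcccq"))
    print(count_matching_lines(r"a*b*c*", "abc", ["xyz", "cab", ""]))
```

The gain comes from the inner skip loop: positions that cannot start a match of length 1 or more are passed over cheaply, without running the pattern matcher at every position only to find an empty match.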

The modification is possible in ugrep without too much effort and should improve matching speed for these kinds of "empty-matching patterns". I will post a timing comparison using my dev version to demonstrate what we're talking about.


genivia-inc · 2023-08-28T16:55:43Z

The difference in performance is significant, as expected. A simple example without optimizations takes 0.98 seconds:

time ugrep -c 'j?a?v?a?' benchmarks/corpi/enwik8
692119
0.970u 0.017s 0:00.98 100.0%	0+0k 0+0io 0pf+0w

The same example with two proposed optimizations (hacks) applied takes 0.08 seconds, about 12 times faster:

time ugrep -c 'j?a?v?a?' benchmarks/corpi/enwik8
692119
0.074u 0.011s 0:00.08 100.0%	0+0k 0+0io 0pf+0w

Note: GNU grep counts all lines with -c in this case, as if matching the empty pattern '', as explained above. Matching everything is quick, but not useful. I hope this technique will make such patterns more useful. When ugrep is aliased to grep, option -Y is enabled and matches everything like GNU grep, which is not so useful, but compatible.

genivia-inc added the enhancement New feature or request label Aug 28, 2023

genivia-inc pinned this issue Aug 28, 2023

genivia-inc unpinned this issue Aug 28, 2023

genivia-inc mentioned this issue Aug 31, 2023

Faster ugrep 4.1 performance report (preview - not yet released) #289

Closed

genivia-inc closed this as completed in 4af28c8 Sep 17, 2023
