Skip to content

Update v7 coming soon - here is what to expect #432

Closed
@genivia-inc

Description

@genivia-inc

Update v7 overhauls the search engine internals to further increase ugrep's search speeds. This update includes new SIMD logic and improved pattern analysis to collect metrics for the heuristic selection of search algorithms. To evaluate the optimizations, I have been using a benchmark setup for arm64 and x64 for a while now that checks the search speed of multi-word searches from 1 to 1024 words to search of minimal lengths 1 to 8 in a 100MB text file. The performance results collected are also a good indication for the improved regex search speed, because the internal optimizations are identical.

The v7 update runs faster overall. I will show one example. There are many more, but those are similar. I won't bore you with the details. Searching 1 to 32 words (patterns like word1|word2|...|word32) with randomly picked words of length >=3 in a 100MB file (enwik8) is more consistently faster in v7 (new version) for arm64 CPUs versus v6 (old version):

image

Each point in the graph is an average of 100 runs with the same number of words, but different words, of length >=3.

For x64 CPUs:

image

Furthermore, the DFA cut optimization algorithm is improved. This optimization speeds up regex search of patterns with leading repetitions and other forms that tend to be expensive to find, such as [a-zA-Z]*z that match words that end in a z. When the DFA is cut, the DFA-based regex match first looks for a z then looks back to include the leading part. It is not perfect (looking back using a reversed regex would be best), but works very well in general for a wide range of cases in the benchmarks I've created to specifically evaluate and optimize DFA cuts. Perhaps I will publish the details in an article at a later date.

The v7 update also corrects minor regex anchor and word boundary matching issues I found in more specialized and extensive testing setups, which may occur when multiple anchors/bounaries are used that may require backtracking. Backtracking is very costly, so ugrep backtracks in a limited way to avoid slowdowns. Normally this does not matter when using one anchor/boundary in a regex, but when multiple are used in the same pattern then things get complicated. Update v7 strikes a better balance to support regex anchors and word boundary placements within a regex pattern, but does not backtrack unlimited. As a result of limited backtracking, there may be special cases of patterns with multiple anchors and word boundaries for which the pattern match results could differ from Perl (PCRE) matching, such as showing shorter pattern matches.

Verification with code inspections, correctness testing and benchmarking are critical, which takes time, often more time than coming up with new ideas and implementations in the first place.

There is always more that can be done. Making this project more perfect is an endless pursuit. I will update below if other things come to mind for update v7 that are insightful to share.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions