[ENH] Optimize regex algorithm #4624

Sicheng-Pan · 2025-05-23T19:03:39Z

Description of changes

Summarize the changes made by this PR.

Improvements & Bug fixes
- Refactors the regex matching algorithm. Instead of finding all possible matches, this implementation tries to find a single match for each candidate algorithm
New functionality
- N/A

Test plan

How are these changes tested?

Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

Sicheng-Pan · 2025-05-23T19:03:57Z

[ENH] Optimize regex algorithm #4624 👈 (View in Graphite)
[ENH] Prefetch block by prefixes #4623
[TST] Regex testing #4570
main

This stack of pull requests is managed by Graphite. Learn more about stacking.

github-actions · 2025-05-23T19:03:58Z

propel-code-bot · 2025-05-23T19:05:55Z

Major Refactor: Optimized Regex Matching Algorithm with Improved Efficiency and Documentation

This PR overhauls the regex literal matching algorithm, shifting the approach from finding all possible matches to identifying a single valid match for each candidate document. Major algorithmic and structural changes were made to the NgramLiteralProvider, including the introduction of prefix/suffix lookup tables, a restructured matching flow, improved n-gram prefetching, and detailed inline documentation to clarify the new logic and high-level algorithm. Additional supporting changes were made to enable prefetching at the storage/index layer, and minor improvements to property-based test utilities.

Key Changes:
• Complete refactor of the NgramLiteralProvider's match_literal_with_mask method to use sliding n-gram windows and prefix/suffix tables to efficiently trace document matches.
• Introduced PrefixSuffixLookupTable and corresponding logic to facilitate fast prefix/suffix lookups during regex matching across document n-grams.
• Added a prefetch_ngrams method to the trait and implemented it in FullTextIndexReader to reduce repeated storage access and improve performance.
• Substantially expanded algorithmic comments and code-level documentation, detailing each phase of the new matching algorithm.
• Minor fix in property-based test strategies: regex pattern for literals switched to use '\w{3,}' to improve correctness.

Affected Areas:
• rust/types/src/regex/literal_expr.rs (core regex matching engine, trait definition, algorithm, documentation)
• rust/index/src/fulltext/types.rs (NgramLiteralProvider storage implementation for prefetch support)
• rust/types/src/strategies.rs (test utilities)

Potential Impact:

Functionality: Significant change to precise regex matching behavior; algorithm should yield exact, efficient matches rather than all possibilities. Matching logic now avoids unnecessary combinatorics and traverses candidate docs more efficiently.

Performance: Performance is expected to improve considerably due to batch n-gram prefetching, minimization of candidate search space, and optimized lookup strategies. However, added complexity may introduce new bottlenecks if not thoroughly tested.

Security: No direct security impact identified, but any complexity in regex handling can potentially introduce denial-of-service vectors if not robustly bounded. Needs review for edge-case starvation or panics.

Scalability: Improvements should support handling much larger document and n-gram volumes efficiently, reducing system load under heavy search loads. Scalability bottlenecks should be re-evaluated after this change.

Review Focus:
• Correctness of the new n-gram matching algorithm and its guarantees (should match exactly one sequence per document candidate, not all possible matches).
• Thread-safety, memory safety, and potential for subtle async/concurrency issues (especially in prefetching).
• Adequacy and clarity of the in-code documentation-does it fully explain all complex branches and edge handling?
• Test coverage for new and changed paths, especially around unicode and edge literal cases.
• Check for possible performance regressions or excessive allocations.

Testing Needed

• Run and extend the property-based and unit tests for regex literal and fulltext matching.
• Test edge cases: empty literals, long or highly branching n-grams, unicode edge cases, and very large document sets.
• Benchmark performance changes versus previous implementation.
• Validate that all storage and index concurrency/async boundaries are respected after the changes.

Code Quality Assessment

rust/types/src/regex/literal_expr.rs: Major improvement in in-code algorithm documentation and structure. Use of Vec, String, and HashMap is correct, but reviewer notes possible use of .or_default() for code simplification. Iterative DFS logic may benefit from further comments or consideration for recursive abstraction.

rust/index/src/fulltext/types.rs: Prefetch logic cleanly added. Correct integration with async and blockfile reader APIs.

rust/types/src/strategies.rs: Small regex bug fix; trivial and safe.

Best Practices

Rust:
• Algorithm documentation is highly detailed and aligned with engineering best practices.
• Separation of concerns between storage/index and matching engine layers is preserved.
• Traits and interfaces extended in a backward-compatible, additive way.
• Test scaffolding updated for regression safety.

Potential Issues

• Iterative DFS logic for prefix/suffix tracing may be non-obvious to future maintainers-may require more commentary or a recursive variant if stack depths are not prohibitive.
• Some comments suggest (and reviewers request) additional explanation at certain control flow branches-unclear cases may cause confusion.
• Potential for subtle off-by-one or unicode boundary errors in slicing string literals or offset calculations, especially when handling edge n-gram cases.
• Changes are extensive; any mismatches in trait/caller/async usage or incorrect prefetch may cause silent regressions.

This summary was automatically generated by @propel-code-bot

propel-code-bot · 2025-05-27T18:38:54Z

rust/types/src/regex/literal_expr.rs

+                lookup_table
+                    .prefix
+                    .entry(prefix)
+                    .or_insert_with(|| Vec::with_capacity(ngrams.len()))


[PerformanceOptimization]

Consider using entry().or_default() instead of entry().or_insert_with(|| Vec::with_capacity(ngrams.len())) for simpler code. If the performance difference is important, you could benchmark both approaches.

HammadB

I think we should leave a descriptive comment and better document what this code is doing.

sanketkedia · 2025-05-29T21:46:33Z

rust/types/src/regex/literal_expr.rs

+                        .get(min_lookup_table_index + suffix_pos_idx.len() + 1)
+                    {
+                        Some(table) => table,
+                        None => {


add comments explaining this case?

sanketkedia · 2025-05-29T22:54:43Z

rust/types/src/regex/literal_expr.rs

+                    Vec::with_capacity(lookup_table_vec.len() - min_lookup_table_index);
+                let suffix_offset = ngram.char_indices().nth(1).unwrap_or_default().0;
+                suffix_pos_idx.push((&ngram[suffix_offset..], pos + suffix_offset as u32, 0));
+                while let Some((suffix, match_pos, ngram_index)) = suffix_pos_idx.pop() {


worth adding comments here. this is an iterative DFS right? Maybe a recursive impl is more readable

This was referenced May 23, 2025

[TST] Regex testing #4570

Merged

[ENH] Prefetch block by prefixes #4623

Merged

Sicheng-Pan marked this pull request as ready for review May 23, 2025 19:05

Sicheng-Pan force-pushed the sicheng/05-21-_enh_prefetch_block_by_prefixes branch from f3cedac to 78dbbb3 Compare May 23, 2025 19:51

Sicheng-Pan force-pushed the sicheng/05-22-_enh_optimize_regex_algorithm branch from 24a6118 to 541b15f Compare May 23, 2025 19:51

Sicheng-Pan force-pushed the sicheng/05-21-_enh_prefetch_block_by_prefixes branch from 78dbbb3 to d4d29e1 Compare May 27, 2025 18:37

Sicheng-Pan force-pushed the sicheng/05-22-_enh_optimize_regex_algorithm branch from 7e3ef43 to 01591ed Compare May 27, 2025 18:37

propel-code-bot bot reviewed May 27, 2025

View reviewed changes

HammadB reviewed May 27, 2025

View reviewed changes

Sicheng-Pan force-pushed the sicheng/05-21-_enh_prefetch_block_by_prefixes branch from d4d29e1 to a924a12 Compare May 27, 2025 21:55

Sicheng-Pan force-pushed the sicheng/05-22-_enh_optimize_regex_algorithm branch 2 times, most recently from cb6e724 to 75f854a Compare May 28, 2025 21:59

Sicheng-Pan force-pushed the sicheng/05-21-_enh_prefetch_block_by_prefixes branch from a924a12 to 2f5ee22 Compare May 28, 2025 21:59

Sicheng-Pan force-pushed the sicheng/05-22-_enh_optimize_regex_algorithm branch from 75f854a to 2c668c1 Compare May 29, 2025 17:11

Sicheng-Pan force-pushed the sicheng/05-21-_enh_prefetch_block_by_prefixes branch from 2f5ee22 to 66fba61 Compare May 29, 2025 17:11

sanketkedia reviewed May 29, 2025

View reviewed changes

sanketkedia approved these changes May 29, 2025

View reviewed changes

Sicheng-Pan changed the base branch from sicheng/05-21-_enh_prefetch_block_by_prefixes to graphite-base/4624 May 29, 2025 23:03

Sicheng-Pan force-pushed the graphite-base/4624 branch from 66fba61 to 9c46a5b Compare May 29, 2025 23:06

Sicheng-Pan force-pushed the sicheng/05-22-_enh_optimize_regex_algorithm branch from 2c668c1 to db7de67 Compare May 29, 2025 23:06

Sicheng-Pan changed the base branch from graphite-base/4624 to sicheng/05-21-_enh_prefetch_block_by_prefixes May 29, 2025 23:06

Sicheng-Pan changed the base branch from sicheng/05-21-_enh_prefetch_block_by_prefixes to graphite-base/4624 May 30, 2025 18:07

Sicheng-Pan force-pushed the sicheng/05-22-_enh_optimize_regex_algorithm branch from db7de67 to d4deee3 Compare May 30, 2025 18:07

Sicheng-Pan force-pushed the graphite-base/4624 branch from 9c46a5b to b2c40df Compare May 30, 2025 18:07

graphite-app bot changed the base branch from graphite-base/4624 to main May 30, 2025 18:07

Sicheng-Pan force-pushed the sicheng/05-22-_enh_optimize_regex_algorithm branch from d4deee3 to feab77c Compare May 30, 2025 18:08

Sicheng-Pan force-pushed the sicheng/05-22-_enh_optimize_regex_algorithm branch from feab77c to 3863558 Compare May 30, 2025 18:29

sicheng added 7 commits May 30, 2025 17:09

[ENH] Optimize regex algorithm

d2c876b

Implement extend matching

0f1ae7e

Fix bugs

03568e7

Do not use cursor approach

ff4b3ae

Remove unnecessary sort

053c8df

Fix unicode support

b4bb449

Add some documentation for the algorithm

b6beffd

Sicheng-Pan force-pushed the sicheng/05-22-_enh_optimize_regex_algorithm branch from 3863558 to b6beffd Compare May 31, 2025 00:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[ENH] Optimize regex algorithm #4624

[ENH] Optimize regex algorithm #4624

Uh oh!

Sicheng-Pan commented May 23, 2025 •

edited

Loading

Uh oh!

Sicheng-Pan commented May 23, 2025 •

edited

Loading

Uh oh!

github-actions bot commented May 23, 2025

Uh oh!

propel-code-bot bot commented May 23, 2025 •

edited

Loading

Uh oh!

propel-code-bot bot May 27, 2025

Uh oh!

HammadB left a comment

Uh oh!

sanketkedia May 29, 2025 •

edited

Loading

Uh oh!

sanketkedia May 29, 2025

Uh oh!

Uh oh!

[ENH] Optimize regex algorithm #4624

Are you sure you want to change the base?

[ENH] Optimize regex algorithm #4624

Uh oh!

Conversation

Sicheng-Pan commented May 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description of changes

Test plan

Documentation Changes

Uh oh!

Sicheng-Pan commented May 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented May 23, 2025

Reviewer Checklist

Testing, Bugs, Errors, Logs, Documentation

System Compatibility

Quality

Uh oh!

propel-code-bot bot commented May 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

propel-code-bot bot May 27, 2025

Choose a reason for hiding this comment

Uh oh!

HammadB left a comment

Choose a reason for hiding this comment

Uh oh!

sanketkedia May 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sanketkedia May 29, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Sicheng-Pan commented May 23, 2025 •

edited

Loading

Sicheng-Pan commented May 23, 2025 •

edited

Loading

propel-code-bot bot commented May 23, 2025 •

edited

Loading

sanketkedia May 29, 2025 •

edited

Loading