Skip to content

[ENH] Optimize regex algorithm #4624

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

Sicheng-Pan
Copy link
Contributor

@Sicheng-Pan Sicheng-Pan commented May 23, 2025

Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • Refactors the regex matching algorithm. Instead of finding all possible matches, this implementation tries to find a single match for each candidate algorithm
  • New functionality
    • N/A

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

This was referenced May 23, 2025
Copy link
Contributor Author

Sicheng-Pan commented May 23, 2025

This stack of pull requests is managed by Graphite. Learn more about stacking.

Copy link

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

@Sicheng-Pan Sicheng-Pan marked this pull request as ready for review May 23, 2025 19:05
Copy link
Contributor

propel-code-bot bot commented May 23, 2025

Major Refactor: Optimized Regex Matching Algorithm with Improved Efficiency and Documentation

This PR overhauls the regex literal matching algorithm, shifting the approach from finding all possible matches to identifying a single valid match for each candidate document. Major algorithmic and structural changes were made to the NgramLiteralProvider, including the introduction of prefix/suffix lookup tables, a restructured matching flow, improved n-gram prefetching, and detailed inline documentation to clarify the new logic and high-level algorithm. Additional supporting changes were made to enable prefetching at the storage/index layer, and minor improvements to property-based test utilities.

Key Changes:
• Complete refactor of the NgramLiteralProvider's match_literal_with_mask method to use sliding n-gram windows and prefix/suffix tables to efficiently trace document matches.
• Introduced PrefixSuffixLookupTable and corresponding logic to facilitate fast prefix/suffix lookups during regex matching across document n-grams.
• Added a prefetch_ngrams method to the trait and implemented it in FullTextIndexReader to reduce repeated storage access and improve performance.
• Substantially expanded algorithmic comments and code-level documentation, detailing each phase of the new matching algorithm.
• Minor fix in property-based test strategies: regex pattern for literals switched to use '\w{3,}' to improve correctness.

Affected Areas:
• rust/types/src/regex/literal_expr.rs (core regex matching engine, trait definition, algorithm, documentation)
• rust/index/src/fulltext/types.rs (NgramLiteralProvider storage implementation for prefetch support)
• rust/types/src/strategies.rs (test utilities)

Potential Impact:

Functionality: Significant change to precise regex matching behavior; algorithm should yield exact, efficient matches rather than all possibilities. Matching logic now avoids unnecessary combinatorics and traverses candidate docs more efficiently.

Performance: Performance is expected to improve considerably due to batch n-gram prefetching, minimization of candidate search space, and optimized lookup strategies. However, added complexity may introduce new bottlenecks if not thoroughly tested.

Security: No direct security impact identified, but any complexity in regex handling can potentially introduce denial-of-service vectors if not robustly bounded. Needs review for edge-case starvation or panics.

Scalability: Improvements should support handling much larger document and n-gram volumes efficiently, reducing system load under heavy search loads. Scalability bottlenecks should be re-evaluated after this change.

Review Focus:
• Correctness of the new n-gram matching algorithm and its guarantees (should match exactly one sequence per document candidate, not all possible matches).
• Thread-safety, memory safety, and potential for subtle async/concurrency issues (especially in prefetching).
• Adequacy and clarity of the in-code documentation-does it fully explain all complex branches and edge handling?
• Test coverage for new and changed paths, especially around unicode and edge literal cases.
• Check for possible performance regressions or excessive allocations.

Testing Needed

• Run and extend the property-based and unit tests for regex literal and fulltext matching.
• Test edge cases: empty literals, long or highly branching n-grams, unicode edge cases, and very large document sets.
• Benchmark performance changes versus previous implementation.
• Validate that all storage and index concurrency/async boundaries are respected after the changes.

Code Quality Assessment

rust/types/src/regex/literal_expr.rs: Major improvement in in-code algorithm documentation and structure. Use of Vec, String, and HashMap is correct, but reviewer notes possible use of .or_default() for code simplification. Iterative DFS logic may benefit from further comments or consideration for recursive abstraction.

rust/index/src/fulltext/types.rs: Prefetch logic cleanly added. Correct integration with async and blockfile reader APIs.

rust/types/src/strategies.rs: Small regex bug fix; trivial and safe.

Best Practices

Rust:
• Algorithm documentation is highly detailed and aligned with engineering best practices.
• Separation of concerns between storage/index and matching engine layers is preserved.
• Traits and interfaces extended in a backward-compatible, additive way.
• Test scaffolding updated for regression safety.

Potential Issues

• Iterative DFS logic for prefix/suffix tracing may be non-obvious to future maintainers-may require more commentary or a recursive variant if stack depths are not prohibitive.
• Some comments suggest (and reviewers request) additional explanation at certain control flow branches-unclear cases may cause confusion.
• Potential for subtle off-by-one or unicode boundary errors in slicing string literals or offset calculations, especially when handling edge n-gram cases.
• Changes are extensive; any mismatches in trait/caller/async usage or incorrect prefetch may cause silent regressions.

This summary was automatically generated by @propel-code-bot

@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-21-_enh_prefetch_block_by_prefixes branch from f3cedac to 78dbbb3 Compare May 23, 2025 19:51
@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-22-_enh_optimize_regex_algorithm branch from 24a6118 to 541b15f Compare May 23, 2025 19:51
@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-21-_enh_prefetch_block_by_prefixes branch from 78dbbb3 to d4d29e1 Compare May 27, 2025 18:37
@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-22-_enh_optimize_regex_algorithm branch from 7e3ef43 to 01591ed Compare May 27, 2025 18:37
lookup_table
.prefix
.entry(prefix)
.or_insert_with(|| Vec::with_capacity(ngrams.len()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[PerformanceOptimization]

Consider using entry().or_default() instead of entry().or_insert_with(|| Vec::with_capacity(ngrams.len())) for simpler code. If the performance difference is important, you could benchmark both approaches.

Copy link
Collaborator

@HammadB HammadB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should leave a descriptive comment and better document what this code is doing.

@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-21-_enh_prefetch_block_by_prefixes branch from d4d29e1 to a924a12 Compare May 27, 2025 21:55
@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-22-_enh_optimize_regex_algorithm branch 2 times, most recently from cb6e724 to 75f854a Compare May 28, 2025 21:59
@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-21-_enh_prefetch_block_by_prefixes branch from a924a12 to 2f5ee22 Compare May 28, 2025 21:59
@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-22-_enh_optimize_regex_algorithm branch from 75f854a to 2c668c1 Compare May 29, 2025 17:11
@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-21-_enh_prefetch_block_by_prefixes branch from 2f5ee22 to 66fba61 Compare May 29, 2025 17:11
.get(min_lookup_table_index + suffix_pos_idx.len() + 1)
{
Some(table) => table,
None => {
Copy link
Contributor

@sanketkedia sanketkedia May 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add comments explaining this case?

Vec::with_capacity(lookup_table_vec.len() - min_lookup_table_index);
let suffix_offset = ngram.char_indices().nth(1).unwrap_or_default().0;
suffix_pos_idx.push((&ngram[suffix_offset..], pos + suffix_offset as u32, 0));
while let Some((suffix, match_pos, ngram_index)) = suffix_pos_idx.pop() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth adding comments here. this is an iterative DFS right? Maybe a recursive impl is more readable

@Sicheng-Pan Sicheng-Pan changed the base branch from sicheng/05-21-_enh_prefetch_block_by_prefixes to graphite-base/4624 May 29, 2025 23:03
@Sicheng-Pan Sicheng-Pan force-pushed the graphite-base/4624 branch from 66fba61 to 9c46a5b Compare May 29, 2025 23:06
@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-22-_enh_optimize_regex_algorithm branch from 2c668c1 to db7de67 Compare May 29, 2025 23:06
@Sicheng-Pan Sicheng-Pan changed the base branch from graphite-base/4624 to sicheng/05-21-_enh_prefetch_block_by_prefixes May 29, 2025 23:06
@Sicheng-Pan Sicheng-Pan changed the base branch from sicheng/05-21-_enh_prefetch_block_by_prefixes to graphite-base/4624 May 30, 2025 18:07
@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-22-_enh_optimize_regex_algorithm branch from db7de67 to d4deee3 Compare May 30, 2025 18:07
@Sicheng-Pan Sicheng-Pan force-pushed the graphite-base/4624 branch from 9c46a5b to b2c40df Compare May 30, 2025 18:07
@graphite-app graphite-app bot changed the base branch from graphite-base/4624 to main May 30, 2025 18:07
@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-22-_enh_optimize_regex_algorithm branch from d4deee3 to feab77c Compare May 30, 2025 18:08
@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-22-_enh_optimize_regex_algorithm branch from feab77c to 3863558 Compare May 30, 2025 18:29
@Sicheng-Pan Sicheng-Pan force-pushed the sicheng/05-22-_enh_optimize_regex_algorithm branch from 3863558 to b6beffd Compare May 31, 2025 00:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants