-
Notifications
You must be signed in to change notification settings - Fork 1.6k
[ENH] Optimize regex algorithm #4624
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. |
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
Major Refactor: Optimized Regex Matching Algorithm with Improved Efficiency and Documentation This PR overhauls the regex literal matching algorithm, shifting the approach from finding all possible matches to identifying a single valid match for each candidate document. Major algorithmic and structural changes were made to the NgramLiteralProvider, including the introduction of prefix/suffix lookup tables, a restructured matching flow, improved n-gram prefetching, and detailed inline documentation to clarify the new logic and high-level algorithm. Additional supporting changes were made to enable prefetching at the storage/index layer, and minor improvements to property-based test utilities. Key Changes: Affected Areas: Potential Impact: Functionality: Significant change to precise regex matching behavior; algorithm should yield exact, efficient matches rather than all possibilities. Matching logic now avoids unnecessary combinatorics and traverses candidate docs more efficiently. Performance: Performance is expected to improve considerably due to batch n-gram prefetching, minimization of candidate search space, and optimized lookup strategies. However, added complexity may introduce new bottlenecks if not thoroughly tested. Security: No direct security impact identified, but any complexity in regex handling can potentially introduce denial-of-service vectors if not robustly bounded. Needs review for edge-case starvation or panics. Scalability: Improvements should support handling much larger document and n-gram volumes efficiently, reducing system load under heavy search loads. Scalability bottlenecks should be re-evaluated after this change. Review Focus: Testing Needed• Run and extend the property-based and unit tests for regex literal and fulltext matching. Code Quality Assessmentrust/types/src/regex/literal_expr.rs: Major improvement in in-code algorithm documentation and structure. Use of Vec, String, and HashMap is correct, but reviewer notes possible use of .or_default() for code simplification. Iterative DFS logic may benefit from further comments or consideration for recursive abstraction. rust/index/src/fulltext/types.rs: Prefetch logic cleanly added. Correct integration with async and blockfile reader APIs. rust/types/src/strategies.rs: Small regex bug fix; trivial and safe. Best PracticesRust: Potential Issues• Iterative DFS logic for prefix/suffix tracing may be non-obvious to future maintainers-may require more commentary or a recursive variant if stack depths are not prohibitive. This summary was automatically generated by @propel-code-bot |
f3cedac
to
78dbbb3
Compare
24a6118
to
541b15f
Compare
78dbbb3
to
d4d29e1
Compare
7e3ef43
to
01591ed
Compare
lookup_table | ||
.prefix | ||
.entry(prefix) | ||
.or_insert_with(|| Vec::with_capacity(ngrams.len())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[PerformanceOptimization]
Consider using entry().or_default()
instead of entry().or_insert_with(|| Vec::with_capacity(ngrams.len()))
for simpler code. If the performance difference is important, you could benchmark both approaches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should leave a descriptive comment and better document what this code is doing.
d4d29e1
to
a924a12
Compare
cb6e724
to
75f854a
Compare
a924a12
to
2f5ee22
Compare
75f854a
to
2c668c1
Compare
2f5ee22
to
66fba61
Compare
.get(min_lookup_table_index + suffix_pos_idx.len() + 1) | ||
{ | ||
Some(table) => table, | ||
None => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add comments explaining this case?
Vec::with_capacity(lookup_table_vec.len() - min_lookup_table_index); | ||
let suffix_offset = ngram.char_indices().nth(1).unwrap_or_default().0; | ||
suffix_pos_idx.push((&ngram[suffix_offset..], pos + suffix_offset as u32, 0)); | ||
while let Some((suffix, match_pos, ngram_index)) = suffix_pos_idx.pop() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
worth adding comments here. this is an iterative DFS right? Maybe a recursive impl is more readable
66fba61
to
9c46a5b
Compare
2c668c1
to
db7de67
Compare
db7de67
to
d4deee3
Compare
9c46a5b
to
b2c40df
Compare
d4deee3
to
feab77c
Compare
feab77c
to
3863558
Compare
3863558
to
b6beffd
Compare
Description of changes
Summarize the changes made by this PR.
Test plan
How are these changes tested?
pytest
for python,yarn test
for js,cargo test
for rustDocumentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?