Add lexer recovery; improve string/number lexing; derive Eq/Hash #19
Conversation
The previous lexer implementation was brittle. It would halt at the first sign of an error, preventing the parser from identifying other issues in the file. It was also too permissive with invalid number formats and inefficient when parsing certain string patterns. This patch hardens the lexer by introducing an optional recovery mode to gracefully handle errors and continue tokenizing, which allows for more complete diagnostics in a single pass. Validation for numeric literals is now stricter, and string parsing has been made more performant.
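The sticky-versus-recovery behavior can be illustrated with a toy lexer. This is a minimal sketch, not the crate's API; `lex_with_recovery` and `Token` are hypothetical names. In sticky mode the first bad character ends the scan; in recovery mode the lexer records the error, skips one character, and keeps tokenizing.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Word(String),
    Error(char),
}

fn lex_with_recovery(input: &str, recover: bool) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = input.chars().peekable();
    while let Some(&c) = chars.peek() {
        if c.is_ascii_alphabetic() {
            let mut word = String::new();
            while let Some(&c) = chars.peek() {
                if !c.is_ascii_alphabetic() {
                    break;
                }
                word.push(c);
                chars.next();
            }
            tokens.push(Token::Word(word));
        } else if c.is_whitespace() {
            chars.next();
        } else {
            // Record the error token either way.
            tokens.push(Token::Error(c));
            if !recover {
                break; // sticky: halt at the first error
            }
            chars.next(); // recovery: skip one char and continue
        }
    }
    tokens
}
```

The payoff is that a single pass over a file with several errors reports all of them instead of only the first.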
The doc comment parsing logic has been updated to be less destructive. Previously, it would trim all leading and trailing whitespace from comment lines. The new implementation only removes a single optional leading space, preserving intentional indentation and formatting. This change aligns the parser with an updated lexer behavior where the `///` marker is stripped earlier in the pipeline.
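Assuming the `///` marker is already stripped by the lexer, the new rule reduces to a one-line helper; `extract_doc_text` here is an illustrative stand-in for the parser's helper, not its exact signature.

```rust
/// Remove at most one leading space from a doc comment's inner text,
/// preserving any further indentation (including tabs).
fn extract_doc_text(inner: &str) -> String {
    inner.strip_prefix(' ').unwrap_or(inner).to_string()
}
```

For example, `extract_doc_text("  indented")` keeps the second space, so intentionally indented doc lines round-trip intact.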
Core AST nodes such as `Ident`, `TypeRef`, and `QualifiedIdent` now derive `PartialEq`, `Eq`, and `Hash`. This allows them to be compared for equality and used in hash-based collections, a prerequisite for upcoming semantic analysis features. This also includes minor cleanup by removing several unnecessary clippy `expect` attributes from test modules.
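As a sketch of what the derives enable (the struct here is an assumption for illustration, not the crate's actual `Ident` layout):

```rust
use std::collections::HashSet;

// Illustrative stand-in for an AST node; the real `Ident` has different fields.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Ident {
    text: String,
}

// With Eq + Hash derived, nodes can serve as set/map keys, which is what
// deduplication and symbol-table lookups in semantic analysis rely on.
fn unique_idents(idents: Vec<Ident>) -> HashSet<Ident> {
    idents.into_iter().collect()
}
```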
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/core/scanner/lexer.rs (1)
884-915: Numeric lexer allows multiple exponents (e.g., `1e2e3`)

The current loop permits more than one 'e'/'E' because there is no `has_exponent` guard. That lexes invalid numbers as a single literal. Apply this fix:

```diff
 impl NumberLiteralRecognizer {
 @@
     pub fn new() -> Self {
         Self
     }
 }
 @@
     fn consume(
         &self,
         input: &mut dyn CharacterStream,
     ) -> Result<TokenType, LexError> {
         let mut number = String::new();
         let mut has_dot = false;
+        let mut has_exponent = false;
 @@
-                'e' | 'E'
-                    if number
+                'e' | 'E'
+                    if !has_exponent
+                        && number
                         .chars()
                         .last()
                         .is_some_and(|last| last.is_ascii_digit()) =>
                 {
 @@
-                    // Valid exponent header; consume it
+                    // Valid exponent header; consume it
                     number.push(ch);
                     input.advance(); // 'e' / 'E'
+                    has_exponent = true;
                     if let Some(sign) = input.current()
                         && (sign == '+' || sign == '-')
                     {
                         number.push(sign);
                         input.advance();
                     }
                 }
```

Add a test like `assert_eq!(lex("1e2e3"), ["Literal(\"1e2\")", "Identifier(\"e3\")", ...])` or expect an error per your policy.
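A standalone sketch of the guarded loop shows how `1e2e3` splits; `lex_number` and its return shape are illustrative, not the crate's recognizer, which works on a character stream.

```rust
// Consume the longest numeric prefix, accepting at most one exponent marker.
// Returns the lexed literal and the unconsumed remainder of the input.
fn lex_number(input: &str) -> (String, &str) {
    let mut number = String::new();
    let mut has_exponent = false;
    let mut iter = input.char_indices().peekable();
    while let Some(&(i, ch)) = iter.peek() {
        match ch {
            '0'..='9' => {
                number.push(ch);
                iter.next();
            }
            'e' | 'E'
                if !has_exponent
                    && number.chars().last().is_some_and(|c| c.is_ascii_digit()) =>
            {
                number.push(ch);
                has_exponent = true; // the guard: only one exponent marker
                iter.next();
                // Optional sign directly after the marker.
                if let Some(&(_, sign)) = iter.peek() {
                    if sign == '+' || sign == '-' {
                        number.push(sign);
                        iter.next();
                    }
                }
            }
            _ => return (number, &input[i..]),
        }
    }
    (number, "")
}
```

With the guard in place, the second `e` in `1e2e3` fails the match arm, so the literal ends at `1e2` and `e3` is left for the identifier recognizer.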
🧹 Nitpick comments (8)
src/core/scanner/tokens.rs (1)
169-172: Literal Display spacing: good; consider aligning Identifier too

The change makes Literal formatting consistent with the other variants. Identifier still prints as `` Identifier( ` `` (no space). Consider aligning for uniformity.

Apply this outside-range tweak if desired:

```diff
-        let _ = f.write_str("Identifier( `");
+        let _ = f.write_str("Identifier ( `");
```

src/core/parser/ast/mod.rs (1)
610-616: Derives on QualifiedIdent (PartialEq, Eq, Hash): confirm span-in-equality intent

Equality/hash include `span`. If you want "semantic" equality that ignores positions, consider a span-less newtype or manual impls.

src/core/scanner/lexer.rs (3)
568-591: Static keyword table via OnceLock: solid; consider phf for zero-cost init

Runtime init is one-time and fine. If you want compile-time construction, `phf` could remove the map build. Not necessary now.
1315-1322: Invalid exponent cases: good; consider adding `1e2e3`

Extends the negative tests; please add the multiple-exponent case post-fix.
1595-1615: Invalid numbers loop: OK; broaden the set

Covers mixed outcomes (error or split). Add `1e2e3` here after fixing the exponent guard.

src/core/parser/components/doc_integration_tests.rs (1)
20-23: Span end-column should account for the optional space and Unicode; avoid silent overflow

Right now end = 4 + text.len() (bytes). If the inner text doesn't start with a space, spans will be off by 1; for non-ASCII, byte length ≠ columns. Consider this:

```diff
-            TokenType::DocComment(text.to_string()),
-            (line, 1),
-            (line, 4 + u32::try_from(text.len()).unwrap_or(0)),
+            TokenType::DocComment(text.to_string()),
+            (line, 1),
+            (line, {
+                let prefix = 3 + u32::from(text.starts_with(' ')); // "///" + optional single space
+                let width = u32::try_from(text.chars().count()).unwrap_or(u32::MAX);
+                prefix.saturating_add(width)
+            }),
```

src/core/parser/components/helpers.rs (1)
29-38: Doc extraction rule (remove at most one leading space) looks good; compact implementation

Equivalent but simpler and branchless:

```diff
-    if let TokenType::DocComment(text) = token.r#type() {
-        if let Some(rest) = text.strip_prefix(' ') {
-            Some(rest.to_string())
-        } else {
-            Some(text.to_string())
-        }
-    } else {
-        None
-    }
+    if let TokenType::DocComment(text) = token.r#type() {
+        Some(text.strip_prefix(' ').unwrap_or(text).to_string())
+    } else {
+        None
+    }
```

src/core/parser/components/expressions.rs (1)
1443-1444: Helper now emits inner-text DocComment; tests still use the "///" prefix

Update tests to construct TokenType::DocComment with the inner text (e.g. "doc comment") instead of "/// doc comment".
Occurrences: src/core/parser/components/members.rs:1737,1761,1783,1806,1825,1846,1865; src/core/parser/components/declarations.rs:1152,1164,1176,1188,1200.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
- src/core/parser/ast/mod.rs (5 hunks)
- src/core/parser/components/attributes.rs (0 hunks)
- src/core/parser/components/declarations.rs (0 hunks)
- src/core/parser/components/doc_integration_tests.rs (1 hunks)
- src/core/parser/components/expressions.rs (5 hunks)
- src/core/parser/components/helpers.rs (2 hunks)
- src/core/parser/components/members.rs (0 hunks)
- src/core/parser/components/primitives.rs (0 hunks)
- src/core/parser/components/types.rs (0 hunks)
- src/core/parser/config.rs (0 hunks)
- src/core/parser/progress.rs (0 hunks)
- src/core/parser/schema_parser.rs (0 hunks)
- src/core/parser/stream.rs (0 hunks)
- src/core/scanner/lexer.rs (15 hunks)
- src/core/scanner/tokens.rs (3 hunks)
💤 Files with no reviewable changes (9)
- src/core/parser/stream.rs
- src/core/parser/schema_parser.rs
- src/core/parser/config.rs
- src/core/parser/components/primitives.rs
- src/core/parser/components/attributes.rs
- src/core/parser/progress.rs
- src/core/parser/components/types.rs
- src/core/parser/components/members.rs
- src/core/parser/components/declarations.rs
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-07T04:24:02.234Z
Learnt from: mkpro118
PR: mkpro118/prisma-rs#12
File: src/core/scanner/tokens.rs:119-188
Timestamp: 2025-09-07T04:24:02.234Z
Learning: In src/core/scanner/tokens.rs, the developer prefers using qualified names like `TokenType::String` instead of importing variants into the use statement to avoid confusion with Rust's built-in types like `String`. This design choice prioritizes code clarity over brevity.
Applied to files:
src/core/scanner/tokens.rs
📚 Learning: 2025-09-07T04:23:03.787Z
Learnt from: mkpro118
PR: mkpro118/prisma-rs#12
File: src/core/scanner/lexer.rs:561-579
Timestamp: 2025-09-07T04:23:03.787Z
Learning: In the lexer implementation for the prisma-rs project, the API contract expects TokenRecognizer methods `can_handle` and `consume` to be called sequentially without stream modification between calls. Violating this contract (e.g., modifying the stream) is considered user error and panic behavior is acceptable rather than defensive programming with safe lookups.
Applied to files:
src/core/scanner/lexer.rs
🧬 Code graph analysis (3)
src/core/parser/components/helpers.rs (1)
src/core/parser/components/doc_integration_tests.rs (1)
tok(27-29)
src/core/parser/components/expressions.rs (1)
src/core/parser/components/helpers.rs (1)
extract_doc_text(32-42)
src/core/scanner/lexer.rs (1)
src/core/scanner/tokens.rs (3)
new (270-288), span (298-300), r#type (292-294)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
- GitHub Check: Test beta on ubuntu-latest --no-default-features
- GitHub Check: Test nightly on macos-latest --all-features
- GitHub Check: Test nightly on macos-latest
- GitHub Check: Test stable on macos-latest
- GitHub Check: Test nightly on windows-latest --all-features
- GitHub Check: Test nightly on ubuntu-latest --all-features
- GitHub Check: Test nightly on ubuntu-latest
- GitHub Check: Test stable on windows-latest
- GitHub Check: Test nightly on windows-latest
- GitHub Check: Cargo Check (windows-latest, nightly, --all-features)
- GitHub Check: Cargo Check (windows-latest, beta)
- GitHub Check: Cargo Check (ubuntu-latest, nightly, --all-features)
- GitHub Check: Cargo Check (ubuntu-latest, beta, --no-default-features)
- GitHub Check: Cargo Check (windows-latest, beta, --all-features)
- GitHub Check: Cargo Check (windows-latest, stable)
- GitHub Check: Security Audit
- GitHub Check: Security Audit
🔇 Additional comments (27)
src/core/scanner/tokens.rs (2)
199-205: Eq/Hash derives on SymbolLocation: LGTM

Deriving Eq/Hash is safe on u32 fields and unlocks map/set usage. No behavioral risk.

227-233: Eq/Hash derives on SymbolSpan: LGTM

Consistent with Location; enables hashing spans in diagnostics and tests.
src/core/parser/ast/mod.rs (4)
663-669: Derives on TypeRef: good and consistent

Enums with inner nodes are now comparable/hashable and work with downstream collections. If any caches or keys rely on "type-only" identity, confirm that including spans is intended.

716-722: Derives on NamedType: LGTM

Hash/Eq include `span`; same caveat as above.

746-752: Derives on ListType: LGTM

`Box<TypeRef>` hashing is fine (hashes the pointee). No issues.

1080-1085: Derives on Ident: LGTM

Hash/Eq on a small struct; enables use as map keys without extra wrappers.
src/core/scanner/lexer.rs (14)
15-23: Clear error-recovery docs

Good explanation of sticky vs. advancing behavior. Matches the implementation below.

338-356: Recover-on-error flag wiring: LGTM

Field added and defaulted to false. No API break.

367-376: default_for_input_with_recovery: good ergonomics

Simple constructor to flip recovery at creation time.

377-383: with_error_recovery builder: LGTM

Fluent toggle; doesn't consume external state.

422-439: Advance-on-error path: correct and minimal

Advances a single char only when recovery is enabled; preserves the error span and message. Matches the contract from previous learnings.
625-631: KeywordRecognizer uses static map: LGTM

Indexing with `[]` is safe given the prior `contains_key` and our can_handle/consume contract.

Also applies to: 638-640
781-801: String literal escape handling: correct even/odd backslash logic

Tracks backslash runs and treats quotes accordingly. Good.
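The even/odd rule can be sketched independently of the lexer. This is a hypothetical helper over a plain `&str` (the real implementation works on a character stream): a `"` terminates the string only when the run of backslashes immediately before it has even length, since an odd run means the quote itself is escaped.

```rust
// Find the byte index of the closing quote in a string body (the text
// after the opening quote), honoring backslash escapes.
fn find_string_end(body: &str) -> Option<usize> {
    let mut backslashes = 0;
    for (i, ch) in body.char_indices() {
        match ch {
            '\\' => backslashes += 1,
            // Even run of preceding backslashes: the quote is unescaped.
            '"' if backslashes % 2 == 0 => return Some(i),
            // Any other char (or an escaped quote) resets the run.
            _ => backslashes = 0,
        }
    }
    None
}
```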
1265-1273: Perf sanity test for long backslash runs: nice

Covers the new escape logic under stress.

1303-1314: Exponent-with-sign tests: good coverage

Validates `e+` handling.

1331-1345: Error span precision tests: excellent

Asserts the 'e' column exactly. Helps tooling UX.

1347-1359: Minus + exponent span test: LGTM

Covers the signed mantissa case.

1370-1390: `1.` and `-1.` split behavior: correct and tested

Ensures the dot becomes a separate token when no digit follows.

1393-1403: Even/odd backslash before quote: good tests

Directly validates the new string-terminator rule.

1532-1564: Recovery mode tests: great UX differentiation

Confirms the sticky default vs. opt-in progress. Clear behavior.
src/core/parser/components/doc_integration_tests.rs (1)
20-21: Switching DocComment to inner text is correct

Matches helpers::extract_doc_text semantics; the expectations below look consistent.
src/core/parser/components/helpers.rs (2)
128-134: Tests updated to the inner-text convention are accurate

Covers the basic variants; assertions match the new rule.

136-149: Nice edge-case coverage

Verifies "remove only one space" and preserves tabs and other whitespace.
src/core/parser/components/expressions.rs (4)
1640-1644: Whitespace-only docs expectations look correct

Confirms single leading-space removal and preservation of the remaining spaces.

1703-1710: Prefix-space case validated

Matches helpers::extract_doc_text behavior.

1727-1735: Extra-leading-spaces case validated

Only one space is stripped; trailing spaces are preserved. Good.

1753-1766: Docs span test updated to inner-text tokens; assertions remain correct

Span logic is unaffected by the inner-text switch.