Add lexer recovery; improve string/number lexing; derive Eq/Hash #19
Conversation
The previous lexer implementation was brittle. It would halt at the first sign of an error, preventing the parser from identifying other issues in the file. It was also too permissive with invalid number formats and inefficient when parsing certain string patterns. This patch hardens the lexer by introducing an optional recovery mode to gracefully handle errors and continue tokenizing, which allows for more complete diagnostics in a single pass. Validation for numeric literals is now stricter, and string parsing has been made more performant.
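The sticky-versus-recovery behavior can be illustrated with a toy lexer. This is a minimal sketch, not the crate's API; `lex_with_recovery` and `Token` are hypothetical names. In sticky mode the first bad character ends the scan; in recovery mode the lexer records the error, skips one character, and keeps tokenizing.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Word(String),
    Error(char),
}

fn lex_with_recovery(input: &str, recover: bool) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = input.chars().peekable();
    while let Some(&c) = chars.peek() {
        if c.is_ascii_alphabetic() {
            let mut word = String::new();
            while let Some(&c) = chars.peek() {
                if !c.is_ascii_alphabetic() {
                    break;
                }
                word.push(c);
                chars.next();
            }
            tokens.push(Token::Word(word));
        } else if c.is_whitespace() {
            chars.next();
        } else {
            // Record the error token either way.
            tokens.push(Token::Error(c));
            if !recover {
                break; // sticky: halt at the first error
            }
            chars.next(); // recovery: skip one char and continue
        }
    }
    tokens
}
```

The payoff is that a single pass over a file with several errors reports all of them instead of only the first.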
The doc comment parsing logic has been updated to be less destructive. Previously, it would trim all leading and trailing whitespace from comment lines. The new implementation only removes a single optional leading space, preserving intentional indentation and formatting. This change aligns the parser with an updated lexer behavior where the `///` marker is stripped earlier in the pipeline.
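Assuming the `///` marker is already stripped by the lexer, the new rule reduces to a one-line helper; `extract_doc_text` here is an illustrative stand-in for the parser's helper, not its exact signature.

```rust
/// Remove at most one leading space from a doc comment's inner text,
/// preserving any further indentation (including tabs).
fn extract_doc_text(inner: &str) -> String {
    inner.strip_prefix(' ').unwrap_or(inner).to_string()
}
```

For example, `extract_doc_text("  indented")` keeps the second space, so intentionally indented doc lines round-trip intact.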
Core AST nodes such as `Ident`, `TypeRef`, and `QualifiedIdent` now derive `PartialEq`, `Eq`, and `Hash`. This allows them to be compared for equality and used in hash-based collections, a prerequisite for upcoming semantic analysis features. This also includes minor cleanup by removing several unnecessary clippy `expect` attributes from test modules.
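As a sketch of what the derives enable (the struct here is an assumption for illustration, not the crate's actual `Ident` layout):

```rust
use std::collections::HashSet;

// Illustrative stand-in for an AST node; the real `Ident` has different fields.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Ident {
    text: String,
}

// With Eq + Hash derived, nodes can serve as set/map keys, which is what
// deduplication and symbol-table lookups in semantic analysis rely on.
fn unique_idents(idents: Vec<Ident>) -> HashSet<Ident> {
    idents.into_iter().collect()
}
```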
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/core/scanner/lexer.rs (1)
884-915: Numeric lexer allows multiple exponents (e.g., `1e2e3`)

The current loop permits more than one 'e'/'E' because there is no `has_exponent` guard. That lexes invalid numbers as a single literal. Apply this fix:

```diff
 impl NumberLiteralRecognizer {
 @@
     pub fn new() -> Self {
         Self
     }
 }
 @@
     fn consume(
         &self,
         input: &mut dyn CharacterStream,
     ) -> Result<TokenType, LexError> {
         let mut number = String::new();
         let mut has_dot = false;
+        let mut has_exponent = false;
 @@
-                'e' | 'E'
-                    if number
+                'e' | 'E'
+                    if !has_exponent
+                        && number
                         .chars()
                         .last()
                         .is_some_and(|last| last.is_ascii_digit()) =>
                 {
 @@
-                    // Valid exponent header; consume it
+                    // Valid exponent header; consume it
                     number.push(ch);
                     input.advance(); // 'e' / 'E'
+                    has_exponent = true;
                     if let Some(sign) = input.current()
                         && (sign == '+' || sign == '-')
                     {
                         number.push(sign);
                         input.advance();
                     }
                 }
```

Add a test like `assert_eq!(lex("1e2e3"), ["Literal(\"1e2\")", "Identifier(\"e3\")", ...])` or expect an error per your policy.
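A standalone sketch of the guarded loop shows how `1e2e3` splits; `lex_number` and its return shape are illustrative, not the crate's recognizer, which works on a character stream.

```rust
// Consume the longest numeric prefix, accepting at most one exponent marker.
// Returns the lexed literal and the unconsumed remainder of the input.
fn lex_number(input: &str) -> (String, &str) {
    let mut number = String::new();
    let mut has_exponent = false;
    let mut iter = input.char_indices().peekable();
    while let Some(&(i, ch)) = iter.peek() {
        match ch {
            '0'..='9' => {
                number.push(ch);
                iter.next();
            }
            'e' | 'E'
                if !has_exponent
                    && number.chars().last().is_some_and(|c| c.is_ascii_digit()) =>
            {
                number.push(ch);
                has_exponent = true; // the guard: only one exponent marker
                iter.next();
                // Optional sign directly after the marker.
                if let Some(&(_, sign)) = iter.peek() {
                    if sign == '+' || sign == '-' {
                        number.push(sign);
                        iter.next();
                    }
                }
            }
            _ => return (number, &input[i..]),
        }
    }
    (number, "")
}
```

With the guard in place, the second `e` in `1e2e3` fails the match arm, so the literal ends at `1e2` and `e3` is left for the identifier recognizer.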
🧹 Nitpick comments (8)
src/core/scanner/tokens.rs (1)
169-172: Literal Display spacing: good; consider aligning Identifier too

The change makes Literal formatting consistent with the other variants. Identifier still prints as `` Identifier( ` `` (no space). Consider aligning for uniformity.

Apply this outside-range tweak if desired:

```diff
-        let _ = f.write_str("Identifier( `");
+        let _ = f.write_str("Identifier ( `");
```

src/core/parser/ast/mod.rs (1)
610-616: Derives on QualifiedIdent (PartialEq, Eq, Hash): confirm span-in-equality intent

Equality/hash include `span`. If you want "semantic" equality that ignores positions, consider a span-less newtype or manual impls.

src/core/scanner/lexer.rs (3)
568-591: Static keyword table via OnceLock: solid; consider phf for zero-cost init

Runtime init is one-time and fine. If you want compile-time construction, `phf` could remove the map build. Not necessary now.
1315-1322: Invalid exponent cases: good; consider adding `1e2e3`

Extends the negative tests; please add the multiple-exponent case post-fix.
1595-1615: Invalid numbers loop: OK; broaden the set

Covers mixed outcomes (error or split). Add `1e2e3` here after fixing the exponent guard.

src/core/parser/components/doc_integration_tests.rs (1)
20-23: Span end-column should account for the optional space and Unicode; avoid silent overflow

Right now end = 4 + text.len() (bytes). If the inner text doesn't start with a space, spans will be off by 1; for non-ASCII, byte length ≠ columns. Consider this:

```diff
-            TokenType::DocComment(text.to_string()),
-            (line, 1),
-            (line, 4 + u32::try_from(text.len()).unwrap_or(0)),
+            TokenType::DocComment(text.to_string()),
+            (line, 1),
+            (line, {
+                let prefix = 3 + u32::from(text.starts_with(' ')); // "///" + optional single space
+                let width = u32::try_from(text.chars().count()).unwrap_or(u32::MAX);
+                prefix.saturating_add(width)
+            }),
```

src/core/parser/components/helpers.rs (1)
29-38: Doc extraction rule (remove at most one leading space) looks good; compact implementation

Equivalent but simpler and branchless:

```diff
-    if let TokenType::DocComment(text) = token.r#type() {
-        if let Some(rest) = text.strip_prefix(' ') {
-            Some(rest.to_string())
-        } else {
-            Some(text.to_string())
-        }
-    } else {
-        None
-    }
+    if let TokenType::DocComment(text) = token.r#type() {
+        Some(text.strip_prefix(' ').unwrap_or(text).to_string())
+    } else {
+        None
+    }
```

src/core/parser/components/expressions.rs (1)
1443-1444: Helper now emits inner-text DocComment; tests still use the "///" prefix

Update tests to construct TokenType::DocComment with the inner text (e.g. "doc comment") instead of "/// doc comment".
Occurrences: src/core/parser/components/members.rs:1737,1761,1783,1806,1825,1846,1865; src/core/parser/components/declarations.rs:1152,1164,1176,1188,1200.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
- src/core/parser/ast/mod.rs (5 hunks)
- src/core/parser/components/attributes.rs (0 hunks)
- src/core/parser/components/declarations.rs (0 hunks)
- src/core/parser/components/doc_integration_tests.rs (1 hunks)
- src/core/parser/components/expressions.rs (5 hunks)
- src/core/parser/components/helpers.rs (2 hunks)
- src/core/parser/components/members.rs (0 hunks)
- src/core/parser/components/primitives.rs (0 hunks)
- src/core/parser/components/types.rs (0 hunks)
- src/core/parser/config.rs (0 hunks)
- src/core/parser/progress.rs (0 hunks)
- src/core/parser/schema_parser.rs (0 hunks)
- src/core/parser/stream.rs (0 hunks)
- src/core/scanner/lexer.rs (15 hunks)
- src/core/scanner/tokens.rs (3 hunks)
💤 Files with no reviewable changes (9)
- src/core/parser/stream.rs
- src/core/parser/schema_parser.rs
- src/core/parser/config.rs
- src/core/parser/components/primitives.rs
- src/core/parser/components/attributes.rs
- src/core/parser/progress.rs
- src/core/parser/components/types.rs
- src/core/parser/components/members.rs
- src/core/parser/components/declarations.rs
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-07T04:24:02.234Z
Learnt from: mkpro118
PR: mkpro118/prisma-rs#12
File: src/core/scanner/tokens.rs:119-188
Timestamp: 2025-09-07T04:24:02.234Z
Learning: In src/core/scanner/tokens.rs, the developer prefers using qualified names like `TokenType::String` instead of importing variants into the use statement to avoid confusion with Rust's built-in types like `String`. This design choice prioritizes code clarity over brevity.
Applied to files:
src/core/scanner/tokens.rs
📚 Learning: 2025-09-07T04:23:03.787Z
Learnt from: mkpro118
PR: mkpro118/prisma-rs#12
File: src/core/scanner/lexer.rs:561-579
Timestamp: 2025-09-07T04:23:03.787Z
Learning: In the lexer implementation for the prisma-rs project, the API contract expects TokenRecognizer methods `can_handle` and `consume` to be called sequentially without stream modification between calls. Violating this contract (e.g., modifying the stream) is considered user error and panic behavior is acceptable rather than defensive programming with safe lookups.
Applied to files:
src/core/scanner/lexer.rs
🧬 Code graph analysis (3)
src/core/parser/components/helpers.rs (1)
src/core/parser/components/doc_integration_tests.rs (1)
tok(27-29)
src/core/parser/components/expressions.rs (1)
src/core/parser/components/helpers.rs (1)
extract_doc_text(32-42)
src/core/scanner/lexer.rs (1)
src/core/scanner/tokens.rs (3)
new (270-288), span (298-300), r#type (292-294)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
- GitHub Check: Test beta on ubuntu-latest --no-default-features
- GitHub Check: Test nightly on macos-latest --all-features
- GitHub Check: Test nightly on macos-latest
- GitHub Check: Test stable on macos-latest
- GitHub Check: Test nightly on windows-latest --all-features
- GitHub Check: Test nightly on ubuntu-latest --all-features
- GitHub Check: Test nightly on ubuntu-latest
- GitHub Check: Test stable on windows-latest
- GitHub Check: Test nightly on windows-latest
- GitHub Check: Cargo Check (windows-latest, nightly, --all-features)
- GitHub Check: Cargo Check (windows-latest, beta)
- GitHub Check: Cargo Check (ubuntu-latest, nightly, --all-features)
- GitHub Check: Cargo Check (ubuntu-latest, beta, --no-default-features)
- GitHub Check: Cargo Check (windows-latest, beta, --all-features)
- GitHub Check: Cargo Check (windows-latest, stable)
- GitHub Check: Security Audit
- GitHub Check: Security Audit
🔇 Additional comments (27)
src/core/scanner/tokens.rs (2)
199-205: Eq/Hash derives on SymbolLocation: LGTM

Deriving Eq/Hash is safe on u32 fields and unlocks map/set usage. No behavioral risk.

227-233: Eq/Hash derives on SymbolSpan: LGTM

Consistent with Location; enables hashing spans in diagnostics and tests.
src/core/parser/ast/mod.rs (4)
663-669: Derives on TypeRef: good and consistent

Enums with inner nodes are now comparable/hashable and work with downstream collections. If any caches or keys rely on "type-only" identity, confirm that including spans is intended.

716-722: Derives on NamedType: LGTM

Hash/Eq include `span`; same caveat as above.

746-752: Derives on ListType: LGTM

`Box<TypeRef>` hashing is fine (hashes the pointee). No issues.

1080-1085: Derives on Ident: LGTM

Hash/Eq on a small struct; enables use as map keys without extra wrappers.
src/core/scanner/lexer.rs (14)
15-23: Clear error-recovery docs

Good explanation of sticky vs. advancing behavior. Matches the implementation below.

338-356: Recover-on-error flag wiring: LGTM

Field added and defaulted to false. No API break.

367-376: default_for_input_with_recovery: good ergonomics

Simple constructor to flip recovery at creation time.

377-383: with_error_recovery builder: LGTM

Fluent toggle; doesn't consume external state.

422-439: Advance-on-error path: correct and minimal

Advances a single char only when recovery is enabled; preserves the error span and message. Matches the contract from previous learnings.
625-631: KeywordRecognizer uses static map: LGTM

Indexing with `[]` is safe given the prior `contains_key` and our can_handle/consume contract.

Also applies to: 638-640
781-801: String literal escape handling: correct even/odd backslash logic

Tracks backslash runs and treats quotes accordingly. Good.
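The even/odd rule can be sketched independently of the lexer. This is a hypothetical helper over a plain `&str` (the real implementation works on a character stream): a `"` terminates the string only when the run of backslashes immediately before it has even length, since an odd run means the quote itself is escaped.

```rust
// Find the byte index of the closing quote in a string body (the text
// after the opening quote), honoring backslash escapes.
fn find_string_end(body: &str) -> Option<usize> {
    let mut backslashes = 0;
    for (i, ch) in body.char_indices() {
        match ch {
            '\\' => backslashes += 1,
            // Even run of preceding backslashes: the quote is unescaped.
            '"' if backslashes % 2 == 0 => return Some(i),
            // Any other char (or an escaped quote) resets the run.
            _ => backslashes = 0,
        }
    }
    None
}
```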
1265-1273: Perf sanity test for long backslash runs: nice

Covers the new escape logic under stress.

1303-1314: Exponent-with-sign tests: good coverage

Validates `e+` handling.

1331-1345: Error span precision tests: excellent

Asserts the 'e' column exactly. Helps tooling UX.

1347-1359: Minus + exponent span test: LGTM

Covers the signed mantissa case.

1370-1390: `1.` and `-1.` split behavior: correct and tested

Ensures the dot becomes a separate token when no digit follows.

1393-1403: Even/odd backslash before quote: good tests

Directly validates the new string-terminator rule.

1532-1564: Recovery mode tests: great UX differentiation

Confirms the sticky default vs. opt-in progress. Clear behavior.
src/core/parser/components/doc_integration_tests.rs (1)
20-21: Switching DocComment to inner text is correct

Matches helpers::extract_doc_text semantics; the expectations below look consistent.
src/core/parser/components/helpers.rs (2)
128-134: Tests updated to the inner-text convention are accurate

Covers the basic variants; assertions match the new rule.

136-149: Nice edge-case coverage

Verifies "remove only one space" and preserves tabs and other whitespace.
src/core/parser/components/expressions.rs (4)
1640-1644: Whitespace-only docs expectations look correct

Confirms single leading-space removal and preservation of the remaining spaces.

1703-1710: Prefix-space case validated

Matches helpers::extract_doc_text behavior.

1727-1735: Extra-leading-spaces case validated

Only one space is stripped; trailing spaces are preserved. Good.

1753-1766: Docs span test updated to inner-text tokens; assertions remain correct

Span logic is unaffected by the inner-text switch.