
Conversation


@mkpro118 mkpro118 commented Sep 14, 2025

Summary by CodeRabbit

  • New Features
    • Optional lexer error recovery to continue scanning after errors.
    • Added equality and hashing support for key syntax nodes and source span types (enables efficient comparisons and lookups).
  • Bug Fixes
    • Correct handling of escaped quotes in string literals, including long backslash sequences.
    • Stricter numeric exponent parsing with clearer errors when digits are missing.
  • Style
    • Minor display formatting tweak for literal token representation.
  • Tests
    • Expanded coverage for error recovery, string escaping, and numeric exponents; removed redundant lint suppressions.

The previous lexer implementation was brittle. It would halt at the
first sign of an error, preventing the parser from identifying other
issues in the file. It was also too permissive with invalid number
formats and inefficient when parsing certain string patterns.

This patch hardens the lexer by introducing an optional recovery mode
to gracefully handle errors and continue tokenizing, which allows for
more complete diagnostics in a single pass.
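The recovery loop can be sketched in miniature (a toy scanner, not the crate's actual Lexer API; the accepted character set here is invented for illustration): on an unexpected character, record a diagnostic and advance one position instead of halting, so one pass reports every error.

```rust
// Toy sketch of error recovery: collect tokens and errors in one pass.
fn scan_with_recovery(input: &str) -> (Vec<char>, Vec<(usize, char)>) {
    let mut tokens = Vec::new();
    let mut errors = Vec::new();
    for (i, ch) in input.char_indices() {
        if ch.is_ascii_digit() || ch == '+' {
            tokens.push(ch); // toy "token": a single accepted char
        } else {
            // Sticky (default) mode would stop here; recovery records
            // the error and advances one character.
            errors.push((i, ch));
        }
    }
    (tokens, errors)
}

fn main() {
    let (tokens, errors) = scan_with_recovery("1?2!3");
    assert_eq!(tokens, vec!['1', '2', '3']);
    assert_eq!(errors, vec![(1, '?'), (3, '!')]);
}
```

Both bad characters surface in a single scan, which is the diagnostic win the opt-in mode is after.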

Validation for numeric literals is now stricter, and string parsing
has been made more efficient.

The doc comment parsing logic has been updated to be less destructive.
Previously, it would trim all leading and trailing whitespace from comment
lines. The new implementation only removes a single optional leading space,
preserving intentional indentation and formatting. This change aligns the
parser with an updated lexer behavior where the `///` marker is stripped
earlier in the pipeline.
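The single-space rule can be demonstrated with a small standalone helper (a sketch mirroring, not reproducing, the parser's extract_doc_text):

```rust
/// Remove at most one leading space from a doc comment's inner text,
/// preserving intentional indentation (tabs, extra spaces).
fn doc_inner(text: &str) -> &str {
    text.strip_prefix(' ').unwrap_or(text)
}

fn main() {
    assert_eq!(doc_inner(" summary"), "summary");
    assert_eq!(doc_inner("   indented"), "  indented"); // only one space removed
    assert_eq!(doc_inner("\ttabbed"), "\ttabbed");      // tabs left alone
}
```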

Core AST nodes such as `Ident`, `TypeRef`, and `QualifiedIdent` now
derive `PartialEq`, `Eq`, and `Hash`. This allows them to be compared for
equality and used in hash-based collections, a prerequisite for upcoming
semantic analysis features.
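A minimal illustration of what the derives enable, using a simplified stand-in for Ident (the real node also carries a span):

```rust
use std::collections::HashMap;

// Simplified stand-in for the AST's `Ident`; the derives are what make
// it usable as a HashMap key without wrapper types.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Ident {
    text: String,
}

fn main() {
    let mut symbols: HashMap<Ident, u32> = HashMap::new();
    symbols.insert(Ident { text: "User".into() }, 1);
    // Lookup works because Eq and Hash agree on the same fields.
    assert_eq!(symbols.get(&Ident { text: "User".into() }), Some(&1));
}
```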

This also includes minor cleanup by removing several unnecessary clippy
`expect` attributes from test modules.
@mkpro118 mkpro118 self-assigned this Sep 14, 2025

coderabbitai bot commented Sep 14, 2025

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Title Check: ⚠️ Warning
Explanation: The provided title "@coderabbitai" is not a descriptive PR title and does not summarize the changes in the changeset; it reads like a username/handle rather than a concise description. The changeset clearly makes substantive edits (adding PartialEq/Eq/Hash derives to AST types, introducing lexer error-recovery and string/exponent handling improvements, and updating tests), so the title should reflect the main change. As written, it fails the requirement that a title be a short, clear sentence highlighting the most important change.
Resolution: Please replace the title with a short, specific sentence describing the primary change; for example: "Add lexer error recovery and improve string/exponent handling; add Eq/Hash derives for AST nodes" or "Enhance scanner and parser: lexer recovery, string escape fixes, and AST Eq/Hash derives". Ensure the final title is concise, focuses on the most important change, and avoids usernames or non-descriptive text.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit’s high-level summary is enabled.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.


@coderabbitai coderabbitai bot changed the title @coderabbitai Add lexer recovery; improve string/number lexing; derive Eq/Hash Sep 14, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/core/scanner/lexer.rs (1)

884-915: Numeric lexer allows multiple exponents (e.g., 1e2e3)

The current loop permits more than one 'e'/'E' because there is no has_exponent guard, so invalid numbers such as 1e2e3 are lexed as a single literal.

Apply this fix:

 impl NumberLiteralRecognizer {
@@
     pub fn new() -> Self {
         Self
     }
 }
@@
     fn consume(
         &self,
         input: &mut dyn CharacterStream,
     ) -> Result<TokenType, LexError> {
         let mut number = String::new();
         let mut has_dot = false;
+        let mut has_exponent = false;
@@
-                'e' | 'E'
-                    if number
+                'e' | 'E'
+                    if !has_exponent
+                        && number
                         .chars()
                         .last()
                         .is_some_and(|last| last.is_ascii_digit()) =>
                 {
@@
-                    // Valid exponent header; consume it
+                    // Valid exponent header; consume it
                     number.push(ch);
                     input.advance(); // 'e' / 'E'
+                    has_exponent = true;
                     if let Some(sign) = input.current()
                         && (sign == '+' || sign == '-')
                     {
                         number.push(sign);
                         input.advance();
                     }
                 }

Add a test like assert_eq!(lex("1e2e3"), ["Literal(\"1e2\")","Identifier(\"e3\")", ...]) or expect an error per your policy.
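The guard can also be exercised in isolation with a simplified scanner; this sketch omits the real recognizer's CharacterStream, dot handling, and missing-digit errors, keeping only the single-exponent rule.

```rust
// Simplified scanner isolating the single-exponent guard: a second
// 'e'/'E' ends the literal instead of being absorbed into it.
fn split_number(input: &str) -> (&str, &str) {
    let b = input.as_bytes();
    let mut has_exponent = false;
    let mut i = 0;
    while i < b.len() {
        match b[i] {
            b'0'..=b'9' => i += 1,
            b'e' | b'E' if !has_exponent && i > 0 && b[i - 1].is_ascii_digit() => {
                has_exponent = true; // at most one exponent marker
                i += 1;
                if i < b.len() && (b[i] == b'+' || b[i] == b'-') {
                    i += 1; // optional sign directly after 'e'/'E'
                }
            }
            _ => break,
        }
    }
    input.split_at(i)
}

fn main() {
    assert_eq!(split_number("1e2e3"), ("1e2", "e3"));
    assert_eq!(split_number("1e+5"), ("1e+5", ""));
}
```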

🧹 Nitpick comments (8)
src/core/scanner/tokens.rs (1)

169-172: Literal Display spacing: good; consider aligning Identifier too

The change makes Literal formatting consistent with the other variants. Identifier still prints without the space (as "Identifier( `"); consider aligning it for uniformity.

Apply outside-range tweak if desired:

-                let _ = f.write_str("Identifier( `");
+                let _ = f.write_str("Identifier ( `");
src/core/parser/ast/mod.rs (1)

610-616: Derives on QualifiedIdent (PartialEq, Eq, Hash): confirm span-in-equality intent

Equality/hash include span. If you want “semantic” equality ignoring positions, consider a span-less newtype or manual impls.
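A possible shape for span-less equality, should it be wanted later (field names hypothetical, not the crate's actual definitions):

```rust
use std::hash::{Hash, Hasher};

#[derive(Debug, Clone)]
#[allow(dead_code)]
struct Span { start: u32, end: u32 }

#[derive(Debug, Clone)]
struct Ident { text: String, span: Span }

// Compare and hash by name only; `span` is deliberately ignored so two
// occurrences of the same identifier compare equal regardless of position.
impl PartialEq for Ident {
    fn eq(&self, other: &Self) -> bool { self.text == other.text }
}
impl Eq for Ident {}
impl Hash for Ident {
    fn hash<H: Hasher>(&self, state: &mut H) { self.text.hash(state); }
}

fn main() {
    let a = Ident { text: "id".into(), span: Span { start: 0, end: 2 } };
    let b = Ident { text: "id".into(), span: Span { start: 9, end: 11 } };
    assert_eq!(a, b); // equal despite different spans
}
```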

src/core/scanner/lexer.rs (3)

568-591: Static keyword table via OnceLock: solid; consider phf for zero-cost init

Runtime init is one-time and fine. If you want compile-time, phf could remove the map build. Not necessary now.
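For reference, the OnceLock pattern in question looks roughly like this (the keyword set shown is illustrative, not the project's actual list):

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

// One-time-init keyword table: built on first access, shared thereafter.
fn keywords() -> &'static HashMap<&'static str, &'static str> {
    static KEYWORDS: OnceLock<HashMap<&'static str, &'static str>> = OnceLock::new();
    KEYWORDS.get_or_init(|| {
        HashMap::from([
            ("model", "Model"),
            ("enum", "Enum"),
            ("datasource", "Datasource"),
        ])
    })
}

fn main() {
    assert_eq!(keywords().get("model"), Some(&"Model"));
    assert!(!keywords().contains_key("not_a_keyword"));
}
```

A phf map would move the build to compile time, at the cost of an extra dependency and build step; for a table this small the difference is negligible.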


1315-1322: Invalid exponent cases: good; consider adding 1e2e3

Extends the negative tests; please add a multiple-exponent case (e.g., 1e2e3) once the guard fix lands.


1595-1615: Invalid numbers loop: OK; broaden set

Covers mixed outcomes (error or split). Add 1e2e3 here after fixing exponent guard.

src/core/parser/components/doc_integration_tests.rs (1)

20-23: Span end-column should account for optional space and Unicode; avoid silent overflow.

Right now end = 4 + text.len() (bytes). If the inner text doesn’t start with a space, spans will be off by 1; for non-ASCII, byte length ≠ columns. Consider this:

-        TokenType::DocComment(text.to_string()),
-        (line, 1),
-        (line, 4 + u32::try_from(text.len()).unwrap_or(0)),
+        TokenType::DocComment(text.to_string()),
+        (line, 1),
+        (line, {
+            let prefix = 3 + u32::from(text.starts_with(' ')); // "///" + optional single space
+            let width = u32::try_from(text.chars().count()).unwrap_or(u32::MAX);
+            prefix.saturating_add(width)
+        }),
src/core/parser/components/helpers.rs (1)

29-38: Doc extraction rule (remove at most one leading space) looks good; compact implementation.

Equivalent but simpler and branchless:

-    if let TokenType::DocComment(text) = token.r#type() {
-        if let Some(rest) = text.strip_prefix(' ') {
-            Some(rest.to_string())
-        } else {
-            Some(text.to_string())
-        }
-    } else {
-        None
-    }
+    if let TokenType::DocComment(text) = token.r#type() {
+        Some(text.strip_prefix(' ').unwrap_or(text).to_string())
+    } else {
+        None
+    }
src/core/parser/components/expressions.rs (1)

1443-1444: Helper now emits inner-text DocComment — tests still use the "///" prefix

Update tests to construct TokenType::DocComment with the inner text (e.g. "doc comment") instead of "/// doc comment".
Occurrences: src/core/parser/components/members.rs:1737,1761,1783,1806,1825,1846,1865; src/core/parser/components/declarations.rs:1152,1164,1176,1188,1200.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 67493e6 and 6e7f964.

📒 Files selected for processing (15)
  • src/core/parser/ast/mod.rs (5 hunks)
  • src/core/parser/components/attributes.rs (0 hunks)
  • src/core/parser/components/declarations.rs (0 hunks)
  • src/core/parser/components/doc_integration_tests.rs (1 hunks)
  • src/core/parser/components/expressions.rs (5 hunks)
  • src/core/parser/components/helpers.rs (2 hunks)
  • src/core/parser/components/members.rs (0 hunks)
  • src/core/parser/components/primitives.rs (0 hunks)
  • src/core/parser/components/types.rs (0 hunks)
  • src/core/parser/config.rs (0 hunks)
  • src/core/parser/progress.rs (0 hunks)
  • src/core/parser/schema_parser.rs (0 hunks)
  • src/core/parser/stream.rs (0 hunks)
  • src/core/scanner/lexer.rs (15 hunks)
  • src/core/scanner/tokens.rs (3 hunks)
💤 Files with no reviewable changes (9)
  • src/core/parser/stream.rs
  • src/core/parser/schema_parser.rs
  • src/core/parser/config.rs
  • src/core/parser/components/primitives.rs
  • src/core/parser/components/attributes.rs
  • src/core/parser/progress.rs
  • src/core/parser/components/types.rs
  • src/core/parser/components/members.rs
  • src/core/parser/components/declarations.rs
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-07T04:24:02.234Z
Learnt from: mkpro118
PR: mkpro118/prisma-rs#12
File: src/core/scanner/tokens.rs:119-188
Timestamp: 2025-09-07T04:24:02.234Z
Learning: In src/core/scanner/tokens.rs, the developer prefers using qualified names like `TokenType::String` instead of importing variants into the use statement to avoid confusion with Rust's built-in types like `String`. This design choice prioritizes code clarity over brevity.

Applied to files:

  • src/core/scanner/tokens.rs
📚 Learning: 2025-09-07T04:23:03.787Z
Learnt from: mkpro118
PR: mkpro118/prisma-rs#12
File: src/core/scanner/lexer.rs:561-579
Timestamp: 2025-09-07T04:23:03.787Z
Learning: In the lexer implementation for the prisma-rs project, the API contract expects TokenRecognizer methods `can_handle` and `consume` to be called sequentially without stream modification between calls. Violating this contract (e.g., modifying the stream) is considered user error and panic behavior is acceptable rather than defensive programming with safe lookups.

Applied to files:

  • src/core/scanner/lexer.rs
🧬 Code graph analysis (3)
src/core/parser/components/helpers.rs (1)
src/core/parser/components/doc_integration_tests.rs (1)
  • tok (27-29)
src/core/parser/components/expressions.rs (1)
src/core/parser/components/helpers.rs (1)
  • extract_doc_text (32-42)
src/core/scanner/lexer.rs (1)
src/core/scanner/tokens.rs (3)
  • new (270-288)
  • span (298-300)
  • r#type (292-294)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
  • GitHub Check: Test beta on ubuntu-latest --no-default-features
  • GitHub Check: Test nightly on macos-latest --all-features
  • GitHub Check: Test nightly on macos-latest
  • GitHub Check: Test stable on macos-latest
  • GitHub Check: Test nightly on windows-latest --all-features
  • GitHub Check: Test nightly on ubuntu-latest --all-features
  • GitHub Check: Test nightly on ubuntu-latest
  • GitHub Check: Test stable on windows-latest
  • GitHub Check: Test nightly on windows-latest
  • GitHub Check: Cargo Check (windows-latest, nightly, --all-features)
  • GitHub Check: Cargo Check (windows-latest, beta)
  • GitHub Check: Cargo Check (ubuntu-latest, nightly, --all-features)
  • GitHub Check: Cargo Check (ubuntu-latest, beta, --no-default-features)
  • GitHub Check: Cargo Check (windows-latest, beta, --all-features)
  • GitHub Check: Cargo Check (windows-latest, stable)
  • GitHub Check: Security Audit
  • GitHub Check: Security Audit
🔇 Additional comments (27)
src/core/scanner/tokens.rs (2)

199-205: Eq/Hash derives on SymbolLocation: LGTM

Deriving Eq/Hash is safe on u32 fields and unlocks map/set usage. No behavioral risk.


227-233: Eq/Hash derives on SymbolSpan: LGTM

Consistent with Location; enables hashing spans in diagnostics and tests.

src/core/parser/ast/mod.rs (4)

663-669: Derives on TypeRef: good and consistent

Enums with inner nodes now comparable/hashable. Works with downstream collections.

If any caches/keys rely on “type-only” identity, confirm inclusion of spans is intended.


716-722: Derives on NamedType: LGTM

Hash/Eq include span; same caveat as above.


746-752: Derives on ListType: LGTM

Box<TypeRef> hashing is fine (hashes pointee). No issues.


1080-1085: Derives on Ident: LGTM

Hash/Eq on small struct; enables use as map keys without extra wrappers.

src/core/scanner/lexer.rs (14)

15-23: Clear error-recovery docs

Good explanation of sticky vs. advancing behavior. Matches implementation below.


338-356: Recover-on-error flag wiring: LGTM

Field added and defaulted to false. No API break.


367-376: default_for_input_with_recovery: good ergonomics

Simple constructor to flip recovery at creation time.


377-383: with_error_recovery builder: LGTM

Fluent toggle; doesn’t consume external state.


422-439: Advance-on-error path: correct and minimal

Advances a single char only when recovery is enabled; preserves error span/message. Matches contract from previous learnings.


625-631: KeywordRecognizer uses static map: LGTM

Indexing with [] is safe given prior contains_key and our can_handle/consume contract.

Also applies to: 638-640


781-801: String literal escape handling: correct even/odd backslash logic

Tracks backslash runs and treats quotes accordingly. Good.
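The rule can be stated as a standalone scan (a sketch, not the lexer's implementation): a quote terminates the string only when the run of immediately preceding backslashes has even length.

```rust
// Find the index of the closing quote, honoring backslash escapes.
fn find_string_end(s: &str) -> Option<usize> {
    let mut run = 0usize; // length of the current backslash run
    for (i, ch) in s.char_indices() {
        match ch {
            '\\' => run += 1,
            '"' if run % 2 == 0 => return Some(i),
            // Any other char, including an escaped quote, resets the run.
            _ => run = 0,
        }
    }
    None
}

fn main() {
    assert_eq!(find_string_end(r#"ab\"cd""#), Some(6)); // \" is escaped
    assert_eq!(find_string_end(r#"\\\\""#), Some(4));   // even run closes
    assert_eq!(find_string_end("no terminator"), None);
}
```

Counting the run instead of peeking one character back is what keeps long backslash sequences linear-time.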


1265-1273: Perf sanity test for long backslashes: nice

Covers the new escape logic under stress.


1303-1314: Exponent with sign tests: good coverage

Validates e+ handling.


1331-1345: Error span precision tests: excellent

Asserts the 'e' column exactly. Helps tooling UX.


1347-1359: Minus + exponent span test: LGTM

Covers signed mantissa case.


1370-1390: 1. and -1. split behavior: correct and tested

Ensures dot becomes a separate token when no following digit.


1393-1403: Even/odd backslash before quote: good tests

Directly validates the new string-terminator rule.


1532-1564: Recovery mode tests: great UX differentiation

Confirms sticky default vs. opt-in progress. Clear behavior.

src/core/parser/components/doc_integration_tests.rs (1)

20-21: Switching DocComment to inner text is correct.

Matches helpers::extract_doc_text semantics; expectations below look consistent.

src/core/parser/components/helpers.rs (2)

128-134: Tests updated to inner-text convention are accurate.

Covers basic variants; assertions match the new rule.


136-149: Nice edge-case coverage.

Verifies “remove only one space” and preserves tabs/other whitespace.

src/core/parser/components/expressions.rs (4)

1640-1644: Whitespace-only docs expectations look correct.

Confirms single leading-space removal and preservation of remaining spaces.


1703-1710: Prefix-space case validated.

Matches helpers::extract_doc_text behavior.


1727-1735: Extra-leading-spaces case validated.

Only one space stripped; trailing spaces preserved. Good.


1753-1766: Docs span test updated to inner-text tokens; assertions remain correct.

Span logic unaffected by inner-text switch.

@mkpro118 mkpro118 merged commit 54003da into main Sep 14, 2025
72 of 73 checks passed
@mkpro118 mkpro118 deleted the enhance-scanner-and-parser branch September 14, 2025 11:07