Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(coverage): enable regexp in test262 #4242

Closed
wants to merge 200 commits into from
Closed

feat(coverage): enable regexp in test262 #4242

wants to merge 200 commits into from

Conversation

Boshen
Copy link
Member

@Boshen Boshen commented Jul 13, 2024

@leaysgur This enables all test262 regexp tests, feel free to decide when's the right time to integrate.

It seems like we need to add some pointing spans on the diagnostics.

It's somewhat slow to run just c, so I always use just example parser for local development.

let span = self.start_span();

// split out pattern
let (pattern_end, flags) = self.read_regex();
let pattern_start = self.cur_token().start + 1; // +1 to exclude `/`
let pattern = &self.source_text[pattern_start as usize..pattern_end as usize];
if let Err(diagnostic) = PatternParser::new(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reparsing each regex eagerly in place sounds not reasonable, can we introduce another visit pass (enabled by a option) and reparse each regex then emit diagnostic at the end?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most downstream users of parser may not care about whether the regex semantic is correct or not like formater, bundler.

Copy link
Contributor

@leaysgur leaysgur Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But actually, I'm also concerned about how to finish this PR.

With current code,

  • oxc_parser parses RegExp and reports errors, but does not seem to hold the parsed results
  • RegExp literals(/abc/) are checked, but RegExp object calls(new RegExp("abc")) are not checked

@Boshen What were your original plans?

Copy link
Contributor

@leaysgur leaysgur Aug 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've thought about it a little, and to organize my thoughts..., I'll answer my own questions.

RegExp literals(/abc/) are checked, but RegExp object calls(new RegExp("abc")) are not checked

It's OK.
In the case of new RegExp("string"), the code is just parsed, and if there’s an error in RegExp, it will occur at runtime.

On the other hand, in the case of /string/, the syntax must satisfy the requirements of a literal, so it should produce an error during the parsing stage, before runtime.

console.log("START");
const a = new RegExp("a{", "u"); // <- Invalid
console.log("END");

This will log "START".

console.log("START");
const a = /a{/u; // <- Invalid
console.log("END");

This, however, will not log anything.

One thing that concerns me, though, is that if invalid literal is treated as error at the parser stage, it would make it impossible to implement rules like eslint/no-invalid-regexp?

How to reuse parsed result

Perhaps parse it again from Semantic, just like with JSDoc...?

Boshen pushed a commit that referenced this pull request Aug 6, 2024
To make the #4242 tests pass.

(My `RegExp` parser tells me `/as)df/` is invalid syntax. 😂)
@leaysgur
Copy link
Contributor

leaysgur commented Aug 6, 2024

FYI:

  • The previous CodSpeed result was one where the entire source code(!) was passed to the regexp parser, not the regular expression part
  • I have not yet been started parser perf improvements

Benchmark results have been updated now.

I don't think current approach is not the best solution, but the CI is green. 😅

@IWANABETHATGUY
Copy link
Contributor

FYI:

  • The previous CodSpeed result was one where the entire source code(!) was passed to the regexp parser, not the regular expression part
  • I have not yet been started parser perf improvements

Thanks for your explanation

@leaysgur leaysgur mentioned this pull request Aug 19, 2024
4 tasks
@Boshen Boshen force-pushed the regexpp branch 3 times, most recently from a6325ee to 368364d Compare August 20, 2024 02:19
Boshen pushed a commit that referenced this pull request Aug 20, 2024
Part of #1164

## Progress updates 🗞️

Waiting for the review and advice, while thinking how to handle escaped string when `new RegExp(pat)`.

## TODOs

- [x] `RegExp(Literal = Body + Flags)#parse()` structure
- [x] Base `Reader` impl to handle both unicode(u32) and utf-16(u16) units
- [x] Global `Span` and local offset conversion
- [x] Design AST shapes
  - [x] Keep `enum` size small by `Box<'a, T>`
  - [x] Rework AST shapes
- [x] Split body and flags w/ validating literal
- [x] Parse `RegExpFlags`
- [x] Parse `RegExpBody` = `Pattern`
- [x] Parse `Pattern` > `Disjunction`
- [x] Parse `Disjunction` > `Alternative`
- [x] Parse `Alternative` > `Term`
- [x] Parse `Term` > `Assertion`
	- [x] Parse `BoundaryAssertion`
	- [x] Parse `LookaroundAssertion`
- [x] Parse `Term` > `Quantifier`
- [x] Parse `Term` > `Atom`
	- [x] Parse `Atom` > `PatternCharacter`
	- [x] Parse `Atom` > `.`
	- [x] Parse `Atom` > `\AtomEscape`
		- [x] Parse `\AtomEscape` > `DecimalEscape`
		- [x] Parse `\AtomEscape` > `CharacterClassEscape`
			- [x] Parse `CharacterClassEscape` > `\d, \D, \s, \S, \w, \W`
			- [x] Parse `CharacterClassEscape` > `\p{UnicodePropertyValueExpression}, \P{UnicodePropertyValueExpression}`
		- [x] Parse `\AtomEscape` > `CharacterEscape`
			- [x] Parse `CharacterEscape` > `ControlEscape`
			- [x] Parse `CharacterEscape` > `c AsciiLetter`
			- [x] Parse `CharacterEscape` > `0`
			- [x] Parse `CharacterEscape` > `HexEscapeSequence`
			- [x] Parse `CharacterEscape` > `RegExpUnicodeEscapeSequence`
			- [x] Parse `CharacterEscape` > `IdentityEscape`
		- [x] Parse `\AtomEscape` > `kGroupName`
	- [x] Parse `Atom` > `[CharacterClass]`
    	- [x] Parse `[CharacterClass]` > `ClassContents` > `[~UnicodeSetsMode] NonemptyClassRanges`
    	- [x] Parse `[CharacterClass]` > `ClassContents` > `[+UnicodeSetsMode] ClassSetExpression`
          - [x] Parse `ClassSetExpression` > `ClassUnion`
          - [x] Parse `ClassSetExpression` > `ClassIntersection`
          - [x] Parse `ClassSetExpression` > `ClassSubtraction`
          - [x] Parse `ClassSetExpression` > `ClassSetOperand`
          - [x] Parse `ClassSetExpression` > `ClassSetRange`
          - [x] Parse `ClassSetExpression` > `ClassSetCharacter`
	- [x] Parse `Atom` > `(GroupSpecifier)`
	- [x] Parse `Atom` > `(?:Disjunction)`
- [x] Annex B
    - [x] Parse `QuantifiableAssertion`
	- [x] Parse `ExtendedAtom`
      - [x] Parse `ExtendedAtom` > `\ [lookahead = c]`
      - [x] Parse `ExtendedAtom` > `InvalidBracedQuantifier`
      - [x] Parse `ExtendedAtom` > `ExtendedPatternCharacter`
      - [x] Parse `ExtendedAtom` > `\AtomEscape` > `CharacterEscape` > `LegacyOctalEscapeSequence`
- [x] Early errors
	- [x] Pattern :: Disjunction(1/2)
	- [x] Pattern :: Disjunction(2/2)
	- [x] QuantifierPrefix :: { DecimalDigits , DecimalDigits }
	- [x] ExtendedAtom :: InvalidBracedQuantifier (Annex B)
	- [x] AtomEscape :: k GroupName
	- [x] AtomEscape :: DecimalEscape
	- [x] NonemptyClassRanges :: ClassAtom - ClassAtom ClassContents(1/2)
	- [x] NonemptyClassRanges :: ClassAtom - ClassAtom ClassContents(2/2)
	- [x] NonemptyClassRanges :: ClassAtom - ClassAtom ClassContents(Annex B)
	- [x] NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassContents(1/2)
	- [x] NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassContents(2/2)
	- [x] NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassContents(Annex B)
	- [x] RegExpIdentifierStart :: \ RegExpUnicodeEscapeSequence
	- [x] RegExpIdentifierStart :: UnicodeLeadSurrogate UnicodeTrailSurrogate
	- [x] RegExpIdentifierPart :: \ RegExpUnicodeEscapeSequence
	- [x] RegExpIdentifierPart :: UnicodeLeadSurrogate UnicodeTrailSurrogate
	- [x] UnicodePropertyValueExpression :: UnicodePropertyName = UnicodePropertyValue(1/2)
	- [x] UnicodePropertyValueExpression :: UnicodePropertyName = UnicodePropertyValue(2/2)
	- [x] UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue(1/2)
	- [x] UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue(2/2)
	- [x] CharacterClassEscape :: P{ UnicodePropertyValueExpression }
	- [x] CharacterClass :: [^ ClassContents ]
	- [x] NestedClass :: [^ ClassContents ]
	- [x] ClassSetRange :: ClassSetCharacter - ClassSetCharacter
- [x] Add `Span` to `Err(OxcDiagnostic::error())` calls
- [x] Perf improvement
	- [x] `Reader#peek()` should avoid `iter.next()` equivalent
	- [x] ~~Use `char` everywhere and split and push 2 surrogates(pair) for `Character`?~~
	- [x] ~~Try 1(+1) loop parsing for capturing groups?~~

## Follow up

- [x] @Boshen Test suite > #4242
  - [x] Investigate CI errors...
- Next...
  - Support ES2025 Duplicate named capturing groups?
  - Support ES20XX Stage3 Modifiers?
Base automatically changed from regexpp to main August 20, 2024 02:22
@Boshen
Copy link
Member Author

Boshen commented Aug 20, 2024

Continue in #4998

@Boshen Boshen closed this Aug 20, 2024
@Boshen Boshen deleted the regexp-tests branch August 20, 2024 07:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-parser Area - Parser A-prettier Area - Prettier A-transformer Area - Transformer / Transpiler
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants