-
Notifications
You must be signed in to change notification settings - Fork 12.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Provide Syntax Checking for Regular Expressions #55600
Provide Syntax Checking for Regular Expressions #55600
Conversation
Looks like you're introducing a change to the public API surface area. If this includes breaking changes, please document them on our wiki's API Breaking Changes page. Also, please make sure @DanielRosenwasser and @RyanCavanaugh are aware of the changes, just as a heads up. |
be6b107
to
13ef487
Compare
I am not sure if there are conflicts between the licenses of each test262 test file and this repo. If that's the case, I will remove that commit. |
You just have to include the license, we have it in other places pretty sure. We don't distribute the tests so it's theoretically not a problem. |
@typescript-bot test top100 |
Heya @jakebailey, I've started to run the parallelized Definitely Typed test suite on this PR at 13ef487. You can monitor the build here. Update: The results are in! |
Heya @jakebailey, I've started to run the tarball bundle task on this PR at 13ef487. You can monitor the build here. |
Heya @jakebailey, I've started to run the regular perf test suite on this PR at 13ef487. You can monitor the build here. Update: The results are in! |
Heya @jakebailey, I've started to run the diff-based top-repos suite on this PR at 13ef487. You can monitor the build here. |
Heya @jakebailey, I've started to run the diff-based user code test suite (tsserver) on this PR at 13ef487. You can monitor the build here. Update: The results are in! |
Heya @jakebailey, I've started to run the diff-based top-repos suite (tsserver) on this PR at 13ef487. You can monitor the build here. Update: The results are in! |
Heya @jakebailey, I've started to run the diff-based user code test suite on this PR at 13ef487. You can monitor the build here. Update: The results are in! |
Hey @jakebailey, I've packed this into an installable tgz. You can install it for testing by referencing it in your
and then running |
@jakebailey Here are the results of running the user test suite comparing Everything looks good! |
@jakebailey Here are the results of running the user test suite comparing There were infrastructure failures potentially unrelated to your change:
Otherwise... Something interesting changed - please have a look. Details
|
@jakebailey Here they are:
CompilerComparison Report - baseline..pr
System info unknown
Hosts
Scenarios
tsserverComparison Report - baseline..pr
System info unknown
Hosts
Scenarios
StartupComparison Report - baseline..pr
System info unknown
Hosts
Scenarios
Developer Information: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea if this is something we actually want, but just leaving a couple of comments before I forget...
@jakebailey Here are some more interesting changes from running the top-repos suite Details
|
@jakebailey Here are the results of running the top-repos suite comparing Everything looks good! |
Hey @jakebailey, the results of running the DT tests are ready. Branch only errors:Package: ember__component
Package: semantic-ui-dropdown
Package: semantic-ui-search
Package: twig
Package: ember__component/v3
Package: html-replace-webpack-plugin
Package: layui-src
Package: mocha
Package: prismjs
Package: semantic-ui-api
Package: semantic-ui-tab
Package: webpack-manifest-plugin
Package: ali-app
Package: semantic-ui-progress
Package: connect-livereload
Package: director
|
13ef487
to
18ea3ec
Compare
18ea3ec
to
1a5228d
Compare
@jakebailey I just removed all the unwanted changes (including the unnecessary refactoring). Please ignore what you have previously reviewed since they are all reverted. |
src/compiler/types.ts
Outdated
@@ -7350,7 +7366,7 @@ export const enum ScriptKind { | |||
// NOTE: We must reevaluate the target for upcoming features when each successive TC39 edition is ratified in | |||
// June of each year. This includes changes to `LanguageFeatureMinimumTarget`, `ScriptTarget`, | |||
// transformers/esnext.ts, commandLineParser.ts, and the contents of each lib/esnext.*.d.ts file. | |||
export const enum ScriptTarget { | |||
export enum ScriptTarget { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unrelated and should be reverted; the const enum here is a perf optimization when using tsc emit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is required to get the name of the target for ts1501. Perhaps there’s a better way to do so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you definitely do not want to use the reverse mapping for anything except debugging in the TS repo. I'm not sure what we typically use to do this sort of message. I think we just hardcode them in the diagnostics but that doesn't sound fun. But we shouldn't use the reverse mapping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per @sheetalkamat, it looks like the thing to do is:
forEachEntry(targetOptionDeclaration.type, (value, key) => value === getEmitScriptTarget(options) ? key : undefined)
It would be nice to pull that code out into a utility function as we have copied that in at least two places. (Though, I wouldn't actually accept the options, but rather the target value itself; I assume that's what you need.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’ve done the refactor for you. (Better use reverse mapping though, I think)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But then all the target names have become lowercase and ES2015
is now es6
. Are they desirable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’ve done the refactor for you. (Better use reverse mapping though, I think)
The reverse mapping isn't guaranteed to be stable if two enum values end up with the same value, but maybe that's alright. In any case, we just can't turn this into a non-const enum without consequence. It may be possible to extract formatEnum
out of ts.Debug
, but that would require other changes and I think can come later.
But then all the target names have become lowercase and ES2015 is now es6. Are they desirable?
I don't feel strongly, but @DanielRosenwasser might. This does theoretically match other diagnostics (e.g. see the Parse empty options of --target.js
baseline). But, we're definitely inconsistent.
The es6
thing is the worst bit, and that comes down to the order of the array. We don't ever refer to to it as es6
in any diagnostic, only es2015
, so that's not great. Perhaps we need to iterate the list in reverse. Or, just hardcode a switch case like getDefaultLibFileName
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh sorry, what I just meant to say was the makeReverseMap
method in the scanner (I just realized this is a local function, perhaps we should move it to utilities)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If by makeReverseMap
, you mean that something that reads the enum object itself to build the mapping, I wouldn't want to do that. We have ts.Debug.formatEnum
which does that, but we don't use it for anything other than debugging or debug failures because it's not guaranteed to be human readable or consistent, just "good enough" for us to read when debugging or viewing stack traces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I mean that on targetOptionDeclaration.type
src/compiler/scanner.ts
Outdated
// Although nested character class is allowed in Unicode Sets mode, | ||
// an unescaped slash is nevertheless invalid even in character class in Unicode mode. | ||
// Thus we can simply ignore nested character class in the first pass. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess what's happening is you're giving up the opportunity for a better error in cases like this:
var x = /[[\x00-\xff]&&/]/v;
Could you at least add an example into the comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be relatively easy to recover better if you just swapped inCharacterClass
to be a stack counter. But maybe others would prefer you did that in a follow-up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I once tried it but I can’t remember the rationale as to why I didn’t adopt it. I just tried it again and it really seems to give better error messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know why now. /[[]/
is a totally valid regex. If we did that it would break all expressions like /[[]/.test("[")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the example you gave, [/[[\x00-\xff]&&/]/v
is already a valid JavaScript expression if v
is defined.
@@ -553,8 +597,13 @@ function isHexDigit(ch: number): boolean { | |||
return isDigit(ch) || ch >= CharacterCodes.A && ch <= CharacterCodes.F || ch >= CharacterCodes.a && ch <= CharacterCodes.f; | |||
} | |||
|
|||
function isCodePoint(code: number): boolean { | |||
return code <= 0x10FFFF; | |||
function isASCIILetter(ch: number): boolean { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: isAsciiCharacter
or isASCIICharacter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this really just cover letters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isASCIIAlpha
is maybe less ambiguous?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, OK. This is essentially /[\p{ASCII}&&\p{Letter}]/v
though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P.S. IMO isDigit
should be renamed to isASCIIDigit
too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some concerns about how the code is structured. It seems to put the responsibilities for parsing the regular expression grammar in the scanner, rather than in the parser where it belongs.
If you're goal is to extend this capability further in the future, then we're eventually going to want to use AST nodes rather than a bespoke API associated with RegularExpressionLiteral
so that we can eventually leverage other language service features, e.g., for renaming groups, etc. Keeping the actual parsing logic in the parser would facilitate that change.
Also, since createScanner
is part of our public API, it's entirely likely that someone is using it purely to scan portions of a source file. Performing a full parse of a RegExp during scan seems like unnecessary overhead for those scenarios.
@@ -553,8 +597,13 @@ function isHexDigit(ch: number): boolean { | |||
return isDigit(ch) || ch >= CharacterCodes.A && ch <= CharacterCodes.F || ch >= CharacterCodes.a && ch <= CharacterCodes.f; | |||
} | |||
|
|||
function isCodePoint(code: number): boolean { | |||
return code <= 0x10FFFF; | |||
function isASCIILetter(ch: number): boolean { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isASCIIAlpha
is maybe less ambiguous?
} | ||
} | ||
|
||
// Alternative ::= Term* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest that if we're going to include the grammar in a comment that we use the grammarkdown
-format grammar from the spec itself as it makes it easier to compare against the spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually did some sort of “elaboration” of the spec such as inlining intermediate productions instead of merely copying, aiming to make it more easily understandable at a glance
Both have their own advantages I believe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind BNF per se, it's more to keep things consistent and to be able to more easily eyeball spec changes as the RegExp grammar evolves over time.
const namedCapturingGroups: Set<string>[] = []; | ||
|
||
// Disjunction ::= Alternative ('|' Alternative)* | ||
function scanDisjunction(isInGroup: boolean) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to nest these functions? All of these nested functions inside scanRegularExpressionWorker
are closures and thus are essentially new objects allocated every time this function is called. These, along with scanRegularExpressionWorker
itself should be lifted out of reScanSlashToken()
if possible as it's a lot of unnecessary overhead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was just aiming to group the functions logically, otherwise we will need to prefix all functions with scanRegExp
or move them to a new file instead. I’ll lift them out right before the PR is ready to ship for easier reviewing.
Edit: I’ll lift them out in a follow-up PR.
@typescript-bot perf test this |
@DanielRosenwasser Here they are:
tscComparison Report - baseline..pr
System info unknown
Hosts
Scenarios
tsserverComparison Report - baseline..pr
System info unknown
Hosts
Scenarios
startupComparison Report - baseline..pr
System info unknown
Hosts
Scenarios
Developer Information: |
So talking to a few people on our team, we still want to do some refactoring here and possibly move the logic to the parser or another component. On top of that, there may be a noticeable perf impact from this PR; however, we want this in for TypeScript 5.5 beta, and we're going to try to optimize after merging. If there is a considerable overhead, we will have to evaluate if this feature justifies the cost; but for now we are eager to push ahead on this and will merge. Thank you for the contribution and all the work you've put into this PR! |
I have resolved some of the conversations since this is merged. |
This PR provides syntax checks to regular expressions. It can also be considered a follow-up of #51837 where escapes in RegExp are still not being addressed.
It introduces 36 new diagnostic messages for errors not previously covered.
Although this comment suggested that errors are better thrown regardless of flags to avoid rescanning, the parsing behavior of the new Unicode Sets (v) mode in character class is totally different from that of the non-v modes. So I just reuse the old code to quickly skip to the end of the regex for recording the flags beforehand.
Besides the Set Notation support, the PR also includes the following Stage 3 (at the time of writing) proposals:
Additionally, some test cases from test262 are added to make up both this PR and #51837.Reverted due to excessive amount of files.
Fixes #3432 (Fixes #54744)
Follow-up Proposal
Proposed changes to be implemented in upcoming PRs
This PR provides a basis for making further changes to achieve additional type safety for the
RegExp
interface.A part of them is given below:
Usage
No more runtime errors on
match.groups.foo.length
andmatch[3].length
!