Skip to content

Some tweaks for backtick strings #45099

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Aug 5, 2021

Conversation

elibarzilay
Copy link
Contributor

@elibarzilay elibarzilay commented Jul 19, 2021

  1. getRawLiteral(): barf if currentSourceFile is missing, since if
    it is, then the following getSourceTextOfNodeFromSourceFile will
    return a bogus "".

  2. One || -> ?? change.

  3. backtickQuoteEscapedCharsRegExp: escape the usual control
    characters except for a simple LF. This code does get used to
    generate backtick strings when rawText is not given, and not
    escaping things like TAB characters can get mangled by editor
    settings. Worse, not escaping a CRLF and putting it verbatim in sthe
    string source will interpret it as LF, so add a special case for
    escaping these as \r\n.
    Added test.

Related to #44313 and #40625.

@typescript-bot typescript-bot added Author: Team For Uncommitted Bug PR for untriaged, rejected, closed or missing bug labels Jul 19, 2021
@elibarzilay elibarzilay requested a review from rbuckton July 19, 2021 15:51
// Template strings should be preserved as much as possible
const backtickQuoteEscapedCharsRegExp = /[\\`]/g;
// Template strings preserve simple LF newlines, still encode CRLF (or CR)
const backtickQuoteEscapedCharsRegExp = /\r\n|[\\\`\u0000-\u001f\t\v\f\b\r\u2028\u2029\u0085]/g;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test for this change? I'm unclear on why we wouldn't want to preserve these. backtickQuoteEsacpedCharsRegExp is only used when the quote character is CharacterCodes.backtick, which mean's we're preserving the raw text of the template string as much as possible, so it seems like adding extra escapes would break anyone using the raw strings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I'll see if I can add a test for this.

  • Re the reason, I listed two reasons in the commit log:

    • Not escaping things like a TAB or a FF character can get mangled by editor style settings
    • Not escaping a CRLF newline and instead having the template literal have a CRLF will be interpreted as a regular newline, and result in a string with just LF. (It's even more likely to get mangled by your editor, but even if not you'll get that problem.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Not escaping things like a TAB or a FF character can get mangled by editor style settings

Yeah, I can see that being a possible issue, however I'm still not sure "<TAB>foo" should become `\tfoo` since it is still semantically different between the raw and cooked forms. There are only two ways to get to this code path: 1) Via a user-defined transformation, in which case we would want to preserve what was written, or 2) via one of our code fixes. If we want to normalize TAB or FF in the code fix, that's fine, but I'm concerned whether this would affect anyone using custom transforms. This is a small enough corner case its probably reasonable however, and we can address it if there's feedback.

  • Not escaping a CRLF newline and instead having the template literal have a CRLF will be interpreted as a regular newline, and result in a string with just LF. (It's even more likely to get mangled by your editor, but even if not you'll get that problem.)

StringLiteral can't contain a raw CRLF unless its part of a LineContinuation, which becomes an empty code sequence, per NOTE 2 of https://tc39.es/ecma262/#sec-literals-string-literals and in https://tc39.es/ecma262/#sec-static-semantics-sv, so that wouldn't matter in a StringLiteralTemplateLiteral conversion. The escape sequence \r\n in a string literal should remain as the escape sequence \r\n in the template literal, but we shouldn't just convert a raw CRLF into the escape sequence as that could impact custom transforms (which I could see taking a reliance on the CRLF->LF normalization of template literals).

So, -1 on converting CRLF into the escaped form, +1 on converting other characters (at least until someone raises an issue).

@elibarzilay elibarzilay force-pushed the backticks-etc branch 2 times, most recently from cddfe23 to 5c089c3 Compare July 31, 2021 03:59
@elibarzilay
Copy link
Contributor Author

@rbuckton, I added a test. The only place I saw this code used is when printing types, so that's what the test is using. (And the use of CRLF is to stress that this was actually broken before this change.)

Copy link
Contributor

@rbuckton rbuckton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% clear on how the test added tests the changes to the RegExp, since that test relies on an existing source file, from which we would pull the source text and wouldn't use backtickQuoteEscapedCharsRegExp. You would need to add a unit test that exercises the NodeFactory API to create and print a node with custom text/rawText to prove out this change.

// Template strings should be preserved as much as possible
const backtickQuoteEscapedCharsRegExp = /[\\`]/g;
// Template strings preserve simple LF newlines, still encode CRLF (or CR)
const backtickQuoteEscapedCharsRegExp = /\r\n|[\\\`\u0000-\u001f\t\v\f\b\r\u2028\u2029\u0085]/g;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Not escaping things like a TAB or a FF character can get mangled by editor style settings

Yeah, I can see that being a possible issue, however I'm still not sure "<TAB>foo" should become `\tfoo` since it is still semantically different between the raw and cooked forms. There are only two ways to get to this code path: 1) Via a user-defined transformation, in which case we would want to preserve what was written, or 2) via one of our code fixes. If we want to normalize TAB or FF in the code fix, that's fine, but I'm concerned whether this would affect anyone using custom transforms. This is a small enough corner case its probably reasonable however, and we can address it if there's feedback.

  • Not escaping a CRLF newline and instead having the template literal have a CRLF will be interpreted as a regular newline, and result in a string with just LF. (It's even more likely to get mangled by your editor, but even if not you'll get that problem.)

StringLiteral can't contain a raw CRLF unless its part of a LineContinuation, which becomes an empty code sequence, per NOTE 2 of https://tc39.es/ecma262/#sec-literals-string-literals and in https://tc39.es/ecma262/#sec-static-semantics-sv, so that wouldn't matter in a StringLiteralTemplateLiteral conversion. The escape sequence \r\n in a string literal should remain as the escape sequence \r\n in the template literal, but we shouldn't just convert a raw CRLF into the escape sequence as that could impact custom transforms (which I could see taking a reliance on the CRLF->LF normalization of template literals).

So, -1 on converting CRLF into the escaped form, +1 on converting other characters (at least until someone raises an issue).

@elibarzilay
Copy link
Contributor Author

I think that there's some confusion here. This is no longer related to the string literal code fix, since I made it hand out a rawText form in the other PR. The only place that I saw that this code affects is the output of string literal types, and that's what the test uses. Also, in the context of the type, I thought that it was expected to not always have a source code for the type since these strings are shuffled in the checker--?

In any case, if you run the same test on main, you'll see two differences:

  1. The .d.ts output

    declare const f: (hdr: string, val: number) => `${string}:<TAB>${number}
    `;
    

    has a literal TAB character (regardless of the code using \t) which is the possibly bad behavior; and the \r\n in the code turns into a literal CRLF which leads to broken output. In contrast, with my change, you get \t and \r\n respectively.

  2. The same thing happens in the error output from the file, which is not as important, but still confusing. (Plus, the fact that these strings get to error output further supports using escapes.)

If this is indeed the only way to exercise this code, then maybe escape plain LFs (as \ns) too? (Instead of the exception for \r\n)

1. `getRawLiteral()`: barf if `currentSourceFile` is missing, since if
   it is, then the following `getSourceTextOfNodeFromSourceFile` will
   return a bogus `""`.

2. One `||` -> `??` change.

3. `backtickQuoteEscapedCharsRegExp`: escape the usual control
   characters except for a simple LF.  This code does get used to
   generate backtick strings when `rawText` is not given, and not
   escaping things like TAB characters can get mangled by editor
   settings.  Worse, not escaping a CRLF and putting it verbatim in sthe
   string source will interpret it as LF, so add a special case for
   escaping these as `\r\n`.
   Added test.

Related to microsoft#44313 and microsoft#40625.
@rbuckton
Copy link
Contributor

rbuckton commented Aug 4, 2021

Do we not normalize CRLF in template literal types? If so, that seems like a big disparity with template literal expressions (but perhaps unavoidable?).

@rbuckton
Copy link
Contributor

rbuckton commented Aug 4, 2021

[...] and the \r\n in the code turns into a literal CRLF which leads to broken output. [...]

How does this break the output? Do we not handle CRLF when parsing template literal types?

@elibarzilay
Copy link
Contributor Author

Do we not normalize CRLF in template literal types?

I'm not sure what normalization you're talking about... The actual type has \r\n, and since there's no rawText for literals in types, then this code gets the CRLF and creates a (broken) rawText.

[...] and the \r\n in the code turns into a literal CRLF which leads to broken output. [...]

How does this break the output? Do we not handle CRLF when parsing template literal types?

The original type has \r\n, and the resulting dts line is:

declare const f: (hdr: string, val: number) => `${string}:	${number}
`;

so the CRLF from the type leads to a declaration with just a LF.

Copy link
Contributor

@rbuckton rbuckton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably need to add a breaking change notice for API consumers that only passing text to one of the factory methods to create template literals will have different escaping rules, and they can supply their own rawText if our escaping doesn't match what they need.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Author: Team For Uncommitted Bug PR for untriaged, rejected, closed or missing bug
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants