-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for parsing f-string as per PEP 701 #7041
Conversation
Current dependencies on/for this PR:
This comment was auto-generated by Graphite. |
CodSpeed Performance ReportMerging #7041 will degrade performances by 9.7%Falling back to comparing Summary
Benchmarks breakdown
|
e773043
to
15887c9
Compare
c51a718
to
79e1979
Compare
15887c9
to
a42b40c
Compare
79e1979
to
7660dc6
Compare
a42b40c
to
f1fd87c
Compare
7660dc6
to
6069bba
Compare
f1fd87c
to
9c65897
Compare
ff7024a
to
d34b4d3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! Overall this is looking good to me. It will be interesting to get some ecosystem checks once ruff compiles.
Most of my comments are nits but I think there's potential to clean up string.rs
further (and hopefully improving performance at the same time). I leave it up to you if you want to do this as part of this or a follow up PR.
/// assert!(expr.is_ok()); | ||
/// ``` | ||
pub fn parse_tokens( | ||
lxr: impl IntoIterator<Item = LexResult>, | ||
source: &str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want to start grouping source
, mode
and source_path
by an abstraction but we don't have to do this as part of this PR. But we should probably do it before introducing more arguments.
// This is to account for the empty `FStringMiddle` token that is created | ||
// to check for non-parenthesized lambda expressions. | ||
if source.is_empty() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you expand this comment with an example and why it is necessary? I know we talked about it on Discord but new and external maintainers won't know about the conversation or don't have access to it.
Overall: this is unfortunate because it requires moving all parts into a new Vec (flatten, then collect). I wonder if a dedicated token would be a better solution
for parenthesized_expr in strings { | ||
match parenthesized_expr.expr { | ||
Expr::Constant(ast::ExprConstant { | ||
value: Constant::Bytes(BytesConstant { value, .. }), | ||
.. | ||
}) => content.extend(value), | ||
_ => unreachable!("Unexpected non-bytes expression."), | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should add a fast path (for the whole method) that handles the most common case of a single string:
- No need to test if it is a mixed byte / non-byte concatenation
- No need to copy the value (value.into_bytes() or use the original string)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The infinite loop for something that succesfully fails in the unit test is strange
1f5c3ea
to
b7a3c51
Compare
776dd2a
to
84d5d9b
Compare
b7a3c51
to
60ead65
Compare
84d5d9b
to
1977e2f
Compare
This will be used to extract the `leading` and `trailing` text for f-string debug expressions.
c4da178
to
adb7a05
Compare
8f2c36c
to
19653a3
Compare
Hey, thanks for the comment. We're in the final phase of making the necessary changes across other parts of the codebase (linter and formatter) and then we'll merge this. |
## Summary This PR adds support for PEP 701 in the parser to use the new tokens emitted by the lexer to construct the f-string node. ### Grammar Without an official grammar, the f-strings were parsed manually. Now that we've the specification, that is being used in the LALRPOP to parse the f-strings. ### `string.rs` This file includes the logic for parsing string literals and joining the implicit string concatenation. Now that we don't require parsing f-strings manually a lot of code involving the same is removed. Earlier, there were 2 entry points to this module: * `parse_string`: Used to parse a single string literal * `parse_strings`: Used to parse strings which were implicitly concatenated Now, there are 3 entry points: * `parse_string_literal`: Renamed from `parse_string` * `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is basically a string literal without the quotes * `concatenate_strings`: Renamed from `parse_strings` but now it takes the parsed nodes instead. So, we just need to concatenate them into a single node. > A short primer on `FStringMiddle` token: This includes the portion of text inside the f-string that's not part of the expression and isn't an opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the `foo `, `.3f` and ` bar` are `FStringMiddle` token content. ### `Constant::kind` changed in the AST ***Discussion in the official implementation: python/cpython#102855 (comment) This change in the AST is when unicode strings (prefixed with `u`) and f-strings are used in an implicitly concatenated string value. For example, ```python u"foo" f"{bar}" "baz" " some" ``` Pre Python 3.12, the kind field would be assigned only if the prefix was on the first string. So, taking the above example, both `"foo"` and `"baz some"` (implicit concatenation) would be given the `u` kind: <details><summary>Pre 3.12 AST:</summary> <p> ```python Constant(value='foo', kind='u'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='baz some', kind='u') ``` </p> </details> But, post Python 3.12, only the string with the `u` prefix will be assigned the value: <details><summary>Pre 3.12 AST:</summary> <p> ```python Constant(value='foo', kind='u'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='baz some') ``` </p> </details> Here are some more iterations around the change: 1. `"foo" f"{bar}" u"baz" "no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno', kind='u') ``` </p> </details> 2. `"foo" f"{bar}" "baz" u"no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> 3. `u"foo" f"bar {baz} realy" u"bar" "no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foobar ', kind='u'), FormattedValue( value=Name(id='baz', ctx=Load()), conversion=-1), Constant(value=' realybarno', kind='u') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foobar ', kind='u'), FormattedValue( value=Name(id='baz', ctx=Load()), conversion=-1), Constant(value=' realybarno') ``` </p> </details> ### Errors With the hand written parser, we were able to provide better error messages in case of any errors such as the following but now they all are removed and in those cases an "unexpected token" error will be thrown by lalrpop: * A closing delimiter was not opened properly * An opening delimiter was not closed properly * Empty expression not allowed The "Too many nested expressions in an f-string" was removed and instead we can create a lint rule for that. And, "The f-string expression cannot include the given character" was removed because f-strings now support those characters which are mainly same quotes as the outer ones, escape sequences, comments, etc. ## Test Plan 1. Refactor existing test cases to use `parse_suite` instead of `parse_fstrings` (doesn't exists anymore) 2. Additional test cases are added as required Updated the snapshots. The change from `parse_fstrings` to `parse_suite` means that the snapshot would produce the module node instead of just a list of f-string parts. I've manually verified that the parts are still the same along with the node ranges. ## Benchmarks #7263 (comment) fixes: #7043 fixes: #6835
## Summary This PR adds support for PEP 701 in the parser to use the new tokens emitted by the lexer to construct the f-string node. ### Grammar Without an official grammar, the f-strings were parsed manually. Now that we've the specification, that is being used in the LALRPOP to parse the f-strings. ### `string.rs` This file includes the logic for parsing string literals and joining the implicit string concatenation. Now that we don't require parsing f-strings manually a lot of code involving the same is removed. Earlier, there were 2 entry points to this module: * `parse_string`: Used to parse a single string literal * `parse_strings`: Used to parse strings which were implicitly concatenated Now, there are 3 entry points: * `parse_string_literal`: Renamed from `parse_string` * `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is basically a string literal without the quotes * `concatenate_strings`: Renamed from `parse_strings` but now it takes the parsed nodes instead. So, we just need to concatenate them into a single node. > A short primer on `FStringMiddle` token: This includes the portion of text inside the f-string that's not part of the expression and isn't an opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the `foo `, `.3f` and ` bar` are `FStringMiddle` token content. ### `Constant::kind` changed in the AST ***Discussion in the official implementation: python/cpython#102855 (comment) This change in the AST is when unicode strings (prefixed with `u`) and f-strings are used in an implicitly concatenated string value. For example, ```python u"foo" f"{bar}" "baz" " some" ``` Pre Python 3.12, the kind field would be assigned only if the prefix was on the first string. So, taking the above example, both `"foo"` and `"baz some"` (implicit concatenation) would be given the `u` kind: <details><summary>Pre 3.12 AST:</summary> <p> ```python Constant(value='foo', kind='u'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='baz some', kind='u') ``` </p> </details> But, post Python 3.12, only the string with the `u` prefix will be assigned the value: <details><summary>Pre 3.12 AST:</summary> <p> ```python Constant(value='foo', kind='u'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='baz some') ``` </p> </details> Here are some more iterations around the change: 1. `"foo" f"{bar}" u"baz" "no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno', kind='u') ``` </p> </details> 2. `"foo" f"{bar}" "baz" u"no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> 3. `u"foo" f"bar {baz} realy" u"bar" "no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foobar ', kind='u'), FormattedValue( value=Name(id='baz', ctx=Load()), conversion=-1), Constant(value=' realybarno', kind='u') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foobar ', kind='u'), FormattedValue( value=Name(id='baz', ctx=Load()), conversion=-1), Constant(value=' realybarno') ``` </p> </details> ### Errors With the hand written parser, we were able to provide better error messages in case of any errors such as the following but now they all are removed and in those cases an "unexpected token" error will be thrown by lalrpop: * A closing delimiter was not opened properly * An opening delimiter was not closed properly * Empty expression not allowed The "Too many nested expressions in an f-string" was removed and instead we can create a lint rule for that. And, "The f-string expression cannot include the given character" was removed because f-strings now support those characters which are mainly same quotes as the outer ones, escape sequences, comments, etc. ## Test Plan 1. Refactor existing test cases to use `parse_suite` instead of `parse_fstrings` (doesn't exists anymore) 2. Additional test cases are added as required Updated the snapshots. The change from `parse_fstrings` to `parse_suite` means that the snapshot would produce the module node instead of just a list of f-string parts. I've manually verified that the parts are still the same along with the node ranges. ## Benchmarks #7263 (comment) fixes: #7043 fixes: #6835
## Summary This PR adds support for PEP 701 in the parser to use the new tokens emitted by the lexer to construct the f-string node. ### Grammar Without an official grammar, the f-strings were parsed manually. Now that we've the specification, that is being used in the LALRPOP to parse the f-strings. ### `string.rs` This file includes the logic for parsing string literals and joining the implicit string concatenation. Now that we don't require parsing f-strings manually a lot of code involving the same is removed. Earlier, there were 2 entry points to this module: * `parse_string`: Used to parse a single string literal * `parse_strings`: Used to parse strings which were implicitly concatenated Now, there are 3 entry points: * `parse_string_literal`: Renamed from `parse_string` * `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is basically a string literal without the quotes * `concatenate_strings`: Renamed from `parse_strings` but now it takes the parsed nodes instead. So, we just need to concatenate them into a single node. > A short primer on `FStringMiddle` token: This includes the portion of text inside the f-string that's not part of the expression and isn't an opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the `foo `, `.3f` and ` bar` are `FStringMiddle` token content. ### `Constant::kind` changed in the AST ***Discussion in the official implementation: python/cpython#102855 (comment) This change in the AST is when unicode strings (prefixed with `u`) and f-strings are used in an implicitly concatenated string value. For example, ```python u"foo" f"{bar}" "baz" " some" ``` Pre Python 3.12, the kind field would be assigned only if the prefix was on the first string. So, taking the above example, both `"foo"` and `"baz some"` (implicit concatenation) would be given the `u` kind: <details><summary>Pre 3.12 AST:</summary> <p> ```python Constant(value='foo', kind='u'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='baz some', kind='u') ``` </p> </details> But, post Python 3.12, only the string with the `u` prefix will be assigned the value: <details><summary>Pre 3.12 AST:</summary> <p> ```python Constant(value='foo', kind='u'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='baz some') ``` </p> </details> Here are some more iterations around the change: 1. `"foo" f"{bar}" u"baz" "no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno', kind='u') ``` </p> </details> 2. `"foo" f"{bar}" "baz" u"no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> 3. `u"foo" f"bar {baz} realy" u"bar" "no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foobar ', kind='u'), FormattedValue( value=Name(id='baz', ctx=Load()), conversion=-1), Constant(value=' realybarno', kind='u') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foobar ', kind='u'), FormattedValue( value=Name(id='baz', ctx=Load()), conversion=-1), Constant(value=' realybarno') ``` </p> </details> ### Errors With the hand written parser, we were able to provide better error messages in case of any errors such as the following but now they all are removed and in those cases an "unexpected token" error will be thrown by lalrpop: * A closing delimiter was not opened properly * An opening delimiter was not closed properly * Empty expression not allowed The "Too many nested expressions in an f-string" was removed and instead we can create a lint rule for that. And, "The f-string expression cannot include the given character" was removed because f-strings now support those characters which are mainly same quotes as the outer ones, escape sequences, comments, etc. ## Test Plan 1. Refactor existing test cases to use `parse_suite` instead of `parse_fstrings` (doesn't exists anymore) 2. Additional test cases are added as required Updated the snapshots. The change from `parse_fstrings` to `parse_suite` means that the snapshot would produce the module node instead of just a list of f-string parts. I've manually verified that the parts are still the same along with the node ranges. ## Benchmarks #7263 (comment) fixes: #7043 fixes: #6835
This PR adds support for PEP 701 in the parser to use the new tokens emitted by the lexer to construct the f-string node. Without an official grammar, the f-strings were parsed manually. Now that we've the specification, that is being used in the LALRPOP to parse the f-strings. This file includes the logic for parsing string literals and joining the implicit string concatenation. Now that we don't require parsing f-strings manually a lot of code involving the same is removed. Earlier, there were 2 entry points to this module: * `parse_string`: Used to parse a single string literal * `parse_strings`: Used to parse strings which were implicitly concatenated Now, there are 3 entry points: * `parse_string_literal`: Renamed from `parse_string` * `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is basically a string literal without the quotes * `concatenate_strings`: Renamed from `parse_strings` but now it takes the parsed nodes instead. So, we just need to concatenate them into a single node. > A short primer on `FStringMiddle` token: This includes the portion of text inside the f-string that's not part of the expression and isn't an opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the `foo `, `.3f` and ` bar` are `FStringMiddle` token content. ***Discussion in the official implementation: python/cpython#102855 (comment) This change in the AST is when unicode strings (prefixed with `u`) and f-strings are used in an implicitly concatenated string value. For example, ```python u"foo" f"{bar}" "baz" " some" ``` Pre Python 3.12, the kind field would be assigned only if the prefix was on the first string. So, taking the above example, both `"foo"` and `"baz some"` (implicit concatenation) would be given the `u` kind: <details><summary>Pre 3.12 AST:</summary> <p> ```python Constant(value='foo', kind='u'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='baz some', kind='u') ``` </p> </details> But, post Python 3.12, only the string with the `u` prefix will be assigned the value: <details><summary>Pre 3.12 AST:</summary> <p> ```python Constant(value='foo', kind='u'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='baz some') ``` </p> </details> Here are some more iterations around the change: 1. `"foo" f"{bar}" u"baz" "no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno', kind='u') ``` </p> </details> 2. `"foo" f"{bar}" "baz" u"no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> 3. `u"foo" f"bar {baz} realy" u"bar" "no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foobar ', kind='u'), FormattedValue( value=Name(id='baz', ctx=Load()), conversion=-1), Constant(value=' realybarno', kind='u') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foobar ', kind='u'), FormattedValue( value=Name(id='baz', ctx=Load()), conversion=-1), Constant(value=' realybarno') ``` </p> </details> With the hand written parser, we were able to provide better error messages in case of any errors such as the following but now they all are removed and in those cases an "unexpected token" error will be thrown by lalrpop: * A closing delimiter was not opened properly * An opening delimiter was not closed properly * Empty expression not allowed The "Too many nested expressions in an f-string" was removed and instead we can create a lint rule for that. And, "The f-string expression cannot include the given character" was removed because f-strings now support those characters which are mainly same quotes as the outer ones, escape sequences, comments, etc. 1. Refactor existing test cases to use `parse_suite` instead of `parse_fstrings` (doesn't exists anymore) 2. Additional test cases are added as required Updated the snapshots. The change from `parse_fstrings` to `parse_suite` means that the snapshot would produce the module node instead of just a list of f-string parts. I've manually verified that the parts are still the same along with the node ranges. #7263 (comment) fixes: #7043 fixes: #6835
This PR adds support for PEP 701 in the parser to use the new tokens emitted by the lexer to construct the f-string node. Without an official grammar, the f-strings were parsed manually. Now that we've the specification, that is being used in the LALRPOP to parse the f-strings. This file includes the logic for parsing string literals and joining the implicit string concatenation. Now that we don't require parsing f-strings manually a lot of code involving the same is removed. Earlier, there were 2 entry points to this module: * `parse_string`: Used to parse a single string literal * `parse_strings`: Used to parse strings which were implicitly concatenated Now, there are 3 entry points: * `parse_string_literal`: Renamed from `parse_string` * `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is basically a string literal without the quotes * `concatenate_strings`: Renamed from `parse_strings` but now it takes the parsed nodes instead. So, we just need to concatenate them into a single node. > A short primer on `FStringMiddle` token: This includes the portion of text inside the f-string that's not part of the expression and isn't an opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the `foo `, `.3f` and ` bar` are `FStringMiddle` token content. ***Discussion in the official implementation: python/cpython#102855 (comment) This change in the AST is when unicode strings (prefixed with `u`) and f-strings are used in an implicitly concatenated string value. For example, ```python u"foo" f"{bar}" "baz" " some" ``` Pre Python 3.12, the kind field would be assigned only if the prefix was on the first string. So, taking the above example, both `"foo"` and `"baz some"` (implicit concatenation) would be given the `u` kind: <details><summary>Pre 3.12 AST:</summary> <p> ```python Constant(value='foo', kind='u'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='baz some', kind='u') ``` </p> </details> But, post Python 3.12, only the string with the `u` prefix will be assigned the value: <details><summary>Pre 3.12 AST:</summary> <p> ```python Constant(value='foo', kind='u'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='baz some') ``` </p> </details> Here are some more iterations around the change: 1. `"foo" f"{bar}" u"baz" "no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno', kind='u') ``` </p> </details> 2. `"foo" f"{bar}" "baz" u"no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> 3. `u"foo" f"bar {baz} realy" u"bar" "no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foobar ', kind='u'), FormattedValue( value=Name(id='baz', ctx=Load()), conversion=-1), Constant(value=' realybarno', kind='u') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foobar ', kind='u'), FormattedValue( value=Name(id='baz', ctx=Load()), conversion=-1), Constant(value=' realybarno') ``` </p> </details> With the hand written parser, we were able to provide better error messages in case of any errors such as the following but now they all are removed and in those cases an "unexpected token" error will be thrown by lalrpop: * A closing delimiter was not opened properly * An opening delimiter was not closed properly * Empty expression not allowed The "Too many nested expressions in an f-string" was removed and instead we can create a lint rule for that. And, "The f-string expression cannot include the given character" was removed because f-strings now support those characters which are mainly same quotes as the outer ones, escape sequences, comments, etc. 1. Refactor existing test cases to use `parse_suite` instead of `parse_fstrings` (doesn't exists anymore) 2. Additional test cases are added as required Updated the snapshots. The change from `parse_fstrings` to `parse_suite` means that the snapshot would produce the module node instead of just a list of f-string parts. I've manually verified that the parts are still the same along with the node ranges. #7263 (comment) fixes: #7043 fixes: #6835
This PR adds support for PEP 701 in the parser to use the new tokens emitted by the lexer to construct the f-string node. Without an official grammar, the f-strings were parsed manually. Now that we've the specification, that is being used in the LALRPOP to parse the f-strings. This file includes the logic for parsing string literals and joining the implicit string concatenation. Now that we don't require parsing f-strings manually a lot of code involving the same is removed. Earlier, there were 2 entry points to this module: * `parse_string`: Used to parse a single string literal * `parse_strings`: Used to parse strings which were implicitly concatenated Now, there are 3 entry points: * `parse_string_literal`: Renamed from `parse_string` * `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is basically a string literal without the quotes * `concatenate_strings`: Renamed from `parse_strings` but now it takes the parsed nodes instead. So, we just need to concatenate them into a single node. > A short primer on `FStringMiddle` token: This includes the portion of text inside the f-string that's not part of the expression and isn't an opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the `foo `, `.3f` and ` bar` are `FStringMiddle` token content. ***Discussion in the official implementation: python/cpython#102855 (comment) This change in the AST is when unicode strings (prefixed with `u`) and f-strings are used in an implicitly concatenated string value. For example, ```python u"foo" f"{bar}" "baz" " some" ``` Pre Python 3.12, the kind field would be assigned only if the prefix was on the first string. So, taking the above example, both `"foo"` and `"baz some"` (implicit concatenation) would be given the `u` kind: <details><summary>Pre 3.12 AST:</summary> <p> ```python Constant(value='foo', kind='u'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='baz some', kind='u') ``` </p> </details> But, post Python 3.12, only the string with the `u` prefix will be assigned the value: <details><summary>Pre 3.12 AST:</summary> <p> ```python Constant(value='foo', kind='u'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='baz some') ``` </p> </details> Here are some more iterations around the change: 1. `"foo" f"{bar}" u"baz" "no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno', kind='u') ``` </p> </details> 2. `"foo" f"{bar}" "baz" u"no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> 3. `u"foo" f"bar {baz} realy" u"bar" "no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foobar ', kind='u'), FormattedValue( value=Name(id='baz', ctx=Load()), conversion=-1), Constant(value=' realybarno', kind='u') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foobar ', kind='u'), FormattedValue( value=Name(id='baz', ctx=Load()), conversion=-1), Constant(value=' realybarno') ``` </p> </details> With the hand written parser, we were able to provide better error messages in case of any errors such as the following but now they all are removed and in those cases an "unexpected token" error will be thrown by lalrpop: * A closing delimiter was not opened properly * An opening delimiter was not closed properly * Empty expression not allowed The "Too many nested expressions in an f-string" was removed and instead we can create a lint rule for that. And, "The f-string expression cannot include the given character" was removed because f-strings now support those characters which are mainly same quotes as the outer ones, escape sequences, comments, etc. 1. Refactor existing test cases to use `parse_suite` instead of `parse_fstrings` (doesn't exists anymore) 2. Additional test cases are added as required Updated the snapshots. The change from `parse_fstrings` to `parse_suite` means that the snapshot would produce the module node instead of just a list of f-string parts. I've manually verified that the parts are still the same along with the node ranges. #7263 (comment) fixes: #7043 fixes: #6835
This PR adds support for PEP 701 in the parser to use the new tokens emitted by the lexer to construct the f-string node. Without an official grammar, the f-strings were parsed manually. Now that we've the specification, that is being used in the LALRPOP to parse the f-strings. This file includes the logic for parsing string literals and joining the implicit string concatenation. Now that we don't require parsing f-strings manually a lot of code involving the same is removed. Earlier, there were 2 entry points to this module: * `parse_string`: Used to parse a single string literal * `parse_strings`: Used to parse strings which were implicitly concatenated Now, there are 3 entry points: * `parse_string_literal`: Renamed from `parse_string` * `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is basically a string literal without the quotes * `concatenate_strings`: Renamed from `parse_strings` but now it takes the parsed nodes instead. So, we just need to concatenate them into a single node. > A short primer on `FStringMiddle` token: This includes the portion of text inside the f-string that's not part of the expression and isn't an opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the `foo `, `.3f` and ` bar` are `FStringMiddle` token content. ***Discussion in the official implementation: python/cpython#102855 (comment) This change in the AST is when unicode strings (prefixed with `u`) and f-strings are used in an implicitly concatenated string value. For example, ```python u"foo" f"{bar}" "baz" " some" ``` Pre Python 3.12, the kind field would be assigned only if the prefix was on the first string. So, taking the above example, both `"foo"` and `"baz some"` (implicit concatenation) would be given the `u` kind: <details><summary>Pre 3.12 AST:</summary> <p> ```python Constant(value='foo', kind='u'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='baz some', kind='u') ``` </p> </details> But, post Python 3.12, only the string with the `u` prefix will be assigned the value: <details><summary>Pre 3.12 AST:</summary> <p> ```python Constant(value='foo', kind='u'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='baz some') ``` </p> </details> Here are some more iterations around the change: 1. `"foo" f"{bar}" u"baz" "no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno', kind='u') ``` </p> </details> 2. `"foo" f"{bar}" "baz" u"no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> 3. `u"foo" f"bar {baz} realy" u"bar" "no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foobar ', kind='u'), FormattedValue( value=Name(id='baz', ctx=Load()), conversion=-1), Constant(value=' realybarno', kind='u') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foobar ', kind='u'), FormattedValue( value=Name(id='baz', ctx=Load()), conversion=-1), Constant(value=' realybarno') ``` </p> </details> With the hand written parser, we were able to provide better error messages in case of any errors such as the following but now they all are removed and in those cases an "unexpected token" error will be thrown by lalrpop: * A closing delimiter was not opened properly * An opening delimiter was not closed properly * Empty expression not allowed The "Too many nested expressions in an f-string" was removed and instead we can create a lint rule for that. And, "The f-string expression cannot include the given character" was removed because f-strings now support those characters which are mainly same quotes as the outer ones, escape sequences, comments, etc. 1. Refactor existing test cases to use `parse_suite` instead of `parse_fstrings` (doesn't exists anymore) 2. Additional test cases are added as required Updated the snapshots. The change from `parse_fstrings` to `parse_suite` means that the snapshot would produce the module node instead of just a list of f-string parts. I've manually verified that the parts are still the same along with the node ranges. #7263 (comment) fixes: #7043 fixes: #6835
This PR adds support for PEP 701 in the parser to use the new tokens emitted by the lexer to construct the f-string node. Without an official grammar, the f-strings were parsed manually. Now that we've the specification, that is being used in the LALRPOP to parse the f-strings. This file includes the logic for parsing string literals and joining the implicit string concatenation. Now that we don't require parsing f-strings manually a lot of code involving the same is removed. Earlier, there were 2 entry points to this module: * `parse_string`: Used to parse a single string literal * `parse_strings`: Used to parse strings which were implicitly concatenated Now, there are 3 entry points: * `parse_string_literal`: Renamed from `parse_string` * `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is basically a string literal without the quotes * `concatenate_strings`: Renamed from `parse_strings` but now it takes the parsed nodes instead. So, we just need to concatenate them into a single node. > A short primer on `FStringMiddle` token: This includes the portion of text inside the f-string that's not part of the expression and isn't an opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the `foo `, `.3f` and ` bar` are `FStringMiddle` token content. ***Discussion in the official implementation: python/cpython#102855 (comment) This change in the AST is when unicode strings (prefixed with `u`) and f-strings are used in an implicitly concatenated string value. For example, ```python u"foo" f"{bar}" "baz" " some" ``` Pre Python 3.12, the kind field would be assigned only if the prefix was on the first string. So, taking the above example, both `"foo"` and `"baz some"` (implicit concatenation) would be given the `u` kind: <details><summary>Pre 3.12 AST:</summary> <p> ```python Constant(value='foo', kind='u'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='baz some', kind='u') ``` </p> </details> But, post Python 3.12, only the string with the `u` prefix will be assigned the value: <details><summary>Pre 3.12 AST:</summary> <p> ```python Constant(value='foo', kind='u'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='baz some') ``` </p> </details> Here are some more iterations around the change: 1. `"foo" f"{bar}" u"baz" "no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno', kind='u') ``` </p> </details> 2. `"foo" f"{bar}" "baz" u"no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> 3. `u"foo" f"bar {baz} realy" u"bar" "no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foobar ', kind='u'), FormattedValue( value=Name(id='baz', ctx=Load()), conversion=-1), Constant(value=' realybarno', kind='u') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foobar ', kind='u'), FormattedValue( value=Name(id='baz', ctx=Load()), conversion=-1), Constant(value=' realybarno') ``` </p> </details> With the hand written parser, we were able to provide better error messages in case of any errors such as the following but now they all are removed and in those cases an "unexpected token" error will be thrown by lalrpop: * A closing delimiter was not opened properly * An opening delimiter was not closed properly * Empty expression not allowed The "Too many nested expressions in an f-string" was removed and instead we can create a lint rule for that. And, "The f-string expression cannot include the given character" was removed because f-strings now support those characters which are mainly same quotes as the outer ones, escape sequences, comments, etc. 1. Refactor existing test cases to use `parse_suite` instead of `parse_fstrings` (doesn't exists anymore) 2. Additional test cases are added as required Updated the snapshots. The change from `parse_fstrings` to `parse_suite` means that the snapshot would produce the module node instead of just a list of f-string parts. I've manually verified that the parts are still the same along with the node ranges. #7263 (comment) fixes: #7043 fixes: #6835
This PR adds support for PEP 701 in the parser to use the new tokens emitted by the lexer to construct the f-string node. Without an official grammar, the f-strings were parsed manually. Now that we've the specification, that is being used in the LALRPOP to parse the f-strings. This file includes the logic for parsing string literals and joining the implicit string concatenation. Now that we don't require parsing f-strings manually a lot of code involving the same is removed. Earlier, there were 2 entry points to this module: * `parse_string`: Used to parse a single string literal * `parse_strings`: Used to parse strings which were implicitly concatenated Now, there are 3 entry points: * `parse_string_literal`: Renamed from `parse_string` * `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is basically a string literal without the quotes * `concatenate_strings`: Renamed from `parse_strings` but now it takes the parsed nodes instead. So, we just need to concatenate them into a single node. > A short primer on `FStringMiddle` token: This includes the portion of text inside the f-string that's not part of the expression and isn't an opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the `foo `, `.3f` and ` bar` are `FStringMiddle` token content. ***Discussion in the official implementation: python/cpython#102855 (comment) This change in the AST is when unicode strings (prefixed with `u`) and f-strings are used in an implicitly concatenated string value. For example, ```python u"foo" f"{bar}" "baz" " some" ``` Pre Python 3.12, the kind field would be assigned only if the prefix was on the first string. So, taking the above example, both `"foo"` and `"baz some"` (implicit concatenation) would be given the `u` kind: <details><summary>Pre 3.12 AST:</summary> <p> ```python Constant(value='foo', kind='u'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='baz some', kind='u') ``` </p> </details> But, post Python 3.12, only the string with the `u` prefix will be assigned the value: <details><summary>Pre 3.12 AST:</summary> <p> ```python Constant(value='foo', kind='u'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='baz some') ``` </p> </details> Here are some more iterations around the change: 1. `"foo" f"{bar}" u"baz" "no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno', kind='u') ``` </p> </details> 2. `"foo" f"{bar}" "baz" u"no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> 3. `u"foo" f"bar {baz} realy" u"bar" "no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foobar ', kind='u'), FormattedValue( value=Name(id='baz', ctx=Load()), conversion=-1), Constant(value=' realybarno', kind='u') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foobar ', kind='u'), FormattedValue( value=Name(id='baz', ctx=Load()), conversion=-1), Constant(value=' realybarno') ``` </p> </details> With the hand written parser, we were able to provide better error messages in case of any errors such as the following but now they all are removed and in those cases an "unexpected token" error will be thrown by lalrpop: * A closing delimiter was not opened properly * An opening delimiter was not closed properly * Empty expression not allowed The "Too many nested expressions in an f-string" was removed and instead we can create a lint rule for that. And, "The f-string expression cannot include the given character" was removed because f-strings now support those characters which are mainly same quotes as the outer ones, escape sequences, comments, etc. 1. Refactor existing test cases to use `parse_suite` instead of `parse_fstrings` (doesn't exists anymore) 2. Additional test cases are added as required Updated the snapshots. The change from `parse_fstrings` to `parse_suite` means that the snapshot would produce the module node instead of just a list of f-string parts. I've manually verified that the parts are still the same along with the node ranges. #7263 (comment) fixes: #7043 fixes: #6835
This PR adds support for PEP 701 in the parser to use the new tokens emitted by the lexer to construct the f-string node. Without an official grammar, the f-strings were parsed manually. Now that we've the specification, that is being used in the LALRPOP to parse the f-strings. This file includes the logic for parsing string literals and joining the implicit string concatenation. Now that we don't require parsing f-strings manually a lot of code involving the same is removed. Earlier, there were 2 entry points to this module: * `parse_string`: Used to parse a single string literal * `parse_strings`: Used to parse strings which were implicitly concatenated Now, there are 3 entry points: * `parse_string_literal`: Renamed from `parse_string` * `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is basically a string literal without the quotes * `concatenate_strings`: Renamed from `parse_strings` but now it takes the parsed nodes instead. So, we just need to concatenate them into a single node. > A short primer on `FStringMiddle` token: This includes the portion of text inside the f-string that's not part of the expression and isn't an opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the `foo `, `.3f` and ` bar` are `FStringMiddle` token content. ***Discussion in the official implementation: python/cpython#102855 (comment) This change in the AST is when unicode strings (prefixed with `u`) and f-strings are used in an implicitly concatenated string value. For example, ```python u"foo" f"{bar}" "baz" " some" ``` Pre Python 3.12, the kind field would be assigned only if the prefix was on the first string. So, taking the above example, both `"foo"` and `"baz some"` (implicit concatenation) would be given the `u` kind: <details><summary>Pre 3.12 AST:</summary> <p> ```python Constant(value='foo', kind='u'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='baz some', kind='u') ``` </p> </details> But, post Python 3.12, only the string with the `u` prefix will be assigned the value: <details><summary>Pre 3.12 AST:</summary> <p> ```python Constant(value='foo', kind='u'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='baz some') ``` </p> </details> Here are some more iterations around the change: 1. `"foo" f"{bar}" u"baz" "no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno', kind='u') ``` </p> </details> 2. `"foo" f"{bar}" "baz" u"no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> 3. `u"foo" f"bar {baz} realy" u"bar" "no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foobar ', kind='u'), FormattedValue( value=Name(id='baz', ctx=Load()), conversion=-1), Constant(value=' realybarno', kind='u') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foobar ', kind='u'), FormattedValue( value=Name(id='baz', ctx=Load()), conversion=-1), Constant(value=' realybarno') ``` </p> </details> With the hand written parser, we were able to provide better error messages in case of any errors such as the following but now they all are removed and in those cases an "unexpected token" error will be thrown by lalrpop: * A closing delimiter was not opened properly * An opening delimiter was not closed properly * Empty expression not allowed The "Too many nested expressions in an f-string" was removed and instead we can create a lint rule for that. And, "The f-string expression cannot include the given character" was removed because f-strings now support those characters which are mainly same quotes as the outer ones, escape sequences, comments, etc. 1. Refactor existing test cases to use `parse_suite` instead of `parse_fstrings` (doesn't exists anymore) 2. Additional test cases are added as required Updated the snapshots. The change from `parse_fstrings` to `parse_suite` means that the snapshot would produce the module node instead of just a list of f-string parts. I've manually verified that the parts are still the same along with the node ranges. #7263 (comment) fixes: #7043 fixes: #6835
This PR adds support for PEP 701 in the parser to use the new tokens emitted by the lexer to construct the f-string node. Without an official grammar, the f-strings were parsed manually. Now that we've the specification, that is being used in the LALRPOP to parse the f-strings. This file includes the logic for parsing string literals and joining the implicit string concatenation. Now that we don't require parsing f-strings manually a lot of code involving the same is removed. Earlier, there were 2 entry points to this module: * `parse_string`: Used to parse a single string literal * `parse_strings`: Used to parse strings which were implicitly concatenated Now, there are 3 entry points: * `parse_string_literal`: Renamed from `parse_string` * `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is basically a string literal without the quotes * `concatenate_strings`: Renamed from `parse_strings` but now it takes the parsed nodes instead. So, we just need to concatenate them into a single node. > A short primer on `FStringMiddle` token: This includes the portion of text inside the f-string that's not part of the expression and isn't an opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the `foo `, `.3f` and ` bar` are `FStringMiddle` token content. ***Discussion in the official implementation: python/cpython#102855 (comment) This change in the AST is when unicode strings (prefixed with `u`) and f-strings are used in an implicitly concatenated string value. For example, ```python u"foo" f"{bar}" "baz" " some" ``` Pre Python 3.12, the kind field would be assigned only if the prefix was on the first string. So, taking the above example, both `"foo"` and `"baz some"` (implicit concatenation) would be given the `u` kind: <details><summary>Pre 3.12 AST:</summary> <p> ```python Constant(value='foo', kind='u'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='baz some', kind='u') ``` </p> </details> But, post Python 3.12, only the string with the `u` prefix will be assigned the value: <details><summary>Pre 3.12 AST:</summary> <p> ```python Constant(value='foo', kind='u'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='baz some') ``` </p> </details> Here are some more iterations around the change: 1. `"foo" f"{bar}" u"baz" "no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno', kind='u') ``` </p> </details> 2. `"foo" f"{bar}" "baz" u"no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foo'), FormattedValue( value=Name(id='bar', ctx=Load()), conversion=-1), Constant(value='bazno') ``` </p> </details> 3. `u"foo" f"bar {baz} realy" u"bar" "no"` <details><summary>Pre 3.12</summary> <p> ```python Constant(value='foobar ', kind='u'), FormattedValue( value=Name(id='baz', ctx=Load()), conversion=-1), Constant(value=' realybarno', kind='u') ``` </p> </details> <details><summary>3.12</summary> <p> ```python Constant(value='foobar ', kind='u'), FormattedValue( value=Name(id='baz', ctx=Load()), conversion=-1), Constant(value=' realybarno') ``` </p> </details> With the hand written parser, we were able to provide better error messages in case of any errors such as the following but now they all are removed and in those cases an "unexpected token" error will be thrown by lalrpop: * A closing delimiter was not opened properly * An opening delimiter was not closed properly * Empty expression not allowed The "Too many nested expressions in an f-string" was removed and instead we can create a lint rule for that. And, "The f-string expression cannot include the given character" was removed because f-strings now support those characters which are mainly same quotes as the outer ones, escape sequences, comments, etc. 1. Refactor existing test cases to use `parse_suite` instead of `parse_fstrings` (doesn't exists anymore) 2. Additional test cases are added as required Updated the snapshots. The change from `parse_fstrings` to `parse_suite` means that the snapshot would produce the module node instead of just a list of f-string parts. I've manually verified that the parts are still the same along with the node ranges. #7263 (comment) fixes: #7043 fixes: #6835
Summary
This PR adds support for PEP 701 in the parser to use the new tokens emitted by the lexer to construct the f-string node.
Grammar
Without an official grammar, the f-strings were parsed manually. Now that we've the specification, that is being used in the LALRPOP to parse the f-strings.
string.rs
This file includes the logic for parsing string literals and joining the implicit string concatenation. Now that we don't require parsing f-strings manually a lot of code involving the same is removed.
Earlier, there were 2 entry points to this module:
parse_string
: Used to parse a single string literalparse_strings
: Used to parse strings which were implicitly concatenatedNow, there are 3 entry points:
parse_string_literal
: Renamed fromparse_string
parse_fstring_middle
: Used to parse aFStringMiddle
token which is basically a string literal without the quotesconcatenate_strings
: Renamed fromparse_strings
but now it takes the parsed nodes instead. So, we just need to concatenate them into a single node.Constant::kind
changed in the ASTDiscussion in the official implementation: python/cpython#102855 (comment)
This change in the AST is when unicode strings (prefixed with
u
) and f-strings are used in an implicitly concatenated string value. For example,Pre Python 3.12, the kind field would be assigned only if the prefix was on the first string. So, taking the above example, both
"foo"
and"baz some"
(implicit concatenation) would be given theu
kind:Pre 3.12 AST:
But, post Python 3.12, only the string with the
u
prefix will be assigned the value:Pre 3.12 AST:
Here are some more iterations around the change:
"foo" f"{bar}" u"baz" "no"
Pre 3.12
3.12
"foo" f"{bar}" "baz" u"no"
Pre 3.12
3.12
u"foo" f"bar {baz} realy" u"bar" "no"
Pre 3.12
3.12
Errors
With the hand written parser, we were able to provide better error messages in case of any errors such as the following but now they all are removed and in those cases an "unexpected token" error will be thrown by lalrpop:
The "Too many nested expressions in an f-string" was removed and instead we can create a lint rule for that.
And, "The f-string expression cannot include the given character" was removed because f-strings now support those characters which are mainly same quotes as the outer ones, escape sequences, comments, etc.
Test Plan
parse_suite
instead ofparse_fstrings
(doesn't exists anymore)Updated the snapshots. The change from
parse_fstrings
toparse_suite
means that the snapshot would produce the module node instead of just a list of f-string parts. I've manually verified that the parts are still the same along with the node ranges.Benchmarks
#7263 (comment)
fixes: #7043
fixes: #6835