Implement a Python PEG parser in Rust #566
Conversation
This adds an experimental new native extension, with the goal of improving performance, currently just reimplementing `whitespace_parser`. This appears to make parsing marginally faster overall, but the performance gains are pretty limited because whitespace parsing is a small portion of the overall parsing process, and there's some added overhead of converting types (mostly unicode strings) between rust and python. Instead, this helps show what's needed to port part of the codebase over to rust in an example that's small enough to be digestible, but big enough to be more than a trivial toy. I originally started by trying to port the tokenizer to rust, since it takes a larger portion of time and it's infrequently modified (making it a good candidate), but found issues with the implementation direction I was taking, so I scrapped that for now.
Ports CPython's tokenize.c to rust, and adds features to it to (mostly) match the behavior of the Parso-based tokenizer. Some tests are currently failing, but most of this is just due to differences in error handling between the Python and Rust implementations. I'll either fix or disable these cases in the future. Like the whitespace_parser commit, this appears to make parsing faster by a small margin, but tokenizing is just one small part of the overall parsing process.
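For readers unfamiliar with what a tokenizer produces, here's a quick illustration using CPython's stdlib `tokenize` module (the Rust port follows `tokenize.c`, so the token stream has the same general shape; this is not the Rust code from the PR):

```python
import io
import tokenize

# Tokenize a tiny snippet with the stdlib tokenizer to show the kind of
# (token type, text) stream a parser consumes.
src = "x = 1\n"
toks = [(tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(io.StringIO(src).readline)]
print(toks)
```

The stream here is NAME, OP, NUMBER, NEWLINE, then an end marker; the whitespace between tokens is what `whitespace_parser` deals with separately.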
There are also tests and a simple benchmark.
Oh my god. The mad lad @zsol actually did it. Congratulations! I don't think I can reasonably read through all this code, but I might look through it a bit later. None of that should prevent you from merging though.
Thanks! I'd still appreciate any review or thoughts on even parts of the code, because I'm 100% sure there are some questionable decisions in there that would benefit from ... questioning :) Can incorporate any changes in future PRs.
Wow, there's a lot here! I'll also try to read through parts of it. My view is that if tests pass we can probably go ahead and merge it even if there are questionable bits, as long as the CST data itself is reasonable; nothing prevents us from revisiting the code later, and getting 3.10 PEG support landed is pretty time-sensitive. Before I try to dive in, a few high-level questions:

(1) Are there any particular parts of the code I should focus on for review? I'm coming off of making some changes to the cpython parser for PEP 677, so I can try to focus on that, but there's a big difference between making one change, even a nontrivial one, versus trying to take in the whole thing at once!

(2) Are there any changes you'd like to make on top of this? It might help me build context to not just read the code but make a small modification or two, if there's something you already have in mind.

(3) For code that can parse using either the old or new parser, is the produced CST the same? My impression is yes given your description, but just want to double-check. In my mind, even if there are messy bits, if this is true we should err on the side of merging fast, especially since it's off by default.
In that case, I'd recommend
Yeah, I'd like to add match statements myself, but you could try extending the grammar to support parenthesized context managers (https://bugs.python.org/issue12782)
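For reference, this is the syntax from that issue. The snippet below (an illustration, not part of this PR) checks it with the stdlib `ast` module, which accepts it on CPython 3.10+:

```python
import ast
import sys

# Parenthesized context managers (https://bugs.python.org/issue12782):
# multiple `as` targets grouped in parentheses after `with`.
src = "with (open('a') as f, open('b') as g):\n    pass\n"
if sys.version_info >= (3, 10):
    node = ast.parse(src).body[0]
    print(type(node).__name__)
```

The source is only parsed here, never executed, so no files are actually opened.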
Yes! The only case where this isn't true is the tuple-vs-list problem I mention in the PR description. PyO3 automatically converts rust's
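A tiny illustration (not LibCST code) of why a tuple-vs-list conversion detail matters: sequences of different types never compare equal in Python, so a backend that returns lists where the other returns tuples breaks naive CST equality checks.

```python
# If one parser backend stores node children as a tuple and the other as a
# list, equal contents still compare unequal until the types are normalized.
pure_children = ("a", "b")    # hypothetical pure-Python representation
native_children = ["a", "b"]  # hypothetical value converted from a Rust Vec
print(pure_children == native_children)
print(pure_children == tuple(native_children))
```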
@@ -1120,6 +1121,8 @@ def test_invalid(self, **kwargs: Any) -> None:
        )
    )
    def test_versions(self, **kwargs: Any) -> None:
+       if is_native() and not kwargs.get("expect_success", True):
+           self.skipTest("parse errors are disabled for native parser")
Do you have a plan for making them work? This is an important part of the API.
Yeah it's possible to make it work, but my priority is to get 3.10 working first. The old parser won't be removed until this feature is supported by the new one.
@@ -114,6 +114,23 @@ def _detect_future_imports(tokens: Iterable[Token]) -> FrozenSet[str]:
    return frozenset(future_imports)
I'm not 100% sure this has good tests. Does this avoid having to tokenize twice?
Hmm, I don't think this code is ever getting run in native mode. Will probably have to be reimplemented on the rust side in the future. As far as I can tell this is only being used to figure out the exact grammar version to be used.
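For context, a rough sketch of what this detection does. This is a hedged reimplementation using the stdlib `ast` module with a hypothetical helper name, not LibCST's actual code, which walks the token stream instead:

```python
import ast

# Sketch: collect the names imported via `from __future__ import ...`, which
# may only appear before other statements and can influence which grammar
# version applies to the rest of the module.
def detect_future_imports(source: str) -> frozenset:
    names = set()
    for node in ast.parse(source).body:
        if isinstance(node, ast.ImportFrom) and node.module == "__future__":
            names.update(alias.name for alias in node.names)
        elif not isinstance(node, ast.Expr):  # allow a leading docstring
            break  # future imports can't appear after other statements
    return frozenset(names)

print(sorted(detect_future_imports("from __future__ import annotations\nx = 1")))
```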
Can you expand more on
Are you intending to add that before a non-pre release? I put a decent chunk of effort into supporting old versions and even have a draft going back to 2.3 that I'd be happy to rework if version-dependent support existed. See #275.
@thatch my impression of what @zsol was saying about python versions is just that the parser won't throw errors if you use Python 3.9 syntax in a 3.6 codebase. In my mind this isn't a huge problem, generally there are other ways to get pretty fast feedback about syntax errors, and it's not critical that LibCST deal with it... if there's an exception to that I'd say it's actually on the output of codemods, not the input, but presumably that's not the parser anyway
How does this work for bits of Python grammar that were changed in backwards-incompatible ways? In older versions of Python, it was valid to use
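A concrete example of that kind of backwards-incompatible change (an illustration, not from this PR): `async` was a valid identifier through Python 3.6 but has been a hard keyword since 3.7, so a single grammar can't faithfully model both versions.

```python
# Check whether a snippet is accepted by the running interpreter's grammar.
def accepts(source: str) -> bool:
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False

print(accepts("async = 1"))  # False on CPython 3.7+
print(accepts("x = 1"))      # True
```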
I read the grammar itself (not the entire grammar.rs file, just up to line 1180 or so) end-to-end and compared to the main grammar at
https://docs.python.org/3.11/reference/grammar.html
as well as, where necessary, the full cpython PEG at
https://github.com/python/cpython/blob/main/Grammar/python.gram
Overall it looks good to me, especially in light of having test coverage.
I flagged a number of things:

- [substantive] I think I only found 2 or 3 things that might be wrong, and none of them seem like they ought to block merge of an off-by-default new parser. I'd say address them if you want, or we can merge and follow up later.
- [nit / curiosity] A few places the rules seemed awkward, or slightly out of sync with the official Python grammar in ways that appeared unnecessary. I definitely may be missing something though; I marked these as [nit], [curiosity], or [both], and it's probably fine to merge without addressing them (although if there are reasons I'd appreciate an explanation since I'd learn something!).
- [compat] I've marked rules that are not in a great order or otherwise have cosmetic issues relative to the official grammar as [compat]. Feel free to ignore these; they shouldn't block land, but I do think I might follow up later with a reorder, because that would make checking them against one another easier going forward.
native/libcst/src/parser/grammar.rs (Outdated)

        make_import_alias(n, asname)
    }

    rule dotted_name() -> NameOrAttribute<'a>
[compat] This rule diverges pretty heavily from the latest grammar, which uses a recursive rule instead. I'm pretty certain they are equivalent.
I forget why this was necessary. Maybe it wasn't? Not sure.
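To make the equivalence claim concrete, here's a small standalone sketch (plain Python, not the peg-rust rule) showing that the iterative formulation `NAME ("." NAME)*` and the left-recursive formulation `dotted_name "." NAME | NAME` accept the same strings and yield the same parts:

```python
import re

NAME = re.compile(r"[A-Za-z_]\w*")

def parse_iterative(s: str) -> list:
    # NAME ("." NAME)* : split on dots and check every piece is a NAME.
    parts = s.split(".")
    if not all(NAME.fullmatch(p) for p in parts):
        raise SyntaxError(s)
    return parts

def parse_recursive(s: str) -> list:
    # dotted_name: dotted_name "." NAME | NAME (resolved right-to-left here).
    head, dot, last = s.rpartition(".")
    if not dot:
        if not NAME.fullmatch(s):
            raise SyntaxError(s)
        return [s]
    if not NAME.fullmatch(last):
        raise SyntaxError(s)
    return parse_recursive(head) + [last]

print(parse_iterative("a.b.c"), parse_recursive("a.b.c"))
```

Both return the same list of dotted parts for any input they accept, which is why the two grammar shapes are interchangeable here.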
Not sure - I was mostly describing my understanding of @zsol's comment about supporting all versions, but it does appear to me that the new parser is using … I wonder if a solution is to use the new parser, when we roll it out as default, only for 3.7+? The main motivation to move to the PEG is to properly support newer python versions. If we want to support 3.6 properly with the native parser we could always put the time into that later.
@zsol can you confirm if you intend to add support for this functionality before the Rust parser becomes enabled by default? ie, have separate grammar for each major release of CPython, and ensure strict parsing when given a target version? There is a lot of work that Tim had in-flight for supporting older releases of CPython, potentially back to 2.x versions, that I don't want to lose, as well as the functionality in guaranteeing correctness and safety when parsing/modifying/dumping trees for production code. Eg, when writing codemods, we want to know that if we ask to parse with 3.7 code, that we're not getting trees/nodes/grammar that are only available in 3.8 or 3.9, and vice versa.
Yeah this feature won't be lost one way or another. Having said that, I don't think supporting python 2 grammars is very valuable at this point, as I commented on Tim's issue recently.
…On Tue, 21 Dec 2021, 22:54 John Reese wrote:

> No support for limiting the parser to a certain Python version. The parser will accept input that's valid in any of Python 3.9, 3.8, 3.7, or 3.6 at the moment.
> @zsol can you confirm if you intend to add support for this functionality before the Rust parser becomes enabled by default? […]
Summary
This massive PR implements an alternative Python parser that will allow LibCST to parse Python 3.10's new grammar features. The parser is implemented in Rust, but it's turned off by default; set the `LIBCST_PARSER_TYPE` environment variable to `native` to enable it. The PR also enables new CI steps that test just the Rust parser, as well as steps that produce binary wheels for a variety of CPython versions and platforms.

Note: this PR aims to be roughly feature-equivalent to the main branch, so it doesn't include new 3.10 syntax features. Those will be addressed in a follow-up PR.
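A minimal sketch (hypothetical dispatch code, not LibCST's actual internals) of how a backend can be selected from that environment variable:

```python
import os

# Hypothetical helper: choose the parser backend based on LIBCST_PARSER_TYPE,
# defaulting to the pure-Python implementation.
def parser_backend() -> str:
    return "native" if os.environ.get("LIBCST_PARSER_TYPE") == "native" else "pure"

os.environ["LIBCST_PARSER_TYPE"] = "native"
print(parser_backend())
```

Unsetting the variable, or setting it to any other value, falls back to the pure-Python parser.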
The new parser is implemented in the `native/` directory and is organized into two Rust crates: `libcst_derive` contains some macros to facilitate various features of CST nodes, and `libcst` contains the parser itself (including the Python grammar), a tokenizer implementation by @bgw, and a very basic representation of CST nodes.

Parsing is done in a few steps, which are wrapped into a high-level `parse_module` API, along with `parse_statement` and `parse_expression` functions, which all just accept the input string and an optional encoding. These Rust functions are exposed to Python using the excellent PyO3 library, plus an `IntoPy` trait which is mostly implemented via a macro in `libcst_derive`.
.Test Plan
Apart from Rust unit tests and roundtrip tests, the new implementation is also tested by running the pure Python tests against it (by setting `LIBCST_PARSER_TYPE=native` before running `python -m unittest`). All of these are part of CI steps.

Known missing features

Apart from the soon-to-come 3.10 syntactic features mentioned above, here's an incomplete list of the features the Rust implementation lacks compared to the pure Python parser: