
Implement a Python PEG parser in Rust #566

Merged
merged 208 commits into from
Dec 21, 2021
Conversation

zsol
Member

@zsol zsol commented Dec 17, 2021

Summary

This massive PR implements an alternative Python parser that will allow LibCST to parse Python 3.10's new grammar features. The parser is implemented in Rust, but it's turned off by default through the LIBCST_PARSER_TYPE environment variable. Set it to native to enable. The PR also enables new CI steps that test just the Rust parser, as well as steps that produce binary wheels for a variety of CPython versions and platforms.

Note: this PR aims to be roughly feature-equivalent to the main branch, so it doesn't include new 3.10 syntax features. That will be addressed as a follow-up PR.

The new parser is implemented in the native/ directory and is organized into two Rust crates: libcst_derive contains macros that facilitate various features of CST nodes, and libcst contains the parser itself (including the Python grammar), a tokenizer implementation by @bgw, and a very basic representation of CST nodes. Parsing is done by

  1. tokenizing the input UTF-8 string (bytes are not supported at the Rust layer; they are converted to UTF-8 strings by the Python wrapper)
  2. running the PEG parser on the tokenized input, which also captures certain anchor tokens in the resulting syntax tree
  3. using the anchor tokens to inflate the syntax tree into a proper CST

These steps are wrapped into a high-level parse_module API here, along with parse_statement and parse_expression functions which all just accept the input string and an optional encoding.
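As a rough illustration of this dispatch (a minimal stdlib-only sketch with stand-in function names, not LibCST's actual implementation), the Python wrapper can branch on the environment variable:

```python
import os

# Hypothetical stand-ins for the two parser backends; the real entry
# points live in LibCST's pure Python and Rust layers respectively.
def _pure_python_parse(source: str):
    return ("pure", source)

def _native_parse(source: str):
    return ("native", source)

def parse_module(source: str):
    # LIBCST_PARSER_TYPE=native selects the Rust parser; anything else
    # falls back to the default pure Python parser.
    if os.environ.get("LIBCST_PARSER_TYPE") == "native":
        return _native_parse(source)
    return _pure_python_parse(source)

os.environ["LIBCST_PARSER_TYPE"] = "native"
print(parse_module("x = 1")[0])  # native
```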

These Rust functions are exposed to Python here using the excellent PyO3 library, plus an IntoPy trait which is mostly implemented via a macro in libcst_derive.

Test Plan

Apart from Rust unit tests and roundtrip tests, the new implementation is also exercised by the pure Python test suite (by setting LIBCST_PARSER_TYPE=native before running python -m unittest). All of these run as CI steps.

Known missing features

Apart from the soon-to-come 3.10 syntactic features mentioned above, here's an incomplete list of the features the Rust implementation lacks compared to the pure Python parser:

  • No support for limiting the parser to a certain Python version. The parser currently accepts input that's valid in any of Python 3.6 through 3.9.
  • The native parser API only accepts UTF-8 encoded strings. Support for bytes is provided in the wrapping Python layer.
  • In some places the old parser returned a tuple of CSTNodes; the new parser returns a list (see the adjusted unit tests).
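The bytes-handling bullet can be sketched as follows (assumed behavior with a hypothetical helper name; the real wrapper may also honor the optional encoding argument and PEP 263 encoding cookies):

```python
def decode_source(source, encoding=None):
    # The native parser only accepts str; the Python wrapper decodes any
    # bytes input first, defaulting to UTF-8.
    if isinstance(source, bytes):
        return source.decode(encoding or "utf-8")
    return source

assert decode_source(b"x = 1") == "x = 1"
assert decode_source("x = 1") == "x = 1"
assert decode_source("\u03c0 = 3".encode("utf-8")) == "\u03c0 = 3"
```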

bgw and others added 30 commits December 7, 2021 10:53
This adds an experimental new native extension, with the goal of
improving performance, currently just reimplementing
`whitespace_parser`.

This appears to make parsing marginally faster overall, but the
performance gains are pretty limited because whitespace parsing is a
small portion of the overall parsing process, and there's some added
overhead of converting types (mostly Unicode strings) between Rust and
Python.

Instead, this helps show what's needed to port part of the codebase over
to rust in an example that's small enough to be digestible, but big
enough to be more than a trivial toy.

I originally started by trying to port the tokenizer to rust, since it
takes a larger portion of time and it's infrequently modified (making it
a good candidate), but found issues with the implementation direction I
was taking, so I scrapped that for now.
Ports CPython's tokenize.c to Rust, and adds features to it to (mostly)
match the behavior of the Parso-based tokenizer.

Some tests are currently failing, but most of this is just due to
differences in error handling between the Python and Rust implementations.
I'll either fix or disable these cases in the future.

Like the whitespace_parser commit, this appears to make parsing faster
by a small margin, but tokenizing is just one small part of the overall
parsing process.
There are also tests and a simple benchmark.
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 17, 2021
@zsol zsol mentioned this pull request Dec 17, 2021
@bgw
Contributor

bgw commented Dec 17, 2021

Oh my god. The mad lad, @zsol actually did it. Congratulations!

I don't think I can reasonably read through all this code, but I might look through it a bit later. None of that should prevent you from merging though.

@zsol
Member Author

zsol commented Dec 17, 2021

Thanks! I'd still appreciate any review or thoughts for even parts of the code, because I'm 100% sure there are some questionable decisions in there that would benefit from ... questioning :) Can incorporate any changes in future PRs.

@stroxler
Contributor

Wow, there's a lot here! I'll also try to read through parts of it.

My view is that if tests pass, we can probably go ahead and merge even if there are questionable bits, as long as the CST data itself is reasonable. Nothing prevents us from revisiting the code later, and getting 3.10 PEG support landed is pretty time-sensitive.

Before I try to dive in, a few high level questions:

(1) are there any particular parts of the code I should focus on for review? I'm coming off of making some changes to the cpython parser for PEP 677 so I can try to focus on that, but there's a big difference between making one change, even a nontrivial one, versus trying to take in the whole thing at once!

(2) are there any changes you'd like to make on top of this? It might help me build context to not just read the code but make a small modification or two if there's something you already have in mind.

(3) for code that can parse using either the old or new parser, is the produced CST the same? My impression is yes given your description but just want to double-check. In my mind even if there are messy bits, if this is true we should err on the side of merging fast, especially since it's off by default.

@zsol
Member Author

zsol commented Dec 19, 2021

are there any particular parts of the code I should focus on for review? I'm coming off of making some changes to the cpython parser for PEP 677 so I can try to focus on that, but there's a big difference between making one change, even a nontrivial one, versus trying to take in the whole thing at once!

In that case, I'd recommend grammar.rs which implements step 2 from the PR description. Specifically I'm a bit worried about accepting syntax that shouldn't be accepted.

are there any changes you'd like to make on top of this? It might help me build context to not just read the code but make a small modification or two if there's something you already have in mind.

Yeah, I'd like to add match statements myself, but you could try extending the grammar to support parenthesized context managers (https://bugs.python.org/issue12782)

for code that can parse using either the old or new parser, is the produced CST the same?

Yes! The only case where this isn't true is the tuple-vs-list problem I mention in the PR description. PyO3 automatically converts Rust's Vec<_> into a Python list, but in some cases the pure Python parser produces a tuple.
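The practical consequence for tests is that tuples and lists never compare equal in Python, even with identical elements:

```python
old_style = (1, 2, 3)  # the pure Python parser produced tuples in places
new_style = [1, 2, 3]  # PyO3 converts a Rust Vec into a Python list

assert old_style != new_style        # type mismatch, not value mismatch
assert list(old_style) == new_style  # equal once normalized to one type
```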

@@ -1120,6 +1121,8 @@ def test_invalid(self, **kwargs: Any) -> None:
         )
     )
     def test_versions(self, **kwargs: Any) -> None:
+        if is_native() and not kwargs.get("expect_success", True):
+            self.skipTest("parse errors are disabled for native parser")
Contributor

Do you have a plan for making them work? This is an important part of the API.

Member Author

Yeah it's possible to make it work, but my priority is to get 3.10 working first. The old parser won't be removed until this feature is supported by the new one.

@@ -114,6 +114,23 @@ def _detect_future_imports(tokens: Iterable[Token]) -> FrozenSet[str]:
return frozenset(future_imports)
Contributor

I'm not 100% sure this has good tests. Does this avoid having to tokenize twice?

Member Author

Hmm, I don't think this code is ever getting run in native mode. Will probably have to be reimplemented on the rust side in the future. As far as I can tell this is only being used to figure out the exact grammar version to be used.

@thatch
Contributor

thatch commented Dec 20, 2021

Can you expand more on

No support for limiting the parser to a certain Python version

Are you intending to add that before a non-pre release? I put a decent chunk of effort into supporting old versions, and even have a draft going back to 2.3 that I'd be happy to rework if version-dependent support existed. See #275.

@stroxler
Contributor

@thatch my impression of what @zsol was saying about python versions is just that the parser won't throw errors if you use Python 3.9 syntax in a 3.6 codebase.

In my mind this isn't a huge problem: there are generally other ways to get fast feedback about syntax errors, and it's not critical that LibCST deal with it. If there's an exception to that, it's on the output of codemods, not the input, but presumably that's not the parser's job anyway.

@bgw
Contributor

bgw commented Dec 21, 2021

the parser won't throw errors if you use Python 3.9 syntax in a 3.6 codebase

How does this work for bits of Python grammar that were changed in backwards-incompatible ways? In older versions of Python, it was valid to use async and await as variable names. In 3.7+, it's no longer valid. However, if you enable the async/await tokenizer hacks on 3.7+ code, you might run into issues with async generator expressions (https://bugs.python.org/issue31708).
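The version sensitivity described here is easy to demonstrate with CPython's own compiler (a stdlib-only sketch; on Python 3.6 and earlier the first example compiled fine):

```python
def is_valid_syntax(source: str) -> bool:
    # compile() applies the syntax rules of the running interpreter.
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False

# "async" became a hard keyword in Python 3.7, so using it as a variable
# name is now a SyntaxError:
print(is_valid_syntax("async = 1"))  # False on 3.7+
print(is_valid_syntax("x = 1"))      # True
```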

Contributor

@stroxler stroxler left a comment

I read the grammar itself (not the entire grammar.rs file, just up to line 1180 or so) end-to-end and compared to the main grammar at
https://docs.python.org/3.11/reference/grammar.html
as well as, where necessary, the full cpython PEG at
https://github.com/python/cpython/blob/main/Grammar/python.gram

Overall it looks good to me, especially in light of having test coverage.

I flagged a number of things:

  • [substantive] I think I only found 2 or 3 things that might be wrong, and
    none of them seem like they ought to block merge of an off-by-default
    new parser - I'd say address them if you want or we can merge and follow
    up later

  • [nit / curiosity] a few places the rules seemed awkward, or slightly out of
    sync with the python official grammar in ways that appeared unnecessary.
    I definitely may be missing something though, marked these as [nit], [curiosity]
    or [both] and it's probably fine to merge without addressing (although if there
    are reasons I'd appreciate an explanation since I'd learn something!)

  • [compat] I've marked rules that are not in a great order or otherwise have
    cosmetic issues relative to the official grammar as [compat]. Feel free to
    ignore these, they shouldn't block land but I do think I might follow up
    later with a reorder because that would make checking them against
    one another easier going forward

libcst/_parser/entrypoints.py Outdated Show resolved Hide resolved
native/libcst/src/parser/grammar.rs Show resolved Hide resolved
native/libcst/src/parser/grammar.rs Outdated Show resolved Hide resolved
native/libcst/src/parser/grammar.rs Outdated Show resolved Hide resolved
native/libcst/src/parser/grammar.rs Outdated Show resolved Hide resolved
native/libcst/src/parser/grammar.rs Outdated Show resolved Hide resolved
native/libcst/src/parser/grammar.rs Outdated Show resolved Hide resolved
native/libcst/src/parser/grammar.rs Outdated Show resolved Hide resolved
native/libcst/src/parser/grammar.rs Outdated Show resolved Hide resolved
make_import_alias(n, asname)
}

rule dotted_name() -> NameOrAttribute<'a>
Contributor

[compat] this rule diverges pretty heavily from the latest grammar which uses a recursive rule instead.

I'm pretty certain they are equivalent
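To illustrate the equivalence (a toy sketch over a pre-split token list, not the PR's actual peg rules):

```python
def dotted_name_recursive(tokens):
    # Left-recursive shape, as in CPython's grammar:
    #   dotted_name: dotted_name '.' NAME | NAME
    # (unrolled from the right here to avoid infinite recursion)
    if len(tokens) >= 3 and tokens[-2] == ".":
        return dotted_name_recursive(tokens[:-2]) + [tokens[-1]]
    return [tokens[0]]

def dotted_name_iterative(tokens):
    # Repetition shape: dotted_name: NAME ('.' NAME)*
    return tokens[0::2]

toks = ["a", ".", "b", ".", "c"]
assert dotted_name_recursive(toks) == dotted_name_iterative(toks) == ["a", "b", "c"]
```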

Member Author

I forget why this was necessary. Maybe it wasn't? Not sure.

@stroxler
Contributor

How does this work for bits of Python grammar that were changed in backwards-incompatible ways?

Not sure - I was mostly describing my understanding of @zsol's comment about supporting all versions, but it does appear to me that the new parser is using async / await as dedicated tokens.

I wonder if a solution, when we roll the new parser out as default, is to use it only for 3.7+? The main motivation for moving to the PEG parser is to properly support newer Python versions. If we want to support 3.6 properly with the native parser, we could always put the time into that later.

@zsol zsol merged commit c02de9b into Instagram:main Dec 21, 2021
@amyreese
Member

No support for limiting the parser to a certain Python version. The parser will accept input that's valid in any of Python 3.9, 3.8, 3.7, or 3.6 at the moment.

@zsol can you confirm whether you intend to add support for this functionality before the Rust parser becomes enabled by default? I.e., have a separate grammar for each major release of CPython, and ensure strict parsing when given a target version?

There is a lot of work that Tim had in flight for supporting older releases of CPython, potentially back to 2.x versions, that I don't want to lose, as well as the functionality guaranteeing correctness and safety when parsing/modifying/dumping trees for production code. E.g., when writing codemods, we want to know that if we ask to parse 3.7 code, we're not getting trees/nodes/grammar that are only available in 3.8 or 3.9, and vice versa.

@zsol
Member Author

zsol commented Dec 22, 2021 via email
