Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-42123: Run the parser two times and only enable invalid rules on the second run #22111

Merged
merged 3 commits into from
Oct 26, 2020

Conversation

lysnikolaou
Copy link
Member

@lysnikolaou lysnikolaou commented Sep 5, 2020

Here's the easiest way I could think of to do this. Are we okay with something
as simple as this or would we like to investigate further whether we could
actually use the memo cache wherever possible?

The first parser run is only responsible for detecting whether
there is a SyntaxError or not. If there isn't, the AST gets returned.
Otherwise, the parser is run a second time with all the invalid_*
rules enabled so that all the customized error messages get produced.

Regarding bpo-41659, we can now implement a very simple fix:

--- a/Grammar/python.gram
+++ b/Grammar/python.gram
@@ -460,6 +460,7 @@ await_primary[expr_ty] (memo):
     | AWAIT a=primary { CHECK_VERSION(5, "Await expressions are", _Py_Await(a, EXTRA)) }
     | primary
 primary[expr_ty]:
+    | primary b='{' { RAISE_SYNTAX_ERROR_KNOWN_LOCATION(b, "invalid syntax") }
     | a=primary '.' b=NAME { _Py_Attribute(a, b->v.Name.id, Load, EXTRA) }
     | a=primary b=genexp { _Py_Call(a, CHECK(_PyPegen_singleton_seq(p, b)), NULL, EXTRA) }
     | a=primary '(' b=[arguments] ')' {

With this patch, I'm now getting:

➜  cpygen git:(parser-second-run) ✗ ./python
Python 3.10.0a0 (heads/parser-second-run-dirty:72fe7254b9, Sep  5 2020, 16:40:28) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> a{
  File "<stdin>", line 1
    a{
     ^
SyntaxError: invalid syntax
>>> a.b.c{
  File "<stdin>", line 1
    a.b.c{
         ^
SyntaxError: invalid syntax

Thoughts?

https://bugs.python.org/issue42123

The first parser run is only responsible for detecting whether
there is a `SyntaxError` or not. If there isn't the AST gets returned.
Otherwise, the parser is run a second time with all the `invalid_*`
rules enabled so that all the customized error messages get produced.
@lysnikolaou lysnikolaou changed the title Implement running the parser a second time for the errors messages Implement running the parser a second time for the error messages Sep 5, 2020
@lysnikolaou
Copy link
Member Author

Two more things:

  1. I actually wanted to open the PR on the we-like-parsers repo, but I messed that up. I don't think there's a problem with it being here, so I'm keeping this open.
  2. Could someone actually benchmark this, since the benchmarks on my personal PC are not at all reliable?

@pablogsal
Copy link
Member

Are we okay with something
as simple as this or would we like to investigate further whether we could
actually use the memo cache wherever possible?

I think one of the requirements for Guido's approach to work is that the memo cache is preserved between restarts of the parser.

2. Could someone actually benchmark this, since the benchmarks on my personal PC are not at all reliable?

I can benchmark this tomorrow as I am currently repairing my main PC (watercooling leak 😱 ) and is not operational.

@lysnikolaou
Copy link
Member Author

Oooh, that's correct. Ok, let me have another look at this then.

@lysnikolaou
Copy link
Member Author

@pablogsal Do you have the bandwidth for another review on this? If you feel it's okay, I'm going to add an issue and a NEWS blurb.

@pablogsal
Copy link
Member

@pablogsal Do you have the bandwidth for another review on this? If you feel it's okay, I'm going to add an issue and a NEWS blurb.

Will check this if I can today or tomorrow

@pablogsal
Copy link
Member

After some benchmarking, it results that the parser with this PR is up to 10% (in xxl benchmark) and ~3% faster in parsing the stdlib. Let's merge it :)

@lysnikolaou lysnikolaou changed the title Implement running the parser a second time for the error messages bpo-42123: Implement running the parser two times and only enable invalid rules on the second run Oct 22, 2020
@lysnikolaou lysnikolaou changed the title bpo-42123: Implement running the parser two times and only enable invalid rules on the second run bpo-42123: Run the parser two times and only enable invalid rules on the second run Oct 22, 2020
@lysnikolaou
Copy link
Member Author

@ambv Do we backport this to 3.9? It's very low-risk and gives a sizable performance boost.

@terryjreedy
Copy link
Member

"Do we backport this to 3.9?" We almost never backport performance patches. Low risk is not no risk. With the yearly releases, there is less than a year to wait. With the early and repeated alphas, bleeding edge fans can get this soon enough.

@gvanrossum
Copy link
Member

OTOH, for the PEG parser we've been backporting everything relentlessly, otherwise when we fix a bug in 3.10 the backport is a pain in the neck.

@lysnikolaou
Copy link
Member Author

Ok, since there's no consensus, let's not backport this.

@lysnikolaou lysnikolaou merged commit bca7014 into python:master Oct 26, 2020
@lysnikolaou lysnikolaou deleted the parser-second-run branch October 26, 2020 22:42
@terryjreedy
Copy link
Member

I intentionally added 'almost' to my previous comment to suggest that carefully considered exceptions are possible. I forgot about the 3.9 escape option to run the old parser. Guido's comment, when merge conflicts really happen, is real to me from my experience with IDLE. We (I) 'relentlessly' backport everything, but have not always. Manual backports are an extra risk. So is not backporting a fix because it is too much trouble.

I believe PP was once tested on a large number of other python codebases other than the stdlib. Is that done routinely, and in particular for this patch (before and after)? That would make a difference to me.

@lysnikolaou
Copy link
Member Author

I think @pablogsal benchmarked this against something like the 10 most downloaded PyPI packages, although I'm not sure.

@pablogsal
Copy link
Member

I think @pablogsal benchmarked this against something like the 10 most downloaded PyPI packages, although I'm not sure.

Yup, I did the benchmark again the 15 most downloaded packages.

@lysnikolaou
Copy link
Member Author

FWIW I think it's worth it to backport, especially now that #23006 was merged into 3.9.

@pablogsal
Copy link
Member

FWIW I think it's worth it to backport, especially now that #23006 was merged into 3.9.

I would be supportive of the backport for the reasons previously mentioned

@gvanrossum
Copy link
Member

Agreed.

lysnikolaou added a commit to lysnikolaou/cpython that referenced this pull request Oct 27, 2020
…es on the second run (pythonGH-22111)

* Implement running the parser a second time for the errors messages

The first parser run is only responsible for detecting whether
there is a `SyntaxError` or not. If there isn't the AST gets returned.
Otherwise, the parser is run a second time with all the `invalid_*`
rules enabled so that all the customized error messages get produced.

(cherry picked from commit bca7014)
lysnikolaou added a commit that referenced this pull request Oct 28, 2020
…es on the second run (GH-22111) (GH-23011)

* Implement running the parser a second time for the errors messages

The first parser run is only responsible for detecting whether
there is a `SyntaxError` or not. If there isn't the AST gets returned.
Otherwise, the parser is run a second time with all the `invalid_*`
rules enabled so that all the customized error messages get produced.

(cherry picked from commit bca7014)
shihai1991 added a commit to shihai1991/cpython that referenced this pull request Oct 29, 2020
…lots1

* origin/master: (365 commits)
  bpo-42029: Remove IRIX code (pythonGH-23023)
  bpo-42143: Ensure PyFunction_NewWithQualName() can't fail after creating the func object (pythonGH-22953)
  bpo-34204: Use pickle.DEFAULT_PROTOCOL in shelve (pythonGH-19639)
  bpo-41805: Documentation for PEP 585 (pythonGH-22615)
  bpo-42161: Micro-optimize _collections._count_elements() (pythonGH-23008)
  bpo-42161: Remove private _PyLong_Zero and _PyLong_One (pythonGH-23003)
  bpo-42099: Fix reference to ob_type in unionobject.c and ceval (pythonGH-22829)
  bpo-41659: Disallow curly brace directly after primary (pythonGH-22996)
  bpo-6761: Enhance __call__ documentation (pythonGH-7987)
  bpo-42161: Modules/ uses _PyLong_GetZero() and _PyLong_GetOne() (pythonGH-22998)
  bpo-41474, Makefile: Add dependency on cpython/frameobject.h (pythonGH-22999)
  bpo-42157: Rename unicodedata.ucnhash_CAPI (pythonGH-22994)
  bpo-42161: Use _PyLong_GetZero() and _PyLong_GetOne() (pythonGH-22995)
  bpo-30681: Support invalid date format or value in email Date header (pythonGH-22090)
  bpo-42161: Add _PyLong_GetZero() and _PyLong_GetOne() (pythonGH-22993)
  bpo-42123: Run the parser two times and only enable invalid rules on the second run (pythonGH-22111)
  bpo-42157: Convert unicodedata.UCD to heap type (pythonGH-22991)
  bpo-42157: unicodedata avoids references to UCD_Type (pythonGH-22990)
  bpo-39101: Fixes BaseException hang in IsolatedAsyncioTestCase. (pythonGH-22654)
  bpo-1635741: _PyUnicode_Name_CAPI moves to internal C API (pythonGH-22713)
  ...
adorilson pushed a commit to adorilson/cpython that referenced this pull request Mar 13, 2021
…the second run (pythonGH-22111)

* Implement running the parser a second time for the errors messages

The first parser run is only responsible for detecting whether
there is a `SyntaxError` or not. If there isn't the AST gets returned.
Otherwise, the parser is run a second time with all the `invalid_*`
rules enabled so that all the customized error messages get produced.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants