-
-
Notifications
You must be signed in to change notification settings - Fork 31.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpo-33416: Add end positions to Python AST #11605
Conversation
Conflicts: Parser/parsetok.c
Never mind, that was |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm very excited for this change
Doc/library/ast.rst
Outdated
The UTF-8 offset is recorded because the parser uses UTF-8 internally. | ||
|
||
Note that the end positions are not required by the compiler and are | ||
therefore optional. The end offset if *after* the last symbol, for example |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
therefore optional. The end offset if *after* the last symbol, for example | |
therefore optional. The end offset is *after* the last symbol, for example |
Doc/library/ast.rst
Outdated
This is useful to "move code" to a different location in a file. | ||
Increment the line number and end line number of each node in the tree | ||
starting at *node* by *n*. This is useful to "move code" to a different | ||
location in a file.. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
location in a file.. | |
location in a file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Phew! I don't have much to add, this all looks great. I don't think in these days 20% extra space to represent CST+AST for a single module is a big deal (though it would be nice to see what the difference is in absolute number of bytes for Lib/tkinter/init.py, which AFAIK is still the largest stdlib module).
to *new_node* if possible, and return *new_node*. | ||
Copy source location (:attr:`lineno`, :attr:`col_offset`, :attr:`end_lineno`, | ||
and :attr:`end_col_offset`) from *old_node* to *new_node* if possible, | ||
and return *new_node*. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this only affect the node, or the whole tree? I've never used this so I don't know what to expect from context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one (unlike some others) is non-recursive, one can even copy from a node of a different kind.
@@ -438,152 +446,188 @@ mod_ty _Py_Interactive(asdl_seq * body, PyArena *arena); | |||
mod_ty _Py_Expression(expr_ty body, PyArena *arena); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few lines up, I wonder if it would be a good idea to add a comment explaining that these macros also change the definition of the functions (not just the calls).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment.
_fix(child, lineno, col_offset) | ||
_fix(node, 1, 0) | ||
_fix(child, lineno, col_offset, end_lineno, end_col_offset) | ||
_fix(node, 1, 0, 1, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This whole function looks a bit suspicious.Shouldn't it at least ensure that (end_lineno, end_col_offset) > (lineno, col_offset)
? (Again, I guess I don't know the use case.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is used to compile manually generated AST to bytecode. Although technically end positions are not needed for the compiler (only start positions are recorded in the byte code), it is probably good to have them fixed. The algorithm for fixing is quite naive (take either a position of nearest root-side node if known, otherwise use start of the file), but it never intended to be robust, it exist just to allow compilation in those case where a user doesn't care about line numbers.
Lib/ast.py
Outdated
|
||
if end_lineno == lineno: | ||
source_line = source.splitlines()[lineno] | ||
return source_line.encode()[col_offset:end_col_offset].decode(coding) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if I read this well, source
is bytes, and the col_offset
s are byte offsets? And a tab counts as one space? Is that explicit somewhere in the ast
module docs? (This is easy to miss when writing code since most source code is ASCII only so the difference doesn't matter, but it would break whenever a token appears after a non-ASCII character in an identifier or string literal.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All you say is right, and yes it is already mentioned in the docs that they are UTF-8 byte offsets
.
Lib/ast.py
Outdated
last = lines[end_lineno].encode()[:end_col_offset].decode(coding) | ||
middle_lines = lines[lineno+1:end_lineno] | ||
|
||
return '\n'.join([first] + middle_lines + [last]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would avoid two list copies if you used .insert() and .append() to merge first and last into the middle_lines list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
self.assertEqual(ast.get_source_segment(s_orig, cdef.body[0], padded=True), | ||
s_method) | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW I trust your tests. If they pass, they're good. :-)
@@ -0,0 +1 @@ | |||
Add end line and end column position information to the Python AST nodes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add that this is a backwards incompatible API change for anything that creates AST nodes? (It's fine if you just call ast.parse()
, but not if you ever create a new node.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually most of the things still work:
- All existing manual AST creation code will work,
ast.Constant(value=42, lineno=1, col_offset=0)
is still valid. - Compilation of existing manually generated ASTs will still work, the end position are
int?
inPython.asdl
, so they are allowed to be absent and/or beNone
.
However, some C-level compatibility is broken. I am not sure what is the source of truth for "official" C API. I checked docs.python.org and none of the functions I updated is listed there.
I will try to re-formulate this.
Python/ast.c
Outdated
@@ -2864,7 +2918,12 @@ ast_for_call(struct compiling *c, const node *n, expr_ty func, | |||
} | |||
} | |||
|
|||
return Call(func, args, keywords, func->lineno, func->col_offset, c->c_arena); | |||
// Ideally call ends where closing parenthesis is placed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(When does a call not have a close par?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It (close par) was NULL
for "fake" calls used in analyzing class definitions, but I just fixed this and removed the comment.
get_last_end_pos(asdl_seq *s, int *end_lineno, int *end_col_offset) | ||
{ | ||
int tot = asdl_seq_LEN(s); | ||
stmt_ty last = asdl_seq_GET(s, tot - 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need a check that tot > 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this just in case (to be future proof against any AST optimizers).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a part of comments.
Doc/library/ast.rst
Outdated
@@ -162,6 +170,19 @@ and classes for traversing abstract syntax trees: | |||
:class:`AsyncFunctionDef` is now supported. | |||
|
|||
|
|||
.. function:: get_code_segment(source, node, *, padded=False, coding='utf-8') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think encoding is more common term in this context.
Doc/library/ast.rst
Outdated
The UTF-8 offset is recorded because the parser uses UTF-8 internally. | ||
|
||
Note that the end positions are not required by the compiler and are | ||
therefore optional. The end offset if *after* the last symbol, for example |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does "optional" mean? Is it equal to None or an absent attribute?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually both are allowed. An optional attribute in Python.asdl
can either be absent or None
.
Lib/ast.py
Outdated
If *padded* is `True`, the first line of a multi-line statement will | ||
be padded with spaces to match its original position. | ||
""" | ||
if not hasattr(node, 'lineno'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be simplified:
try:
# get loneno, end_lineno, etc.
except AttributeError:
return None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Lib/ast.py
Outdated
|
||
if end_lineno == lineno: | ||
source_line = source.splitlines()[lineno] | ||
return source_line.encode()[col_offset:end_col_offset].decode(coding) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think coding should not be used here. Always use UTF-8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are totally right.
Lib/ast.py
Outdated
lines = source.splitlines() | ||
|
||
if padded: | ||
padding = ' ' * len(lines[lineno].encode()[:col_offset].decode(coding)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest to keep tabs and spaces (and maybe '\f'?) and replace all other characters with spaces. There may be similar code in C, look how the position of the caret is calculated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes. It looks like there are bunch of corner cases around \r\n\f\t
. I tried to update the function (including your suggestions) so that it mimics closely what happens in C code.
@@ -62,14 +62,14 @@ def test_product(self): | |||
|
|||
def test_attributes(self): | |||
stmt = self.types['stmt'] | |||
self.assertEqual(len(stmt.attributes), 2) | |||
self.assertEqual(len(stmt.attributes), 4) | |||
self.assertEqual(str(stmt.attributes[0]), 'Field(int, lineno)') | |||
self.assertEqual(str(stmt.attributes[1]), 'Field(int, col_offset)') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add tests for stmt.attributes[2] and stmt.attributes[3]?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
int | ||
PyNode_AddChild(node *n1, int type, char *str, int lineno, int col_offset) | ||
PyNode_AddChild(node *n1, int type, char *str, int lineno, int col_offset, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is not this a part of the C API? Should not we add a new function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the "official" source of truth for C API? I didn't find this function on docs.python.org.
@@ -225,7 +229,9 @@ future_hack(parser_state *ps) | |||
|
|||
int | |||
PyParser_AddToken(parser_state *ps, int type, char *str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is not this a part of the C API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here.
@asottile @gvanrossum @serhiy-storchaka Thank you for reviewing! I think I addressed all your comments. |
@@ -213,6 +226,77 @@ def get_docstring(node, clean=True): | |||
return text | |||
|
|||
|
|||
def _splitlines_no_ff(source): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option is to use .splitlines(keepends=True)
(which considers \f
as an endline) and then re-join those that aren't ending in \r
or \n
, I am not sure which approach is better.
@gvanrossum I just tried to compare the sizes of |
How did you measure this? IIUC there's also a size increase for the CST, and during translation from CST to AST both are in memory. Is it possible to measure the peak memory usage during this translation? |
I just looked at the difference in allocated memory reported in compile(source, filename, mode, PyCF_ONLY_AST) allocates 3.7 Mbytes before and 4.6 Mbytes after this PR (still the same 24% relative increase). Also this number (24%) is quite reasonable taking into account that all nodes in CST have exactly the same size: 40 bytes before, 48 bytes after (at least on my 64-bit Linux). |
I do think that's a very reasonable increase (we're not talking MicroPython here :-).
That's (I tried this myself on my Mac, and it looks like |
This is a size of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this! Let's land this. I combed through the code (both the tedious parts and the interesting parts) and it all looks good to me. I am not worried about the 25% increase in CST and AST tree sizes.
@gvanrossum |
I'm not sure. Their names don't have In the worst case scenario, if it's deemed an unacceptable backwards incompatibility, we can always add new functions that add |
OK, I will post to Python-Dev about this. |
The majority of this PR is tediously passing
end_lineno
andend_col_offset
everywhere. Here are non-trivial points:end_lineno
andend_col_offset
._PyNode_FinalizeEndPos
). Unless I made some mistake, the algorithm should be linear.end_col_offset
I use the common Python convention for indexing, for example forpass
theend_col_offset
is 4 (not 3), so that[0:4]
gives one the source code that corresponds to the node.ast.get_source_segment()
, to get source text segment corresponding to a given AST node. It is also useful for testing.An (inevitable) downside of this PR is that AST now takes almost 25% more memory. I think however it is probably justified by the benefits.
https://bugs.python.org/issue33416