Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-33416: Add end positions to Python AST #11605

Merged
merged 37 commits into from
Jan 22, 2019

Conversation

ilevkivskyi
Copy link
Member

@ilevkivskyi ilevkivskyi commented Jan 18, 2019

The majority of this PR is tediously passing end_lineno and end_col_offset everywhere. Here are non-trivial points:

  • It is not possible to reconstruct end positions in AST "on the fly", some information is lost after an AST node is constructed, so we need two more attributes for every AST node end_lineno and end_col_offset.
  • I add end position information to both CST and AST. Although it may be technically possible to avoid adding end positions to CST, the code becomes more cumbersome and less efficient.
  • Since the end position is not known for non-leaf CST nodes while the next token is added, this requires a bit of extra care (see _PyNode_FinalizeEndPos). Unless I made some mistake, the algorithm should be linear.
  • For statements, I "trim" the end position of suites to not include the terminal newlines and dedent (this seems to be what people would expect), for example in
    class C:
        pass
    
    pass
    the end line and end column for the class definition is (2, 8).
  • For end_col_offset I use the common Python convention for indexing, for example for pass the end_col_offset is 4 (not 3), so that [0:4] gives one the source code that corresponds to the node.
  • I added a helper function ast.get_source_segment(), to get source text segment corresponding to a given AST node. It is also useful for testing.

An (inevitable) downside of this PR is that AST now takes almost 25% more memory. I think however it is probably justified by the benefits.

https://bugs.python.org/issue33416

@gvanrossum
Copy link
Member

Never mind, that was SyntaxError. The AST uses 0-based column offsets.

Copy link
Contributor

@asottile asottile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very excited for this change

The UTF-8 offset is recorded because the parser uses UTF-8 internally.

Note that the end positions are not required by the compiler and are
therefore optional. The end offset if *after* the last symbol, for example
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
therefore optional. The end offset if *after* the last symbol, for example
therefore optional. The end offset is *after* the last symbol, for example

This is useful to "move code" to a different location in a file.
Increment the line number and end line number of each node in the tree
starting at *node* by *n*. This is useful to "move code" to a different
location in a file..
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
location in a file..
location in a file.

Copy link
Member

@gvanrossum gvanrossum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Phew! I don't have much to add, this all looks great. I don't think in these days 20% extra space to represent CST+AST for a single module is a big deal (though it would be nice to see what the difference is in absolute number of bytes for Lib/tkinter/init.py, which AFAIK is still the largest stdlib module).

to *new_node* if possible, and return *new_node*.
Copy source location (:attr:`lineno`, :attr:`col_offset`, :attr:`end_lineno`,
and :attr:`end_col_offset`) from *old_node* to *new_node* if possible,
and return *new_node*.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this only affect the node, or the whole tree? I've never used this so I don't know what to expect from context.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one (unlike some others) is non-recursive, one can even copy from a node of a different kind.

@@ -438,152 +446,188 @@ mod_ty _Py_Interactive(asdl_seq * body, PyArena *arena);
mod_ty _Py_Expression(expr_ty body, PyArena *arena);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few lines up, I wonder if it would be a good idea to add a comment explaining that these macros also change the definition of the functions (not just the calls).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment.

_fix(child, lineno, col_offset)
_fix(node, 1, 0)
_fix(child, lineno, col_offset, end_lineno, end_col_offset)
_fix(node, 1, 0, 1, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole function looks a bit suspicious.Shouldn't it at least ensure that (end_lineno, end_col_offset) > (lineno, col_offset)? (Again, I guess I don't know the use case.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is used to compile manually generated AST to bytecode. Although technically end positions are not needed for the compiler (only start positions are recorded in the byte code), it is probably good to have them fixed. The algorithm for fixing is quite naive (take either a position of nearest root-side node if known, otherwise use start of the file), but it never intended to be robust, it exist just to allow compilation in those case where a user doesn't care about line numbers.

Lib/ast.py Outdated

if end_lineno == lineno:
source_line = source.splitlines()[lineno]
return source_line.encode()[col_offset:end_col_offset].decode(coding)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I read this well, source is bytes, and the col_offsets are byte offsets? And a tab counts as one space? Is that explicit somewhere in the ast module docs? (This is easy to miss when writing code since most source code is ASCII only so the difference doesn't matter, but it would break whenever a token appears after a non-ASCII character in an identifier or string literal.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All you say is right, and yes it is already mentioned in the docs that they are UTF-8 byte offsets.

Lib/ast.py Outdated
last = lines[end_lineno].encode()[:end_col_offset].decode(coding)
middle_lines = lines[lineno+1:end_lineno]

return '\n'.join([first] + middle_lines + [last])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would avoid two list copies if you used .insert() and .append() to merge first and last into the middle_lines list.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

self.assertEqual(ast.get_source_segment(s_orig, cdef.body[0], padded=True),
s_method)


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I trust your tests. If they pass, they're good. :-)

@@ -0,0 +1 @@
Add end line and end column position information to the Python AST nodes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add that this is a backwards incompatible API change for anything that creates AST nodes? (It's fine if you just call ast.parse(), but not if you ever create a new node.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually most of the things still work:

  • All existing manual AST creation code will work, ast.Constant(value=42, lineno=1, col_offset=0) is still valid.
  • Compilation of existing manually generated ASTs will still work, the end position are int? in Python.asdl, so they are allowed to be absent and/or be None.

However, some C-level compatibility is broken. I am not sure what is the source of truth for "official" C API. I checked docs.python.org and none of the functions I updated is listed there.

I will try to re-formulate this.

Python/ast.c Outdated
@@ -2864,7 +2918,12 @@ ast_for_call(struct compiling *c, const node *n, expr_ty func,
}
}

return Call(func, args, keywords, func->lineno, func->col_offset, c->c_arena);
// Ideally call ends where closing parenthesis is placed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(When does a call not have a close par?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It (close par) was NULL for "fake" calls used in analyzing class definitions, but I just fixed this and removed the comment.

get_last_end_pos(asdl_seq *s, int *end_lineno, int *end_col_offset)
{
int tot = asdl_seq_LEN(s);
stmt_ty last = asdl_seq_GET(s, tot - 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need a check that tot > 0?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this just in case (to be future proof against any AST optimizers).

Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a part of comments.

@@ -162,6 +170,19 @@ and classes for traversing abstract syntax trees:
:class:`AsyncFunctionDef` is now supported.


.. function:: get_code_segment(source, node, *, padded=False, coding='utf-8')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think encoding is more common term in this context.

The UTF-8 offset is recorded because the parser uses UTF-8 internally.

Note that the end positions are not required by the compiler and are
therefore optional. The end offset if *after* the last symbol, for example
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "optional" mean? Is it equal to None or an absent attribute?

Copy link
Member Author

@ilevkivskyi ilevkivskyi Jan 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually both are allowed. An optional attribute in Python.asdl can either be absent or None.

Lib/ast.py Outdated
If *padded* is `True`, the first line of a multi-line statement will
be padded with spaces to match its original position.
"""
if not hasattr(node, 'lineno'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be simplified:

try:
    # get loneno, end_lineno, etc.
except AttributeError:
    return None

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Lib/ast.py Outdated

if end_lineno == lineno:
source_line = source.splitlines()[lineno]
return source_line.encode()[col_offset:end_col_offset].decode(coding)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think coding should not be used here. Always use UTF-8.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are totally right.

Lib/ast.py Outdated
lines = source.splitlines()

if padded:
padding = ' ' * len(lines[lineno].encode()[:col_offset].decode(coding))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to keep tabs and spaces (and maybe '\f'?) and replace all other characters with spaces. There may be similar code in C, look how the position of the caret is calculated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes. It looks like there are bunch of corner cases around \r\n\f\t. I tried to update the function (including your suggestions) so that it mimics closely what happens in C code.

@@ -62,14 +62,14 @@ def test_product(self):

def test_attributes(self):
stmt = self.types['stmt']
self.assertEqual(len(stmt.attributes), 2)
self.assertEqual(len(stmt.attributes), 4)
self.assertEqual(str(stmt.attributes[0]), 'Field(int, lineno)')
self.assertEqual(str(stmt.attributes[1]), 'Field(int, col_offset)')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add tests for stmt.attributes[2] and stmt.attributes[3]?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

int
PyNode_AddChild(node *n1, int type, char *str, int lineno, int col_offset)
PyNode_AddChild(node *n1, int type, char *str, int lineno, int col_offset,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is not this a part of the C API? Should not we add a new function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the "official" source of truth for C API? I didn't find this function on docs.python.org.

@@ -225,7 +229,9 @@ future_hack(parser_state *ps)

int
PyParser_AddToken(parser_state *ps, int type, char *str,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is not this a part of the C API?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

@ilevkivskyi
Copy link
Member Author

@asottile @gvanrossum @serhiy-storchaka Thank you for reviewing! I think I addressed all your comments.

@@ -213,6 +226,77 @@ def get_docstring(node, clean=True):
return text


def _splitlines_no_ff(source):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is to use .splitlines(keepends=True) (which considers \f as an endline) and then re-join those that aren't ending in \r or \n, I am not sure which approach is better.

@ilevkivskyi
Copy link
Member Author

@gvanrossum I just tried to compare the sizes of Lib/tkinter/__init__.py AST before and after this PR. Here are the numbers: before 5.8 Mbytes, after 7.2Mbytes (24% increase).

@gvanrossum
Copy link
Member

@gvanrossum I just tried to compare the sizes of Lib/tkinter/__init__.py AST before and after this PR. Here are the numbers: before 5.8 Mbytes, after 7.2Mbytes (24% increase).

How did you measure this? IIUC there's also a size increase for the CST, and during translation from CST to AST both are in memory. Is it possible to measure the peak memory usage during this translation?

@ilevkivskyi
Copy link
Member Author

I just looked at the difference in allocated memory reported in sys._debugmallocstats(). I just tried another way using tracemalloc. It reports that the line in question:

compile(source, filename, mode, PyCF_ONLY_AST)

allocates 3.7 Mbytes before and 4.6 Mbytes after this PR (still the same 24% relative increase). Also this number (24%) is quite reasonable taking into account that all nodes in CST have exactly the same size: 40 bytes before, 48 bytes after (at least on my 64-bit Linux).

@gvanrossum
Copy link
Member

I do think that's a very reasonable increase (we're not talking MicroPython here :-).

Also this number (24%) is quite reasonable taking into account that all nodes in CST have exactly the same size: 40 bytes before, 48 bytes after (at least on my 64-bit Linux).

That's sizeof what exactly? While the node types indeed are big unions, there are several different ones: struct _stmt, struct _expr and some minor ones (all defined in Include/Python-ast.h).

(I tried this myself on my Mac, and it looks like sizeof(struct _stmt) is 72 bytes, and sizeof(struct _expr) is 48 bytes. An int is 4 bytes.)

@ilevkivskyi
Copy link
Member Author

That's sizeof what exactly?

This is a size of _node. There is only one "main" struct in CST struct _node from node.h.

Copy link
Member

@gvanrossum gvanrossum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this! Let's land this. I combed through the code (both the tedious parts and the interesting parts) and it all looks good to me. I am not worried about the 25% increase in CST and AST tree sizes.

@ilevkivskyi
Copy link
Member Author

@gvanrossum
Great, thanks! I will merge this now, so that we can move with the typed_ast PR, but there is one question that Serhiy raised and I would like to double check: does the fact that PyNode_AddChild and PyParser_AddToken are not documented on docs.python.org mean they are not part of C-API? Also I just checked the files where they are declared (node.h and parser.h) are not included in Python.h.

@ilevkivskyi ilevkivskyi merged commit 9932a22 into python:master Jan 22, 2019
@ilevkivskyi ilevkivskyi deleted the add-end-line-col branch January 22, 2019 11:18
@gvanrossum
Copy link
Member

does the fact that PyNode_AddChild and PyParser_AddToken are not documented on docs.python.org mean they are not part of C-API?

I'm not sure. Their names don't have _ prefixes so I think they live in some nebulous area. It would be good to bring this up in another forum where more core devs can think about the issue (some of whome have probably thought about it more).

In the worst case scenario, if it's deemed an unacceptable backwards incompatibility, we can always add new functions that add end_lineno and end_col_offset arguments, and make the old ones call the new ones with some kind of default values computed. But I would only resort to this if during the alpha/beta release we get actual complaints.

@ilevkivskyi
Copy link
Member Author

OK, I will post to Python-Dev about this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants