Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generic tree typing #967

Merged
merged 11 commits into from
Jan 21, 2022
Merged

Generic tree typing #967

merged 11 commits into from
Jan 21, 2022

Conversation

plannigan
Copy link
Contributor

  • Updates the type information for Tree so that it is generic based on the type of the child leaf.
  • Adjusts the various classes that operate on trees to also understand the type of trees they can operate on.
  • Add ParseTree alias so that it is easy to refer to tree produced by the lark parser.

The changes don't introduce any new mypy errors, but there is one case where more type information demonstrated an existing inconsistency.

Addresses #966

Copy link
Member

@MegaIng MegaIng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice :-)

@@ -109,17 +112,18 @@ def iter_subtrees(self) -> 'Iterator[Tree]':
subtrees = OrderedDict()
for subtree in queue:
subtrees[id(subtree)] = subtree
queue += [c for c in reversed(subtree.children)
# Reason for type ignore https://github.com/python/mypy/issues/10999
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the current type system it isn't possible to convey that "_Leaf_T can be anything except a Tree". So this will likely always be a thing mypy won't be able to figure out on its own. The referenced issue has more context and might provide some insights on how to better handle it.

lark/visitors.py Outdated Show resolved Hide resolved
Copy link
Member

@erezsh erezsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the creative PR!

I have a few questions, since I'm not well versed in Python's type annotations.

I'm also, in general, a little concerned about the added complication. I can't figure out if it's correct just from a cursory glance. But perhaps that would improve as I get used to it.

lark/tree.py Outdated Show resolved Hide resolved
lark/tree.py Show resolved Hide resolved
lark/visitors.py Outdated Show resolved Hide resolved
@@ -591,7 +592,7 @@ def parse_interactive(self, text: Optional[str]=None, start: Optional[str]=None)
"""
return self.parser.parse_interactive(text, start=start)

def parse(self, text: str, start: Optional[str]=None, on_error: 'Optional[Callable[[UnexpectedInput], bool]]'=None) -> Tree:
def parse(self, text: str, start: Optional[str]=None, on_error: 'Optional[Callable[[UnexpectedInput], bool]]'=None) -> 'ParseTree':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you change this to 'Tree[Token]', there is no need for a ParseTree variable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you talking about it not being needed in this instance or not being needed in general?

  • For this specific use case, yes. That would not require the added import of ParseTree when type checking.
  • In general, ParseTree isn't strictly required at all in general. However, I think it is a nice ease of use to a have a specific name for trees returned by parse() since it is a core type that users of the library will be interacting with a lot.

lark/visitors.py Outdated Show resolved Hide resolved
@plannigan plannigan requested a review from erezsh August 27, 2021 15:15

def __init__(self, *transformers: Transformer[_T]) -> None:
def __init__(self, *transformers: 'Union[Transformer, TransformerChain]') -> None:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the initialize taking a N transformers, it is not possible to use type annotations to ensure that the T -> U -> V sequence is correct. This could be addressed by changing the class to only ever be aware of 2 Transformers. That is the only way it is called in the library, but users could be building chains directly.

Since there is already a lot going on in this PR. I'll hold off from also making any changes to the number of arguments accepted by this function.

@erezsh erezsh marked this pull request as ready for review December 14, 2021 14:14
@erezsh
Copy link
Member

erezsh commented Dec 14, 2021

Sorry it took me so long to review this. Overall it looks fine.

A few notes:

  • The phrase Union[_Leaf_T, Tree[_Leaf_T]] repeats itself quite a lot, making it hard to read. Why not set
Branch = Union[_Leaf_T, 'Tree[_Leaf_T]']

and use that throughout the code? (preliminary tests show that it should work).

Copy link
Member

@erezsh erezsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional notes

lark/visitors.py Outdated Show resolved Hide resolved
lark/visitors.py Outdated
def __mul__(self, other: 'Transformer[_T]') -> 'TransformerChain[_T]':
def __mul__(
self: 'Transformer[Tree[_Leaf_U], _Leaf_T]',
other: 'Union[Transformer[_Return_V, _Leaf_U], TransformerChain[_Return_V, _Leaf_U]]'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like everywhere that TransformerChain is used as a type, so is Transformer. So maybe it makes sense to just use AbsTransformer, or ITransformer instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is true. Transformer and TransformerChain have the same public interface. Protocol could be used to define that without requiring a base class. (It would just require typing-extensions for less than Python 3.8, but that's not a big issue.) I can work on that next.

while q:
t = q.pop()
rev_postfix.append(t)
if isinstance(t, Tree):
q += t.children

# Postfix to tree
stack = []
stack: List = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this addition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, so this is a fun one:

  • There isn't enough type information based on the stack.append() calls for mypy to understand what should be allowed to be in the list. So it asks for the initialization of the variable to be annotated.
  • However, the list can contain anything based on the implementations of the user functions (See comment). So that means that the type is List[Any].
  • But the generic Any is redundant and can be removed.

So we are left we are left with an annotation that seems to be redundant to humans, but is there to explicitly tell mypy how the list is intended to be used.

Branch must be used as a generic so that mypy understands that _Leaf_T is the same type associated with the Tree

def __init__(self, data: str, children: 'List[Union[_Leaf_T, Tree[_Leaf_T]]]', meta: Optional[Meta]=None) -> None:
def __init__(self, data: str, children: 'List[Branch[_Leaf_T]]', meta: Optional[Meta]=None) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if it's obvious, but why do we need to write Branch[_Leaf_T] and not just Branch, since it's already defined using _Leaf_T?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a problem, it is a subtle point. At the top level _Leaf_T stand for some type, but it can still be anything. Once a variable containing a tree is defined as Tree[Foo], then that tree's _Leaf_T is bound to Foo. Since Branch is defined at the top level, it can still be any type. Specifically using Branch[_Leaf_T] within the context of a Tree, means that the type has to match the bound _Leaf_T (in our example Foo).

lark/visitors.py Outdated
from typing_extensions import Protocol

_Return_T = TypeVar('_Return_T', covariant=True)
_Return_V = TypeVar('_Return_V', covariant=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick question: Why is it Leaf_U but Return_V? Do the U and V stand for something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The signature for the multiplication operator has a lot of moving pieces. It starts with a ITransformer[_Leaf_T, Tree[_Leaf_U]] instance (a transformer that converts a Tree[_Leaf_T] to Tree[_Leaf_U]). Then a ITransformer[_Leaf_U, _Return_V] instance (a transformer that converts a Tree[_Leaf_U] to _Return_V). Which results in a ITransformer[_Leaf_T, _Return_V] (a transformer that converts a Tree[_Leaf_T] to _Return_V).

There isn't a technical reason why it needs to be named with _Return_V instead of _Return_U. However, I was thinking that using _Return_V would make it clearer that the final type isn't necessarily related to the intermediary _Leaf_U. I'm open to switching to _Leaf_U or putting some of my above description into a comment in the source code if you think the current state should be changed.

Copy link
Contributor

@chanicpanic chanicpanic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In #926 we chose not to add typing_extensions as a dependency. A workaround would be to put the import of Protocol and the definition of ITransformer within an if TYPE_CHECKING.

@plannigan
Copy link
Contributor Author

Ok, I didn't have the context of that previous PR when I implemented this change. The one catch that makes this different is that ITransformer would only exist within lark and is exposed as part of the public API (argument to __mul__()). So if a user wanted to annotate a variable as ITransformer[Foo, Bar], they would have to do:

from typing import TYPE_CHECKING

if TYPE_CHECKING:
  from lark.visitors import ITransformer

class Foo:
    pass

class Bar:
    pass

my_tansformer: 'ITransformer[Foo, Bar]'

That being said, ITransformer is a protocol, so doing that is not a requirement.

This leaves us with two options:

  • Use the current Protocol implementation with a TYPE_CHECKING guard. And document that, while variables can be annotated with ITransformer using the example above, the recommendation is to not do that (relying on the functionality of Protocol).
  • Switch back to the previous Union implementation and avoid this potentially confusing corner case at the cost of slightly less readable type annotations in a few places.

I'm open to either option, but have a slight preference to using Union as it doesn't introduce the odd corner case.

@erezsh
Copy link
Member

erezsh commented Jan 2, 2022

Is there a way to get the same effect without Protocol? Maybe with an ABC somehow?

Anyway, I agree, it doesn't seem worth it to add a dependency just to support ITransformer for 3.6 and 3.7

@plannigan
Copy link
Contributor Author

ABC could also work, but would require a inheritance of the base class, something Transformer and TransformerChain don't currently have. I'll revert back to the Union implementation.

@plannigan plannigan requested a review from erezsh January 2, 2022 14:10
@plannigan
Copy link
Contributor Author

Reminder from an earlier conversation:

I was about to add a new comment about that. I agree that it reads better at Transformer[_Leaf_T, _Return_T] . I initially added the leaf type at the end to minimize that changes. However, going from 1 to 2 generic parameters is a breaking change anyways, so moving leaf to be the first type passed in isn't a huge issue.

Note: These typing changes are only breaking changes for for code already taking advantage of type annotations and the generic types.

Since the PR didn't make it into the 1.0 release, that is something that should be kept in mind. If the choice is made to go the 2.0 route, I remember seeing a couple of places where minor API changes could resolve a few of the mypy errors that are currently reported on the library.

@erezsh
Copy link
Member

erezsh commented Jan 3, 2022

If the choice is made to go the 2.0 route

You mean Lark 2.0? That's not going to happen for a while.

minor API changes could resolve a few of the mypy errors

Like what?

@plannigan
Copy link
Contributor Author

I understand the desire to not have a 2.0 so shortly after 1.0 was released. But it is unfortunate that this PR did not get merged in time when breaking changes were less of a concern.

I think I should be able to resolve most, maybe all, of the existing smaller mypy issue with smaller API-compatible PRs., But two minor API changes I'd recommend:

1 - Have TransformerChain take only two Transformers in __init__() instead of N. This would allow for catching:

  • An attempt to chain a Transformer on after a Transformer that produces a non-Tree value
  • The first Transformer produces Tree[Foo], while the second Transformer expects a Tree[Bar].
  • Would disallow things like TransformerChain() or TransformerChain(my_transformer) as these don't provide much value.
    This wouldn't break any code in the code base. However, it could cause break users that explicitly create TransformerChains instead of using multiplication operator AND passed in a less or more than 2 arguments.)

2 - Replace Tree.expand_kids_by_index() with a similar method that accepts a predicate (like find_pred() & scan_values()) instead of an arbitrary number of indexes. The current API expects the given indexes to correspond to Tree instances in the list of children. However, this allows for:

  • indexes that aren't Tree instances (AttributeError)
  • invalid indexes (IndexError)

The locations in the code base that use this method use it correctly by first doing a list comprehension with an if clause that ensures the index points to a Tree instance and satisfies the required criteria. Using a predicate instead would:

  • Ensure that non-Tree instances are skipped
  • Prevent indexing outside of the list.
  • Remove an extra iteration currently being done to produce the indexes to expand.
  • Simplify the calling code by only requiring a predicate of Callable[[Tree],bool].

A larger API change I'd recommend to be included in a theoretical future 2.0 would be to remove the use of custom double underscore names (ex. __default__()) with in the library as PEP 8 says:

__double_leading_and_trailing_underscore__: "magic" objects or attributes that live in user-controlled namespaces. E.g. __init__, __import__ or __file__. Never invent such names; only use them as documented.

@erezsh
Copy link
Member

erezsh commented Jan 4, 2022

would be to remove the use of custom double underscore names (ex. default())

What would you name it instead? (the goal is to avoid name collision with any user-defined rule)

@plannigan
Copy link
Contributor Author

Since "rules whose name begins with an underscore will be inlined into their containing rule", a parse tree won't have any with a single leading underscore. So _default() could work. _default_tree() would more closely match _default_token() and be less generic. It is a bit more verbose, but _tree_func_not_found() is more explicit.

I actually ran into an issue with a rule name collision with a Python keyword. I'll open a new issue for that to keep this discussion on topic, but the solution I had in mind would also help resolve the issue with rule names colliding with lark "internal" names.

@erezsh
Copy link
Member

erezsh commented Jan 4, 2022

The main issue with something like _default() is that single underscores signify "private members" in Python, so I think many users will find it confusing, that it is accessed from the outside.

(and __ means private + protection, but it's rarely used in practice).

In contrast, magic_method syntax is the common way to facilitate interaction with the runtime, like eq, name, and iter. Also other libraries, like rich, use this for similar purposes: https://rich.readthedocs.io/en/stable/protocol.html

Copy link
Member

@erezsh erezsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good enough for now.
Any improvements to it can be added in separate PRs.

@erezsh
Copy link
Member

erezsh commented Jan 20, 2022

@MegaIng @chanicpanic Anything to add before I merge it?

@erezsh erezsh merged commit ee27eb4 into lark-parser:1.0b Jan 21, 2022
@erezsh
Copy link
Member

erezsh commented Jan 21, 2022

Thanks @plannigan ! And sorry for being such a stickler :)

Oops: Forgot it was rebased on top of 1.0b . I'll get it into master.

Edit2: Done.

@plannigan plannigan deleted the generic_tree_typing branch January 23, 2022 21:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants