Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discard not supported while parsing with LALR #1084

Closed
rlaager opened this issue Jan 19, 2022 · 5 comments
Closed

Discard not supported while parsing with LALR #1084

rlaager opened this issue Jan 19, 2022 · 5 comments
Labels
add to docs TODO: Include this information in the docs

Comments

@rlaager
Copy link

rlaager commented Jan 19, 2022

Describe the bug

Discard on a transformer (when passed to the parser) does not work with the LALR parser.

To Reproduce

class MyTransformer(Transformer):
    def IGNORE_TOKEN(self, token):
        return Discard

# Transforming after parsing works:
parser = Lark(grammar, parser='lalr')
tree = MyTransformer().transform(parser.parse(input))

# Transforming while parsing does not work:
parser = Lark(grammar, parser='lalr', transformer=MyTransformer())
tree = parser.parse(input)

Here "does not work" means that the Discard singleton literally shows up in the output tree.

This is because parsers/lalr_parser.py has no code to support Discard.

I think the fix will be something like the following. I don't think this is quite right, and I don't know enough about the code to go much further:

--- parsers/lalr_parser.py	2022-01-18 16:25:34.129682004 -0600
+++ parsers/lalr_parser.py.new	2022-01-18 23:17:57.487042461 -0600
@@ -6,6 +6,7 @@
 from ..exceptions import UnexpectedInput, UnexpectedToken
 from ..lexer import Token
 from ..utils import Serialize
+from ..visitors import Discard
 
 from .lalr_analysis import LALR_Analyzer, Shift, Reduce, IntParseTable
 from .lalr_interactive_parser import InteractiveParser
@@ -134,7 +135,10 @@
                 # shift once and return
                 assert not is_end
                 state_stack.append(arg)
-                value_stack.append(token if token.type not in callbacks else callbacks[token.type](token))
+                if token.type in callbacks:
+                    token = callbacks[token.type](token)
+                if token is not Discard:
+                    value_stack.append(token)
                 return
             else:
                 # reduce+shift as many times as necessary
@@ -152,7 +156,8 @@
                 _action, new_state = states[state_stack[-1]][rule.origin.name]
                 assert _action is Shift
                 state_stack.append(new_state)
-                value_stack.append(value)
+                if value is not Discard:
+                    value_stack.append(value)
 
                 if is_end and state_stack[-1] == end_state:
                     return value_stack[-1]
@erezsh
Copy link
Member

erezsh commented Jan 19, 2022

Unfortunately isn't not that simple. value_stack has to stay aligned with state_stack for the algorithm to work.

If we were to add code to remove Discard, it would probably go in parse_tree_builder. That's where we currently skip tokens that start with _. However, that is information that's available at "compile_time", which means it can be fairly optimized.

In the case of Discard, I don't see an obvious solution, other than filtering it out of the created list:

        filtered = [x for x in filtered if x is not Discard]   # <-- Need to add this
        return self.node_builder(filtered)

But that requires an additional iteration over every children list, which might have a noticeable effect on performance. Which, I'm not sure is worth it, considering this issue has only been brought up twice in 4 years.

And of course, there's a simple "solution", which is to run Transformer().transform(tree) on whatever Lark returns, which will give you the correct result, and the performance cost will be very similar to the change I proposed above.

@erezsh
Copy link
Member

erezsh commented Jan 19, 2022

P.S. it can also work if we change lalr_parser:151 to

                value = callbacks[rule]([x for x in s if x is not Discard])

But that's basically equivalent.

@erezsh
Copy link
Member

erezsh commented Jan 19, 2022

There is a slightly more efficient solution, which is to keep track of all Discarded symbols in each state (in lockstep), and then subtract that from size when we handle the value_stack:

                 if value is Discard:
                    discarded_count[state] += 1
                 else:
                    value_stack.append(value)

And later:

                    s = value_stack[-(size - discarded_count[state]):]

It should work, and probably won't add a lot of runtime cost. Just a lot of complication for something so small. So is it worth it?

@rlaager
Copy link
Author

rlaager commented Jan 20, 2022

On principle, it seems like it should either work or it should be a documented caveat that passing a transformer to the parser doesn't honor Discard. You might want to mention that in the docs for both the parse() transformer argument and Discard, but definitely the latter.

As a practical concern, I suppose I could go either way. I wasn't aware that tokens starting with _ were dropped. That's a possible answer for us, since we haven't yet run into a situation where we need to discard based on conditional logic.

That said, we were operating on an assumption that discarding parts of the tree simply was not possible. As soon as we tried to do a discard, we found it didn't work. So we haven't really had a chance to audit the Transformer to see how much we could simplify based on the idea that Discard is a thing.

@erezsh
Copy link
Member

erezsh commented Jan 20, 2022

it should either work or it should be a documented caveat

That's a valid point. I'll add it to the docs.

we haven't yet run into a situation where we need to discard based on conditional logic

Let me know if that changes!

@erezsh erezsh added the add to docs TODO: Include this information in the docs label Jan 20, 2022
@erezsh erezsh closed this as completed Mar 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
add to docs TODO: Include this information in the docs
Projects
None yet
Development

No branches or pull requests

2 participants