Discard not supported while parsing with LALR #1084

rlaager · 2022-01-19T05:19:30Z

Describe the bug

Discard on a transformer (when passed to the parser) does not work with the LALR parser.

To Reproduce

class MyTransformer(Transformer):
    def IGNORE_TOKEN(self, token):
        return Discard

# Transforming after parsing works:
parser = Lark(grammar, parser='lalr')
tree = MyTransformer().transform(parser.parse(input))

# Transforming while parsing does not work:
parser = Lark(grammar, parser='lalr', transformer=MyTransformer())
tree = parser.parse(input)

Here "does not work" means that the Discard singleton literally shows up in the output tree.

This is because parsers/lalr_parser.py has no code to support Discard.

I think the fix will be something like the following. I don't think this is quite right, and I don't know enough about the code to go much further:

--- parsers/lalr_parser.py	2022-01-18 16:25:34.129682004 -0600
+++ parsers/lalr_parser.py.new	2022-01-18 23:17:57.487042461 -0600
@@ -6,6 +6,7 @@
 from ..exceptions import UnexpectedInput, UnexpectedToken
 from ..lexer import Token
 from ..utils import Serialize
+from ..visitors import Discard
 
 from .lalr_analysis import LALR_Analyzer, Shift, Reduce, IntParseTable
 from .lalr_interactive_parser import InteractiveParser
@@ -134,7 +135,10 @@
                 # shift once and return
                 assert not is_end
                 state_stack.append(arg)
-                value_stack.append(token if token.type not in callbacks else callbacks[token.type](token))
+                if token.type in callbacks:
+                    token = callbacks[token.type](token)
+                if token is not Discard:
+                    value_stack.append(token)
                 return
             else:
                 # reduce+shift as many times as necessary
@@ -152,7 +156,8 @@
                 _action, new_state = states[state_stack[-1]][rule.origin.name]
                 assert _action is Shift
                 state_stack.append(new_state)
-                value_stack.append(value)
+                if value is not Discard:
+                    value_stack.append(value)
 
                 if is_end and state_stack[-1] == end_state:
                     return value_stack[-1]

The text was updated successfully, but these errors were encountered:

erezsh · 2022-01-19T07:44:07Z

Unfortunately isn't not that simple. value_stack has to stay aligned with state_stack for the algorithm to work.

If we were to add code to remove Discard, it would probably go in parse_tree_builder. That's where we currently skip tokens that start with _. However, that is information that's available at "compile_time", which means it can be fairly optimized.

In the case of Discard, I don't see an obvious solution, other than filtering it out of the created list:

        filtered = [x for x in filtered if x is not Discard]   # <-- Need to add this
        return self.node_builder(filtered)

But that requires an additional iteration over every children list, which might have a noticeable effect on performance. Which, I'm not sure is worth it, considering this issue has only been brought up twice in 4 years.

And of course, there's a simple "solution", which is to run Transformer().transform(tree) on whatever Lark returns, which will give you the correct result, and the performance cost will be very similar to the change I proposed above.

erezsh · 2022-01-19T07:56:44Z

P.S. it can also work if we change lalr_parser:151 to

                value = callbacks[rule]([x for x in s if x is not Discard])

But that's basically equivalent.

erezsh · 2022-01-19T08:04:45Z

There is a slightly more efficient solution, which is to keep track of all Discarded symbols in each state (in lockstep), and then subtract that from size when we handle the value_stack:

                 if value is Discard:
                    discarded_count[state] += 1
                 else:
                    value_stack.append(value)

And later:

                    s = value_stack[-(size - discarded_count[state]):]

It should work, and probably won't add a lot of runtime cost. Just a lot of complication for something so small. So is it worth it?

rlaager · 2022-01-20T02:09:32Z

On principle, it seems like it should either work or it should be a documented caveat that passing a transformer to the parser doesn't honor Discard. You might want to mention that in the docs for both the parse() transformer argument and Discard, but definitely the latter.

As a practical concern, I suppose I could go either way. I wasn't aware that tokens starting with _ were dropped. That's a possible answer for us, since we haven't yet run into a situation where we need to discard based on conditional logic.

That said, we were operating on an assumption that discarding parts of the tree simply was not possible. As soon as we tried to do a discard, we found it didn't work. So we haven't really had a chance to audit the Transformer to see how much we could simplify based on the idea that Discard is a thing.

erezsh · 2022-01-20T08:03:26Z

it should either work or it should be a documented caveat

That's a valid point. I'll add it to the docs.

we haven't yet run into a situation where we need to discard based on conditional logic

Let me know if that changes!

erezsh added the add to docs TODO: Include this information in the docs label Jan 20, 2022

erezsh added a commit that referenced this issue Mar 11, 2022

Docs: Updated docs for Discard (Issue #1084)

a38e481

erezsh closed this as completed Mar 11, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discard not supported while parsing with LALR #1084

Discard not supported while parsing with LALR #1084

rlaager commented Jan 19, 2022

erezsh commented Jan 19, 2022 •

edited

Loading

erezsh commented Jan 19, 2022

erezsh commented Jan 19, 2022 •

edited

Loading

rlaager commented Jan 20, 2022

erezsh commented Jan 20, 2022

Discard not supported while parsing with LALR #1084

Discard not supported while parsing with LALR #1084

Comments

rlaager commented Jan 19, 2022

erezsh commented Jan 19, 2022 • edited Loading

erezsh commented Jan 19, 2022

erezsh commented Jan 19, 2022 • edited Loading

rlaager commented Jan 20, 2022

erezsh commented Jan 20, 2022

erezsh commented Jan 19, 2022 •

edited

Loading

erezsh commented Jan 19, 2022 •

edited

Loading