Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #696 now providing the correct amount of placeholders #922

Merged
merged 4 commits into from
Sep 13, 2021

Conversation

ornariece
Copy link
Contributor

p = Lark("""!start: ["a" "b" "c"] """, maybe_placeholders=True)
p.parse("").children

now returns [None, None, None] instead of [None]

same for !start: ["a" ["b" "c"]].

Copy link
Member

@MegaIng MegaIng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a lot shorter of a PR than I expected. Good job!

@@ -231,12 +231,11 @@ def will_not_get_removed(sym):
return not sym.name.startswith('_')
if isinstance(sym, Terminal):
return keep_all_tokens or not sym.filter_out
assert False
if isinstance(sym, Symbol):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be needed. There are only two Symbols subclasses: NonTerminal and Terminal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now _EMPTY is a Symbol, and isn't a terminal or nonterminal.

But it probably makes more sense to test is _EMPTY.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea testing for is _EMPTY would make more sense. i went with isinstance(sym, Symbol) because Symbol is only ever called directly when defining _EMPTY, but it's less clear and less precise than testing for is _EMPTY.

@@ -231,12 +231,11 @@ def will_not_get_removed(sym):
return not sym.name.startswith('_')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to think about changing this behavior. But I am not sure if there is a better solution...

@erezsh
Copy link
Member

erezsh commented Jun 27, 2021

@MegaIng I'm surprised you aren't concerned with breaking backwards-compatibility. I think I'll have to add a new value for this behavior. Maybe maybe_placeholders='per_symbol'. Open to suggestions.

@MegaIng
Copy link
Member

MegaIng commented Jun 27, 2021

@erezsh True. I didn't consider it. This is obviously breaking. We can add a new option, or push it back till 1.0 . (on the other hand, I find the current system so confusing, we might as well break in the next 0.x.1 release.)

@ThatXliner
Copy link
Contributor

@MegaIng I'm surprised you aren't concerned with breaking backwards-compatibility. I think I'll have to add a new value for this behavior. Maybe maybe_placeholders='per_symbol'. Open to suggestions.

Yes, definitely. I think ["thing" "thing" "thing"] should return [None]. It makes the most sense to me and I have code that needs this behavior.

@erezsh
Copy link
Member

erezsh commented Jun 28, 2021

It makes the most sense to me

@ThatXliner Can you explain your reasoning?

The issue is that a match and a mismatch may provide a different amount of arguments.

@ThatXliner
Copy link
Contributor

It makes the most sense to me

@ThatXliner Can you explain your reasoning?

Well the docs said that [...] should act like (...)? but return None on no match, right?

The issue is that a match and a mismatch may provide a different amount of arguments.

Yeah I saw. You originally thought of it being the right behavior, right?

The point is, I agree that we should change it to maybe_placeholders="per_symbol" or something to keep backwards incompatibility.

@MegaIng
Copy link
Member

MegaIng commented Jun 28, 2021

@ThatXliner Our current plan is to add this change with a few other in a very breaking version 1.0 .

@ThatXliner
Copy link
Contributor

On the issue of changing behavior, you might want to replace the maybe_placeholders argument to something different entirely for 1.x. Maybe change the name to placeholder_behavior and the accepted values to an enum.

@ThatXliner
Copy link
Contributor

ThatXliner commented Jun 28, 2021

@ThatXliner Our current plan is to add this change with a few other in a very breaking version 1.0 .

Oh ok. In that case, you might want to update the docs, as well.

@erezsh
Copy link
Member

erezsh commented Jun 28, 2021

On the issue of changing behavior, you might want to replace the maybe_placeholders argument to something different entirely for 1.x. Maybe change the name to placeholder_behavior and the accepted values to an enum.

That's a great idea! thanks for the suggestion.

@erezsh
Copy link
Member

erezsh commented Jun 28, 2021

P.S. the question is, should we even support the old behavior in 1.x, or just keep the new one?

@ThatXliner
Copy link
Contributor

Ok, so I here's an example of using the old behavior for conciseness: https://github.com/ThatXliner/tylaireum/blob/4ae6772a0660cb75c307edc7d2bdff94667b8cdc/tylaireum/grammar.lark#L59

If I wanted to be compatible with the new behavior, I'd need to make another rule. (the ...s rules like fdec_args vs fdec_arg is to please ast_utils. Not relevant)

@erezsh
Copy link
Member

erezsh commented Jun 28, 2021

@ThatXliner Maybe you misunderstood. [IDENTIFIER "="] will still be one argument when matched, because "=" gets discarded.

@ThatXliner
Copy link
Contributor

ThatXliner commented Jun 28, 2021

@ThatXliner Maybe you misunderstood. [IDENTIFIER "="] will still be one argument when matched, because "=" gets discarded.

Oh right. Do terminals starting with _ get discarded, too? If so, then I think changing the behavior entirely would be a plus. (I would need to modify my grammar a bit but I can't think of a situation where the new behavior would be a downside)

@ornariece
Copy link
Contributor Author

ornariece commented Sep 3, 2021

i've encountered a usecase where my fix doesn't suffice.

p = Lark("""!start: ["a"] | ["b"] ["c"] """, maybe_placeholders=True)
p.parse("").children

should return [None, None]. currently it returns [None], which is wrong (following the same logic as before).

i've taken a look at what's causing this, and from what i can tell this looks like the fix is gonna be much more complex than my previous one. i haven't had the time to dive into it, and it's likely i'm gonna have some trouble with it.
help much appreciated :)

@MegaIng
Copy link
Member

MegaIng commented Sep 3, 2021

should return [None, None].

I disagree. While 2 children is the maximal account that can occur, it can also return just one (just "a"), so it is ambiguous. That is just a badly designed rule. (there are a total of 3 ways how the empty string could be parse).

This is also especially none-bad since a simple rewrite of the rule to ["a" | ["b"] ["c"]] produces the 'correct' (according to you) result. And even that rule can be made cleaner by dropping the outer []: "a" | ["b"] ["c"] is also equivalent and produces the same "correct" result.

@ornariece
Copy link
Contributor Author

(i edited my previous comment, the outer [] was an unwanted leftover from a copy-paste)
i somehow missed the obvious, you're right "a" | ["b"] ["c"] is equivalent and the proper way to declare this grammar.
that said, i'd still argue ["a"] | ["b"] ["c"] should output 2 children in any case. but as i said, the "fix" is probably too complex for a gain that is negligible since there is a better and functional way to write such a grammar.

@erezsh
Copy link
Member

erezsh commented Sep 3, 2021

@ornariece I'm inclined to rebase this PR over v1.0 instead, and then merge it. Any objections?

If you want it to get to the current master, we can try the maybe_placeholders='per_symbol' route. But I don't think it will be much longer before v1.0 is in master anyway.

@ornariece
Copy link
Contributor Author

@ornariece I'm inclined to rebase this PR over v1.0 instead, and then merge it. Any objections?

fine by me!

@ornariece
Copy link
Contributor Author

another example where the current fix doesn't suffice, this time the grammar should be minimal:
"a" ["b"] | ["a"] "b" "c" should output three children in any case, but it outputs two children when a or ab are parsed.

one workaround would be to add redundancy to the rule:
"a" ["b" ["c"]] | ["a"] "b" "c"
but that is suboptimal.
now is working on a fix worth it? i'm not sure

@erezsh
Copy link
Member

erezsh commented Sep 4, 2021

@ornariece I would say that's the intended behavior.

In a rule like some_rule: a b | a b c, the result can be either two children or three. And that's fine, it's the responsibility of the grammar author to make sure they are balanced, if they so wish.

Perhaps one could argue that this balancing should happen automatically by adding Nones or some other way. But that's a whole separate issue in Lark, and not related to this specific fix.

@ornariece
Copy link
Contributor Author

hmmm true, i didn't think of it as a separate behavior.

one could argue some_rule: a b | a b c should output 3 children in any case (by providing a None) when maybe_placeholders=True, but you're right, let's stick to this specific fix here

@erezsh
Copy link
Member

erezsh commented Sep 4, 2021

one could argue some_rule: a b | a b c should output 3 children in any case

The feel free to make the argument, because I think it's not too hard to implement.

@ornariece
Copy link
Contributor Author

well the thing is you made it so that only items enclosed in [] are to generate Nones if there is no match.
i'm not sure why you went with that restriction, but by that logic a b | a b c which is equivalent to a b c? shouldn't generate a None either.
so really the argument is about whether or not maybe_placeholders=True's role should be to generate a fixed amount of children in any case, or not

@MegaIng
Copy link
Member

MegaIng commented Sep 5, 2021

If you want a b | a b c to produce a, b, None as children, make that explicit. We can provided a user facing ability to add padding, but I would not make any assumptions by default, even when maybe_placeholders=True, or even if we added another option. What about a c | a b c?

@ornariece
Copy link
Contributor Author

the only assumption i'm expecting is "if the user set maybe_placeholders to True, then he wants a fixed amount of children, so let's take the maximum amount, of children there can be, and fill the list of matched items with Nones until we reach that amount".
meaning a c | a b c would produce a, c, None or a, b, c, and i would say that's fine, since it's the user's responsability to care for the order in which he wants the children to appear in the transformer method

@ornariece
Copy link
Contributor Author

ornariece commented Sep 5, 2021

ok no this is not the only assumption, since you'd still want a [b] c to produce a, None, c and not a, c, None

@MegaIng
Copy link
Member

MegaIng commented Sep 5, 2021

If is the user responsibility, make them express it in the grammar. Also, what about a+ | b c d e? We can warn about mismatching number of children (Maybe even error with some kind of strict option); and we an provided a padding symbol (Maybe _?) that will produce a None, but I wouldn't make assumptions about how many children they want. Also, what about a b? | c? We currently have ? to explicitly not provided None.

@ornariece
Copy link
Contributor Author

If is the user responsibility, make them express it in the grammar

i agree with that

what about a+ | b c d e?

good point. the same issue arises for [a+] though, which currently would produce a single None i think?

@erezsh erezsh merged commit a12effe into lark-parser:master Sep 13, 2021
@erezsh
Copy link
Member

erezsh commented Sep 13, 2021

Thanks, and sorry it took so long!

@ornariece
Copy link
Contributor Author

@erezsh @MegaIng currently ["a" _some_rule] would output a single None. i'd be inclined to modify that behavior so that it outputs 2 Nones

@MegaIng
Copy link
Member

MegaIng commented Sep 16, 2021

@ornariece No. What if _some_rule outputs more content? We can maybe add a check so that whatever _some_rule outputs get's added (If no recursion is detected)

@ornariece
Copy link
Contributor Author

What if _some_rule outputs more content?

well same issue as with "a"+, which currently outputs a single None
but yea using the output is probably better, although maybe a bit confusing for the user?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants