Provide custom LOOKAHEAD on LALR grammar #1247

Open
fabioz opened this issue Feb 7, 2023 · 4 comments

@fabioz
Contributor

fabioz commented Feb 7, 2023

By default, the LALR parser uses only a single token of lookahead, but it would be really nice if it could use a custom lookahead in specific cases. (I got used to JavaCC, which implements this with something like LOOKAHEAD(2) placed at the right spot to get around the restriction.)

The use case I have is below. From what I can see, the rule ?identifier: NAME (WS NAME|WS NAME_CONT)* sees the WS and takes that route, but it can't see that the whole group is optional and that it should stop matching. (In JavaCC I'd put a LOOKAHEAD(2) there so that it would try to match the whole group, and if only the first token matched but not the second, it would be OK.)

P.S.: although Earley works for this particular construct, it doesn't work for the full grammar I'm working on, so using it isn't really a solution.
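To make the LOOKAHEAD(2) behaviour described above concrete, here is a minimal hand-written sketch in plain Python. It is not Lark code and not how Lark or JavaCC work internally. The `tokenize` and `parse_identifier` helpers are hypothetical names, and the token patterns are simplified from the grammar in the sample (the `OR|AND|IN` keyword exclusion is left out).

```python
import re

# Simplified, hypothetical token set mirroring the grammar in the sample:
# NAME starts with a letter or underscore, NAME_CONT may start with a digit,
# WS is exactly one space. The OR/AND/IN keyword exclusion is omitted here.
TOKEN_RE = re.compile(r"(?P<NAME>[^\d\W]\w*)|(?P<NAME_CONT>\w+)|(?P<WS>[ ])")


def tokenize(text):
    """Split text into (type, value) pairs; raise on unexpected characters."""
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def parse_identifier(tokens, pos=0):
    """Parse NAME (WS NAME | WS NAME_CONT)* with two tokens of lookahead.

    Returns (identifier_text, position_of_next_unconsumed_token).
    """
    if pos >= len(tokens) or tokens[pos][0] != "NAME":
        raise ValueError("identifier must start with NAME")
    parts = [tokens[pos][1]]
    pos += 1
    # Two-token lookahead: take the optional group only if the next token is
    # WS AND the token after it is NAME or NAME_CONT. A one-token parser sees
    # only the WS and has to commit before it knows what follows.
    while (
        pos + 1 < len(tokens)
        and tokens[pos][0] == "WS"
        and tokens[pos + 1][0] in ("NAME", "NAME_CONT")
    ):
        parts.append(tokens[pos + 1][1])
        pos += 2
    return " ".join(parts), pos
```

With the failing input from the error report, `parse_identifier(tokenize("My name    param 1 passed"))` stops after `My name`, because the WS that follows is itself followed by another WS rather than by a name part.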

Error

My name    param 1 passed
        ^
Expected one of: 
	* NAME_CONT
	* _NEWLINE
	* NAME

Previous tokens: Token('WS', ' ')

Sample code

from lark.indenter import Indenter
from lark import Lark


class PythonIndenter(Indenter):
    NL_type = "_NEWLINE"
    OPEN_PAREN_types = ["LPAR", "LSQB", "LBRACE"]
    CLOSE_PAREN_types = ["RPAR", "RSQB", "RBRACE"]
    INDENT_type = "_INDENT"
    DEDENT_type = "_DEDENT"
    tab_len = 8


lark_spec = Lark(
    r"""
file_input: (_NEWLINE | root_stmt)*
?root_stmt: func_block

func_block:  BLOCK WS* "Function" WS* BLOCK WS* _NEWLINE (func_stmt)*

// i.e.: at least 2 spaces so that we have "Function name    arguments"
func_stmt: identifier WS WS+ parameters? func_suite

parameters: param ("," WS* param)* ("," WS*)?
param: param_name ["=" WS* param_default]
param_name: identifier
param_default: identifier

func_suite: _NEWLINE (_INDENT stmt+ _DEDENT)?

?identifier: NAME (WS NAME|WS NAME_CONT)*
?stmt: identifier _NEWLINE

NAME: /(?!(OR|AND|IN)\b)\b[^\d\W]\w*/
NAME_CONT: /(?!(OR|AND|IN)\b)\b\w+/
BLOCK: /\*\*\* */
WS: /[ ]/
_NEWLINE: ( /\r?\n[ ]*/ | COMMENT )+
COMMENT: /#[^\n]*/

%declare _INDENT _DEDENT
    """,
    parser="lalr",
    lexer="contextual",
    postlex=PythonIndenter(),
    start="file_input",
    keep_all_tokens=True,
    propagate_positions=True,
    debug=True,
)


if __name__ == "__main__":
    lark_spec.parse(
        """
*** Function ***
My name    param 1 passed
    Pass
""",
    )

@erezsh
Member

erezsh commented Feb 7, 2023

I agree, a custom lookahead, aka LALR(k), would be a really nice feature. And a very difficult one to implement correctly.

@MegaIng
Member

MegaIng commented Feb 7, 2023

Why are you explicitly putting down WS? Since that is ignored anyway, it has no purpose here.

@fabioz
Contributor Author

fabioz commented Feb 7, 2023

Why are you explicitly putting down WS? Since that is ignored anyway, it has no purpose here.

Hmm... I probably don't understand it well enough, then. Why is it ignored? Is there a way to not ignore it? In this particular grammar I'd like to have 2 spaces as a separator. Is this not possible?

i.e.: The code below would be valid code (as the identifier can have spaces):

Function name    Function arg 1, Function arg 2
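One way to think about the "2 spaces as a separator" requirement without extra parser lookahead is to let the tokenizer decide: a run of two or more spaces becomes its own separator token, while a single space stays WS. The sketch below shows the idea in plain Python only. The `SEP` token and the `tokenize` and `split_call` helpers are hypothetical, and whether an equivalent terminal behaves the same way under Lark's contextual lexer is an assumption that would need to be checked.

```python
import re

# Hypothetical token set. SEP is listed before WS so that a run of two or
# more spaces is matched as one separator instead of several WS tokens.
TOKEN_RE = re.compile(r"(?P<SEP>[ ]{2,})|(?P<WS>[ ])|(?P<WORD>\w+)|(?P<COMMA>,)")


def tokenize(line):
    """Split a line into (type, value) pairs; raise on unexpected characters."""
    tokens, pos = [], 0
    while pos < len(line):
        match = TOKEN_RE.match(line, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def split_call(line):
    """Return (name, args) for a line such as 'Some name    arg 1, arg 2'.

    Everything before the first SEP is the name; the rest is split on commas.
    """
    tokens = tokenize(line)
    types = [token_type for token_type, _ in tokens]
    if "SEP" not in types:
        return line, []  # no separator: the whole line is the name
    sep_index = types.index("SEP")
    name = "".join(value for _, value in tokens[:sep_index])
    rest = "".join(value for _, value in tokens[sep_index + 1:])
    args = [arg.strip() for arg in rest.split(",") if arg.strip()]
    return name, args
```

Once the separator is a distinct token, a single WS after a word can only continue the identifier, so one token of lookahead is enough for that decision. A trailing single space at the end of a line is an edge case this sketch does not handle specially.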

@MegaIng
Member

MegaIng commented Feb 7, 2023

Oh lol, I thought you had an %ignore statement in there since you were using the PythonIndenter. That one might break if you aren't ignoring inline WS:
