Provide custom LOOKAHEAD on LALR grammar #1247

fabioz · 2023-02-07T14:08:53Z

By default, the LALR parser uses only a single token of lookahead, but it'd be really nice if it could have a custom lookahead in specific cases (I got used to JavaCC, which implements this with something like a LOOKAHEAD(2) in the proper place to avoid the restriction).

The use case I have is below. From what I see, the rule ?identifier: NAME (WS NAME|WS NAME_CONT)* sees the WS and takes that route, but can't see that the whole group is actually optional and that it should stop matching there (in JavaCC I'd put a LOOKAHEAD(2) there, so it'd try to match the whole group, and if only the first token matched but not the 2nd, it'd be OK).

P.S.: although Earley works for this particular construct, it doesn't work for the full grammar I'm working on, so using it isn't really a solution...
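
To make the decision point concrete, here is a minimal sketch in plain Python of what a second token of lookahead would buy at the WS that follows a NAME. It is a hand-written loop, not lark's API and not how an LALR table works; `tokenize`, `read_identifier`, and the simplified token regex (which drops the OR/AND/IN exclusion) are all hypothetical names introduced only for illustration.

```python
import re

# Simplified stand-ins for the grammar's NAME, NAME_CONT and WS terminals.
# Assumption: the (?!(OR|AND|IN)\b) exclusion is left out to keep this short.
TOKEN_RE = re.compile(r"(?P<NAME>[^\d\W]\w*)|(?P<NAME_CONT>\w+)|(?P<WS> )")


def tokenize(line):
    """Split a single line into (kind, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(line):
        m = TOKEN_RE.match(line, pos)
        if m is None:
            raise ValueError(f"unexpected character at column {pos}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def read_identifier(tokens, start, k):
    """Consume NAME (WS NAME | WS NAME_CONT)* using k tokens of lookahead.

    Returns the index just past the identifier. With k=1 the loop commits to
    the continuation as soon as it sees WS; with k=2 it also inspects the
    token after the WS before committing.
    """
    if tokens[start][0] != "NAME":
        raise SyntaxError("identifier must start with NAME")
    i = start + 1
    while i < len(tokens) and tokens[i][0] == "WS":
        nxt = tokens[i + 1][0] if i + 1 < len(tokens) else None
        if nxt not in ("NAME", "NAME_CONT"):
            if k >= 2:
                break  # the WS belongs to the separator, not the identifier
            # k=1: the WS was already taken, so the next token is an error
            raise SyntaxError(f"expected NAME or NAME_CONT at token {i + 1}")
        i += 2
    return i


tokens = tokenize("My name    param 1 passed")
end = read_identifier(tokens, 0, k=2)
print("".join(text for _, text in tokens[:end]))
```

With `k=2` the loop stops before the run of spaces and leaves it for `func_stmt` to consume; with `k=1` the same call raises at the second space, which is the position the error report points at.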

Error

My name    param 1 passed
        ^
Expected one of: 
	* NAME_CONT
	* _NEWLINE
	* NAME

Previous tokens: Token('WS', ' ')

Sample code

from lark.indenter import Indenter
from lark import Lark


class PythonIndenter(Indenter):
    NL_type = "_NEWLINE"
    OPEN_PAREN_types = ["LPAR", "LSQB", "LBRACE"]
    CLOSE_PAREN_types = ["RPAR", "RSQB", "RBRACE"]
    INDENT_type = "_INDENT"
    DEDENT_type = "_DEDENT"
    tab_len = 8


lark_spec = Lark(
    r"""
file_input: (_NEWLINE | root_stmt)*
?root_stmt: func_block

func_block:  BLOCK WS* "Function" WS* BLOCK WS* _NEWLINE (func_stmt)*

// i.e.: at least 2 spaces so that we have "Function name    arguments"
func_stmt: identifier WS WS+ parameters? func_suite

parameters: param ("," WS* param)* ("," WS*)?
param: param_name ["=" WS* param_default]
param_name: identifier
param_default: identifier

func_suite: _NEWLINE (_INDENT stmt+ _DEDENT)?

?identifier: NAME (WS NAME|WS NAME_CONT)*
?stmt: identifier _NEWLINE

NAME: /(?!(OR|AND|IN)\b)\b[^\d\W]\w*/
NAME_CONT: /(?!(OR|AND|IN)\b)\b\w+/
BLOCK: /\*\*\* */
WS: /[ ]/
_NEWLINE: ( /\r?\n[ ]*/ | COMMENT )+
COMMENT: /#[^\n]*/

%declare _INDENT _DEDENT
    """,
    parser="lalr",
    lexer="contextual",
    postlex=PythonIndenter(),
    start="file_input",
    keep_all_tokens=True,
    propagate_positions=True,
    debug=True,
)


if __name__ == "__main__":
    lark_spec.parse(
        """
*** Function ***
My name    param 1 passed
    Pass
""",
    )

erezsh · 2023-02-07T16:41:00Z

I agree, a custom lookahead, aka LALR(k), would be a really nice feature. And a very difficult one to implement correctly.

MegaIng · 2023-02-07T17:38:45Z

Why are you explicitly putting down WS? Since that is ignored anyway, it has no purpose here.

fabioz · 2023-02-07T17:49:05Z

> Why are you explicitly putting down WS? Since that is ignored anyway, it has no purpose here.

Humm... probably I don't understand it enough then. Why is it ignored? Is there a way to not ignore it? In this particular grammar I'd like to have 2 spaces as a separator. Is this not possible?

i.e.: The code below would be valid code (as the identifier can have spaces):

Function name Function arg 1, Function arg 2
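
As a rough illustration of the separator rule I have in mind (a run of at least two spaces ends the name, while single spaces stay inside identifiers), here is a small regex-based sketch. It is independent of lark, and `split_statement` is a hypothetical helper, not part of the grammar above.

```python
import re

# Assumption: two or more consecutive spaces separate the name from the arguments.
SEPARATOR = re.compile(r" {2,}")


def split_statement(line):
    """Return (name, [arguments]) for a 'name<2+ spaces>arg, arg' line."""
    parts = SEPARATOR.split(line.strip(), maxsplit=1)
    name = parts[0]
    if len(parts) == 1:
        return name, []
    # Arguments are comma-separated and may contain single spaces themselves.
    args = [arg.strip() for arg in parts[1].split(",")]
    return name, [arg for arg in args if arg]


print(split_statement("My name    param 1 passed"))
print(split_statement("Function name  Function arg 1, Function arg 2"))
```

This only shows the intended splitting; it does nothing about the lookahead conflict in the LALR grammar itself.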

MegaIng · 2023-02-07T17:50:24Z

Oh lol, I thought you had an %ignore statement in there since you were using the PythonIndenter. That one might break if you aren't ignoring inline WS.

fabioz added the enhancement label Feb 7, 2023
