-
Notifications
You must be signed in to change notification settings - Fork 2k
Description
So, this is more intended for discussing the potential implications of implementing significant whitespace.
You said on reddit:
If you want to help with significant whitespace, that would be great. The basics are working over on the whitespace branch with a lexer similar to the one you describe, with INDENT, OUTDENT, and "\n" tokens. The tricky case is the one I mentioned above:
elements.each(el =>
el.click(event =>
el.show() if e.active))
The two outer functions need to be closed before the closing parentheses and big OUTDENT occurs. I'm a bit at a loss for how to implement this in the grammar.
If you produced two outdent tokens, that would work. The way to do this is deceptively simple.
Currently, I assume your lexer has some idea of a current indentation level, and if this increases, then an INDENT is produced, if it decreases, an OUTDENT is produced, and otherwise a standard \n is produced.
The way to adapt it is to instead stop caring about indentation and start caring about alignment. This is how haskell does it to great success. Most of my examples in code will not be in ruby but a haskell-like, so hopefully you know haskell.
The way to implement it in your lexer is to have a stack of indentations. Looking at this example:
a
b
c
d
e
f
g
This should be grouped like this:
a(b c(d(e)f))g
So, let's follow through with my stack indentation idea and see where that goes.
First we have a, it is at zero indentation level, so we lex that as usual. Next, we have an INDENT of two spaces, so we push 2 to the stack, then emit the indent token as per usual:
2:[] -- stack
a INDENT
Then, we lex b. c isn't a INDENT or OUTDENT so the stack is unchanged and c is lexed as per usual.
2:[]
a INDENT b NEWLINE c
Then we have an indentation of 3 spaces, so we push 3 to the stack, and emit an indent token as per usual.
3:2:[] -- stack
a INDENT b NEWLINE c INDENT
Then, we have d, which we lex as per normal, then an indent of 4 spaces, so we emit an indent and add to the stack.
4:3:2:[] -- stack
a INDENT b NEWLINE c INDENT d INDENT
Finally, we lex e, and then we have the fun stuff. We first have an outdent of 4 spaces. If this outdent was anything less than 4, we'd throw an error. As it is, it is equal to 4, so that's fine. We subtract 4 from the top of the stack, and get zero.
0:3:2:[] -- stack
a INDENT b NEWLINE c INDENT d INDENT e
Then, we pop the zero off the stack, and emit an OUTDENT token. We also lex f as per usual.
3:2:[] --stack
a INDENT b NEWLINE c INDENT d INDENT e OUTDENT f
Finally we hit the major hurdle you were facing. We have an outdent of 5 spaces. So, we generalise the above algorithm and subtract 5 spaces from the whole stack, starting at the left:
0:0:[] -- stack
a INDENT b NEWLINE c INDENT d INDENT e OUTDENT f
Then, for each zero at the top of the stack, we emit an OUTDENT
[] -- stack
a INDENT b NEWLINE c INDENT d INDENT e OUTDENT f OUTDENT OUTDENT
Then, we lex g
a INDENT b NEWLINE c INDENT d INDENT e OUTDENT f OUTDENT OUTDENT g
Note, if i replace INDENT with ( and OUTDENT with ) and NEWLINE with a space, then you get:
a(b c(d(e)f))g
Which is the exact grouping we wanted.
This way, you get exactly 1 OUTDENT token for every INDENT token. Hope this helps.