Skip to content

Significant Whitespace #5

@liamoc

Description

@liamoc

So, this is more intended for discussing the potential implications of implementing significant whitespace.

You said on reddit:

If you want to help with significant whitespace, that would be great. The basics are working over on the whitespace branch with a lexer similar to the one you describe, with INDENT, OUTDENT, and "\n" tokens. The tricky case is the one I mentioned above:

elements.each(el =>
  el.click(event =>
    el.show() if e.active))

The two outer functions need to be closed before the closing parentheses and big OUTDENT occurs. I'm a bit at a loss for how to implement this in the grammar.

If you produced two outdent tokens, that would work. The way to do this is deceptively simple.

Currently, I assume your lexer has some idea of a current indentation level, and if this increases, then an INDENT is produced, if it decreases, an OUTDENT is produced, and otherwise a standard \n is produced.

The way to adapt it is to instead stop caring about indentation and start caring about alignment. This is how haskell does it to great success. Most of my examples in code will not be in ruby but a haskell-like, so hopefully you know haskell.

The way to implement it in your lexer is to have a stack of indentations. Looking at this example:

a
  b
  c
     d
         e
     f
g

This should be grouped like this:

a(b c(d(e)f))g

So, let's follow through with my stack indentation idea and see where that goes.

First we have a, it is at zero indentation level, so we lex that as usual. Next, we have an INDENT of two spaces, so we push 2 to the stack, then emit the indent token as per usual:

2:[] -- stack 
a INDENT

Then, we lex b. c isn't a INDENT or OUTDENT so the stack is unchanged and c is lexed as per usual.

2:[]
a INDENT b NEWLINE c

Then we have an indentation of 3 spaces, so we push 3 to the stack, and emit an indent token as per usual.

3:2:[] -- stack
a INDENT b NEWLINE c INDENT

Then, we have d, which we lex as per normal, then an indent of 4 spaces, so we emit an indent and add to the stack.

4:3:2:[] -- stack
a INDENT b NEWLINE c INDENT d INDENT

Finally, we lex e, and then we have the fun stuff. We first have an outdent of 4 spaces. If this outdent was anything less than 4, we'd throw an error. As it is, it is equal to 4, so that's fine. We subtract 4 from the top of the stack, and get zero.

 0:3:2:[] -- stack
a INDENT b NEWLINE c INDENT d INDENT e 

Then, we pop the zero off the stack, and emit an OUTDENT token. We also lex f as per usual.

3:2:[] --stack
a INDENT b NEWLINE c INDENT d INDENT e OUTDENT f

Finally we hit the major hurdle you were facing. We have an outdent of 5 spaces. So, we generalise the above algorithm and subtract 5 spaces from the whole stack, starting at the left:

0:0:[] -- stack
a INDENT b NEWLINE c INDENT d INDENT e OUTDENT f

Then, for each zero at the top of the stack, we emit an OUTDENT
[] -- stack
a INDENT b NEWLINE c INDENT d INDENT e OUTDENT f OUTDENT OUTDENT

Then, we lex g

a INDENT b NEWLINE c INDENT d INDENT e OUTDENT f OUTDENT OUTDENT g

Note, if i replace INDENT with ( and OUTDENT with ) and NEWLINE with a space, then you get:

 a(b c(d(e)f))g

Which is the exact grouping we wanted.

This way, you get exactly 1 OUTDENT token for every INDENT token. Hope this helps.

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions