Multi-stack parser

## Problem

[Source](http://softwaremaniacs.org/blog/2014/07/26/highlight-js-what-next/en/):

> Our current syntax can't express things defined as a sequence of other things, like this:
> 
> ```
> function ::= <title> '=' <params> '->' <body>
> ```
> 
> The problem here is that you don't know that you're in a function definition until you get to the -> symbol. Our parser can start a new parsing mode based only on a single starting lexeme: an opening quote, a keyword, a number, etc.
> 
> To work around that we use a horrible kludge:
> 1. Match the whole body of the mode with a single regex.
> 2. Start a new mode at the beginning of the matched string.
> 3. Return the whole thing back to the parser.
> 4. Parse it again, by the rules of the new mode.
> 
> Not only it's ugly, it also works only when we're lucky that the whole body can actually be parsed by a regex.
## Solution design

We need to implement a "multi-stack" parser: instead of keeping track of a single stack of parsing modes the parser should be able to keep multiple stacks representing alternative parsing strategies to which it can go back if the current stack dead-ends for some reason. This effectively becomes a tree:

```
                                      stacks
                                      +----+
[root]---[ ]---[ ]---[ ]---[top]  <---|    |
          |           |               +----+
          |           +---[x]     <---|    | 
          |                           +----+
          +---[x]                 <---|    |
          |                           +----+
          +---[x]                 <---|    |
                                      +----+
```

This is an inverted tree with the `stacks` array keeping references to alternative parsing strategies (stacks) with each mode having a reference to its parent.

The topmost (or the rightmost) branch is the only long one and it represents the current alternative being parsed. Every other branch is exactly one level deep — a lead node representing a point from which to restart parsing if the current branch fails.

Operations:

**Discover**: upon finding lexemes (one or more) opening new possible modes add them to the beginning of the `stacks` array in the order they're defined. The old `top` is updated with the current index in the source so we could parse farther if all new strategies fail. `stacks[0]` becomes the new `top`.

**Cancel**: upon finding a canceling condition remove `top` from `stacks`, set `top` to the next available stack. If `stacks` is empty, treat as we currently do illegals — drop parsing.

**Commit**: if the mode successfully reaches its `end` lexeme, it is removed from the `top` stack and `stacks` is cleaned up from any items pointing to that mode
## Syntax

Current idea is to have something like this: 

``` js
var PARAMS = {
  className: 'params',
  begin: /\(/, end: /\)/,
  contains: [...]
}

var FUNCTION = {
  className: 'function',
  is: [hljs.TITLE_MODE, PARAMS, /=>/]
}
```

Where `is` array is a syntactic sugar for something like this:

``` js
{
  className: 'title',
  starts: {
    cancel: /\S/,
    contains: [
      {
        className: 'params',
        begin: /\(/, end: /\)/,
        starts: {
          cacncel: /\S/,
          contains: [
            {
              begin: /=>/, relevance: 0
            }
          ]
        }
      }
    ]
  }
}
```

It's pretty obvious why we need some syntactic sugar for this :-)

`cancel` is a new attribute, similar to the current `illegal` with the exception that it only cancels the current parsing strategy, not the whole thing.

More syntactic sugar:
- Treat literal strings as keyword lists: `'a b c'` ⇒ `{className: 'keyword', begin: /\b(a|b|c)\b/}`
- Treat literal regexes as lexemes: `/=>/` ⇒ `{begin: /=>/, relevance: 0}`
## Simplifications

Unless I'm very much mistaken, implementing all this will allow us to drop `returnBegin`, `returnEnd` and also `beginKeywords` and make the whole definition syntax less convoluted. Fingers crossed :-)

---

This is a work in progress…


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Multi-stack parser #1140

Problem

Solution design

Syntax

Simplifications

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Multi-stack parser #1140

Description

Problem

Solution design

Syntax

Simplifications

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions