Skip to content

Multi-stack parser #1140

@isagalaev

Description

@isagalaev

Problem

Source:

Our current syntax can't express things defined as a sequence of other things, like this:

function ::= <title> '=' <params> '->' <body>

The problem here is that you don't know that you're in a function definition until you get to the -> symbol. Our parser can start a new parsing mode based only on a single starting lexeme: an opening quote, a keyword, a number, etc.

To work around that we use a horrible kludge:

  1. Match the whole body of the mode with a single regex.
  2. Start a new mode at the beginning of the matched string.
  3. Return the whole thing back to the parser.
  4. Parse it again, by the rules of the new mode.

Not only it's ugly, it also works only when we're lucky that the whole body can actually be parsed by a regex.

Solution design

We need to implement a "multi-stack" parser: instead of keeping track of a single stack of parsing modes the parser should be able to keep multiple stacks representing alternative parsing strategies to which it can go back if the current stack dead-ends for some reason. This effectively becomes a tree:

                                      stacks
                                      +----+
[root]---[ ]---[ ]---[ ]---[top]  <---|    |
          |           |               +----+
          |           +---[x]     <---|    | 
          |                           +----+
          +---[x]                 <---|    |
          |                           +----+
          +---[x]                 <---|    |
                                      +----+

This is an inverted tree with the stacks array keeping references to alternative parsing strategies (stacks) with each mode having a reference to its parent.

The topmost (or the rightmost) branch is the only long one and it represents the current alternative being parsed. Every other branch is exactly one level deep — a lead node representing a point from which to restart parsing if the current branch fails.

Operations:

Discover: upon finding lexemes (one or more) opening new possible modes add them to the beginning of the stacks array in the order they're defined. The old top is updated with the current index in the source so we could parse farther if all new strategies fail. stacks[0] becomes the new top.

Cancel: upon finding a canceling condition remove top from stacks, set top to the next available stack. If stacks is empty, treat as we currently do illegals — drop parsing.

Commit: if the mode successfully reaches its end lexeme, it is removed from the top stack and stacks is cleaned up from any items pointing to that mode

Syntax

Current idea is to have something like this:

var PARAMS = {
  className: 'params',
  begin: /\(/, end: /\)/,
  contains: [...]
}

var FUNCTION = {
  className: 'function',
  is: [hljs.TITLE_MODE, PARAMS, /=>/]
}

Where is array is a syntactic sugar for something like this:

{
  className: 'title',
  starts: {
    cancel: /\S/,
    contains: [
      {
        className: 'params',
        begin: /\(/, end: /\)/,
        starts: {
          cacncel: /\S/,
          contains: [
            {
              begin: /=>/, relevance: 0
            }
          ]
        }
      }
    ]
  }
}

It's pretty obvious why we need some syntactic sugar for this :-)

cancel is a new attribute, similar to the current illegal with the exception that it only cancels the current parsing strategy, not the whole thing.

More syntactic sugar:

  • Treat literal strings as keyword lists: 'a b c'{className: 'keyword', begin: /\b(a|b|c)\b/}
  • Treat literal regexes as lexemes: /=>/{begin: /=>/, relevance: 0}

Simplifications

Unless I'm very much mistaken, implementing all this will allow us to drop returnBegin, returnEnd and also beginKeywords and make the whole definition syntax less convoluted. Fingers crossed :-)


This is a work in progress…

Metadata

Metadata

Assignees

Labels

autocloseFlag things to future autoclose.big picturePolicy or high level discussionenhancementAn enhancement or new featureparser

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions