-
Notifications
You must be signed in to change notification settings - Fork 3.7k
Description
Problem
Our current syntax can't express things defined as a sequence of other things, like this:
function ::= <title> '=' <params> '->' <body>
The problem here is that you don't know that you're in a function definition until you get to the -> symbol. Our parser can start a new parsing mode based only on a single starting lexeme: an opening quote, a keyword, a number, etc.
To work around that we use a horrible kludge:
- Match the whole body of the mode with a single regex.
- Start a new mode at the beginning of the matched string.
- Return the whole thing back to the parser.
- Parse it again, by the rules of the new mode.
Not only it's ugly, it also works only when we're lucky that the whole body can actually be parsed by a regex.
Solution design
We need to implement a "multi-stack" parser: instead of keeping track of a single stack of parsing modes the parser should be able to keep multiple stacks representing alternative parsing strategies to which it can go back if the current stack dead-ends for some reason. This effectively becomes a tree:
stacks
+----+
[root]---[ ]---[ ]---[ ]---[top] <---| |
| | +----+
| +---[x] <---| |
| +----+
+---[x] <---| |
| +----+
+---[x] <---| |
+----+
This is an inverted tree with the stacks
array keeping references to alternative parsing strategies (stacks) with each mode having a reference to its parent.
The topmost (or the rightmost) branch is the only long one and it represents the current alternative being parsed. Every other branch is exactly one level deep — a lead node representing a point from which to restart parsing if the current branch fails.
Operations:
Discover: upon finding lexemes (one or more) opening new possible modes add them to the beginning of the stacks
array in the order they're defined. The old top
is updated with the current index in the source so we could parse farther if all new strategies fail. stacks[0]
becomes the new top
.
Cancel: upon finding a canceling condition remove top
from stacks
, set top
to the next available stack. If stacks
is empty, treat as we currently do illegals — drop parsing.
Commit: if the mode successfully reaches its end
lexeme, it is removed from the top
stack and stacks
is cleaned up from any items pointing to that mode
Syntax
Current idea is to have something like this:
var PARAMS = {
className: 'params',
begin: /\(/, end: /\)/,
contains: [...]
}
var FUNCTION = {
className: 'function',
is: [hljs.TITLE_MODE, PARAMS, /=>/]
}
Where is
array is a syntactic sugar for something like this:
{
className: 'title',
starts: {
cancel: /\S/,
contains: [
{
className: 'params',
begin: /\(/, end: /\)/,
starts: {
cacncel: /\S/,
contains: [
{
begin: /=>/, relevance: 0
}
]
}
}
]
}
}
It's pretty obvious why we need some syntactic sugar for this :-)
cancel
is a new attribute, similar to the current illegal
with the exception that it only cancels the current parsing strategy, not the whole thing.
More syntactic sugar:
- Treat literal strings as keyword lists:
'a b c'
⇒{className: 'keyword', begin: /\b(a|b|c)\b/}
- Treat literal regexes as lexemes:
/=>/
⇒{begin: /=>/, relevance: 0}
Simplifications
Unless I'm very much mistaken, implementing all this will allow us to drop returnBegin
, returnEnd
and also beginKeywords
and make the whole definition syntax less convoluted. Fingers crossed :-)
This is a work in progress…