Skip to content

TUTO~4a: An AST ready grammar

famished-tiger edited this page Feb 25, 2022 · 9 revisions

What's on the menu?

In the previous iteration, it was said that the grammar covered the complete TOML language.
Why this installment on grammar?

Well, we'll cover here advanced grammar topics that will help later to generate AST (Abstract Syntax Tree).
On the menu:

  • The *, + and ? quantifiers,
  • Grouping symbols; and,
  • Rule tags.

*, + and ? quantifiers

The grammar has changed significantly, already from the start rule:

  rule('toml' => 'expression*').tag 'toml'

If we leave the tag aside, the key change in this rule is the presence of the * sign just after the expression grammar symbol.
As in regular expressions, the star means 'zero or more' occurrence of the preceding symbol.
Therefore, the above rule in plain English can be read as follows: "A TOML (file) consists of zero or more expressions".

The single rule above replaces the following ones found in the previous iteration:

  rule('toml' => 'expr-list')
  rule('expr-list' => 'expr-list expression')
  rule('expr-list' => '')

In other words, the use of the star quantifier allows us to get rid of the expr-list rules.

The RGN (Rley Grammar Notation) DSL supports the following quantifiers:

Quantifier Meaning Example
? zero of one (i.e. optional) 'integer' => 'sign? DIGIT+'
* zero or more repetition 'toml' => 'expression*'
+ one or more repetition 'unsigned_integer' => 'DIGIT+'

Grouping symbols

The quantifiers are postfix operators, that is, they must be placed after the symbol to be repeated.
But sometimes, one wants to repeat a sequence of symbols instead of one single symbol.
This is where the use of parentheses ( and ) are handy as a mean to group the enclosed symbol and apply the quantifier to the whole.
The new grammar version contains examples of grouping:

  rule('array-values' => 'val (COMMA val)*')

The above rule reads as follows: an array of values consists of one value or more values separated by commas.

Rule tags

If one reads the very first lines of the grammar:

  rule('toml' => 'expression*').tag 'toml'

  rule('expression' => 'keyval')
  rule('expression' => 'table').tag 'table_expr'
  rule('keyval' => 'key EQUAL val').tag 'keyval'
  rule('key' => 'simple-key')
  rule('key' => 'dotted-key').tag 'dotted_key'
  rule('simple-key' => 'QUOTED-KEY').tag 'atomic_literal'
  rule('simple-key' => 'UNQUOTED-KEY').tag 'atomic_literal'

... one can easily spot rules "annotated" with a tag (in reality, these are calls to a tag method).
The tag method associates a name with a given rule. That name, in turn, is used to activate a programmer-defined method. These custom methods allow to run specific actions to build AST nodes related to the grammar rule under consideration.
Details about these tags-based methods will be exposed in a later installment about AST building;

Why are these methods not directly inline with the rules (like many parser generators) do? It's deliberate choice...
By moving the "semantic actions" elsewhere, we end up with:

  1. More readable grammar specifications uncluttered by AST building code A counter-example is the grammar file of the official Ruby language: it's over 14 000 lines long with grammar rules and C code for the semantic actions completely interwoven.
  2. Increased code reuse
    Look at the two last rules: they have the same tags: hence, the same method will be called. That's reuse across rules.
  3. Ability to use the same grammar to build different kinds of AST That's useful if you plan to generate very different results for the same inputs.

In the next installment, we'll detail the building blocks of an AST (Abstract Syntax Tree)