pgen

A PEG tokenizer/parser-generator.

This is the program that generates the tokenizer and parser for the Daisho programming language.

Given the specification of a grammar, pgen generates a very fast tokenizer and parser for that grammar.

Usage Example

Token Syntax

// Keywords
CLASS: "class";
PLUS:  "+";

// A state machine that tokenizes single line comments.
SLCOM: (2, 3) {
  (0, '/') -> 1;
  (1, '/') -> 2;
  (2, [^\n]) -> 2;
  (2, [\n]) -> 3;
};

// A state machine that tokenizes whitespace.
WS: 1 {
  ((0, 1), [ \n\r\t]) -> 1;
};

/* Single and multiline C comments are allowed in `.peg` files. */
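
One way to read the SLCOM rule above: the list after the rule name, (2, 3), gives the accepting states, and each line inside the braces maps a (state, character class) pair to a next state. The sketch below hand-codes that machine in C and returns the length of the longest accepted prefix. It illustrates the idea only and is not the code pgen emits. The names slcom_step, slcom_accepts, and slcom_match are hypothetical, and the sketch works on plain char rather than decoded codepoints.

```c
#include <stddef.h>

/* Transition function for the SLCOM machine, hand-written for illustration.
   Returns the next state, or -1 when no transition applies. */
static int slcom_step(int state, char c) {
  switch (state) {
  case 0: return c == '/' ? 1 : -1;
  case 1: return c == '/' ? 2 : -1;
  case 2: return c == '\n' ? 3 : 2; /* [^\n] stays in 2, [\n] moves to 3 */
  default: return -1;               /* state 3 has no outgoing transitions */
  }
}

/* Assumed accepting states, read from the "(2, 3)" after the rule name. */
static int slcom_accepts(int state) { return state == 2 || state == 3; }

/* Length of the longest prefix of s[0..len) that the machine accepts,
   or 0 if no prefix is accepted. */
size_t slcom_match(const char *s, size_t len) {
  int state = 0;
  size_t longest = 0;
  for (size_t i = 0; i < len; i++) {
    state = slcom_step(state, s[i]);
    if (state < 0) break;
    if (slcom_accepts(state)) longest = i + 1;
  }
  return longest;
}
```

Tracking the longest accepted prefix, rather than stopping at the first accepting state, is what lets the comment token include its trailing newline.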

Parser Syntax

/* pgen's syntax, written in itself. */

PERCENT: "%";
LESSTHAN: "<";
GREATERTHAN: ">";
COMMA: ",";
// And so on...

grammar <- (directive / definition)*

directive <- PERCENT LOWERIDENT (&(!eol) WS)* eol

definition <- LOWERIDENT variables? ARROW slashexpr

variables <- LESSTHAN variable (COMMA variable)* GREATERTHAN

variable <- (!(GREATERTHAN / COMMA) wildcard)*

slashexpr <- modexprlist (DIV modexprlist)*

modexprlist <- modexpr*

modexpr <- (LOWERIDENT COLON)?                    // Variable assignment
           (AMPERSAND / EXCLAIMATION)*            // Operators
           baseexpr                               // The modified expression
           (QUESTION / STAR / PLUS)*              // More Operators
           (PIPE (STRING / baseexpr))?            // Error handlers

baseexpr <- UPPERIDENT                            // Token to match
          / LOWERIDENT !(LT / ARROW)              // Rule to call
          / CODEEXPR                              // Code to execute
          / OPENPAREN slashexpr CLOSEPAREN

%node EOL
eol <- {
    bool iseol = 0;
    if (ctx->pos >= ctx->len) {
      iseol = 1;
    } else if (ctx->tokens[ctx->pos - 1].line < ctx->tokens[ctx->pos].line) {
      iseol = 1;
    }
    ret = iseol ? leaf(EOL) : NULL;
}

wildcard <- {
    rule = pgen_astnode_leaf(ctx->alloc, ctx->tokens[ctx->pos++].kind);
}

For a more precise description of the grammar, see pgen_grammar.peg. Or just find me and ask me.

The syntax of pgen is based on the paper "Parsing Expression Grammars: A Recognition-Based Syntactic Foundation" by Bryan Ford. This specific parser is inspired by packcc by Arihiro Yoshida. You may see many commonalities. The main difference is that while packcc parses at the source level, pgen introduces a tokenizer and parses the token stream instead.

Operators:

/ - Try to match the left side. If that fails, try to match the right side. Returns the first that matches. Otherwise fail.
& - Try to parse. On a match, rewind back to the starting position and return SUCC. Otherwise fail as usual.
! - Try to parse, and then rewind. Returns SUCC on no match and fails on match.
? - Optionally match, returning either the result, or SUCC if no match. Does not cause the rule to fail.
* - Match zero or more. Returns SUCC.
+ - Match one or more. Returns SUCC, or fails if no matches.
() - Matches if all expressions inside match. Returns SUCC or the single match within if there's only one.
{} - Code to insert into the parser. Assign to ret for the return value of this expression, or rule for the rule.
: - Capture the info from a match inside a variable in the current rule.
| - Register an error using the string or expression on the right, and exit all parsing.
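
The & and ! operators both parse and then rewind. They differ only in which outcome counts as success, and / rewinds between alternatives. A minimal sketch of that behavior over an array of token kinds follows. The names (mini_ctx, match_tok, and_pred, not_pred, choice) and the integer token kinds are made up for this example and are not the generated parser's API.

```c
#include <stddef.h>

typedef struct {
  const int *toks; /* token kinds */
  size_t len;      /* number of tokens */
  size_t pos;      /* index of the next token to read */
} mini_ctx;

typedef int (*rule_fn)(mini_ctx *ctx);

/* Consume one token of the given kind. Returns 1 on match, 0 otherwise. */
int match_tok(mini_ctx *ctx, int kind) {
  if (ctx->pos < ctx->len && ctx->toks[ctx->pos] == kind) {
    ctx->pos++;
    return 1;
  }
  return 0;
}

/* &e : succeeds if e matches, but always restores the position. */
int and_pred(mini_ctx *ctx, rule_fn e) {
  size_t saved = ctx->pos;
  int ok = e(ctx);
  ctx->pos = saved;
  return ok;
}

/* !e : succeeds if e does NOT match, and always restores the position. */
int not_pred(mini_ctx *ctx, rule_fn e) {
  size_t saved = ctx->pos;
  int ok = e(ctx);
  ctx->pos = saved;
  return !ok;
}

/* a / b : ordered choice. Tries a first, rewinds, then tries b. */
int choice(mini_ctx *ctx, rule_fn a, rule_fn b) {
  size_t saved = ctx->pos;
  if (a(ctx)) return 1;
  ctx->pos = saved;
  if (b(ctx)) return 1;
  ctx->pos = saved;
  return 0;
}

/* Two tiny example rules, each matching a single token kind. */
enum { TOK_A = 1, TOK_B = 2 };
int rule_a(mini_ctx *ctx) { return match_tok(ctx, TOK_A); }
int rule_b(mini_ctx *ctx) { return match_tok(ctx, TOK_B); }
```

Because every operator restores the saved position on failure, a failed alternative never leaves the parser partway through the token stream.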

Directives:

%oom - Define the action that should be taken when out of memory
%node - Define an ASTNode kind
%preinclude - Include a file before astnode, but after support libs
%include - Include a file after astnode, but before the parser
%postinclude - Include a file after the parser
%predefine - #define something before astnode, but after support libs
%define - #define something after astnode, but before the parser
%postdefine - #define something after the parser
%precode - Insert code before astnode, but after support libs
%code - Insert code after astnode, but before the parser
%postcode - Insert code after the parser
%extra - Add fields to the astnode
%extrainit - Add initialization to the astnode
%errextra - Add fields to errors
%errextrainit - Add initialization to errors
%ignore - Don't add token to list of tokens parsed in python bindings

C Builtins:

rec(label) - Record the parser's state to a label
rew(label) - Rewind the parser's state to a label
node(kind, children...) - Create an astnode with a kind name and fixed number of children
kind(name) - Get the enum value of an astnode kind name
list(kind) - Create an astnode with a kind name and a dynamic number of children
leaf(kind) - Create an astnode with no children
add(list, node) - Add an astnode as a child to an astnode created by list()
has(node) - 0 if the node is NULL or SUCC, 1 otherwise.
repr(node, ofnode) - Set the string representation of the current node to another node's
srepr(node, string) - Set the string representation of node to a cstring
cprepr(node, cps, len) - Set the string representation of node to a codepoint string
expect(kind, cap) - Parses a token the same way TOKEN does. Returns the astnode if cap(tured).

Error Logging Builtins

INFO(msg) - Log an error to ctx->errlist with the position and severity 0.
WARNING(msg) - Log an error to ctx->errlist with the position and severity 1.
ERROR(msg) - Log an error to ctx->errlist with the position and severity 2.
FATAL(msg) - Log an error to ctx->errlist with the position and severity 3, and set ctx->exit = 1.
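
As a rough mental model, each of these macros appends a record holding a message, a position, and a severity to a list, and FATAL additionally raises an exit flag. The sketch below models that with hypothetical types (err_record, err_ctx), a hypothetical log_err function, and a fixed-size list. The real layout of ctx->errlist may well differ.

```c
#include <stddef.h>

#define MAX_ERRS 32

typedef struct {
  const char *msg;
  size_t pos;   /* parser position when the error was logged */
  int severity; /* 0 = info, 1 = warning, 2 = error, 3 = fatal */
} err_record;

typedef struct {
  size_t pos;      /* current parser position */
  int should_exit; /* stands in for ctx->exit */
  size_t nerrs;
  err_record errlist[MAX_ERRS];
} err_ctx;

/* Append a record at the current position. Severity 3 also sets the exit
   flag. Returns 1 if the record was stored, 0 if the list was full. */
int log_err(err_ctx *ctx, const char *msg, int severity) {
  if (severity == 3) ctx->should_exit = 1;
  if (ctx->nerrs >= MAX_ERRS) return 0;
  err_record *rec = &ctx->errlist[ctx->nerrs++];
  rec->msg = msg;
  rec->pos = ctx->pos;
  rec->severity = severity;
  return 1;
}
```

After parsing, a caller would walk errlist and decide what to print based on severity, and check the exit flag before trusting the returned AST.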

Notes:

There's documentation, but realistically you're not going to figure everything out on your own. Talk to me, submit an issue, send me an email, or find me on Discord, and I can walk you through how to use it.

C code in code expressions is parsed by matching left and right curly braces. Therefore, it can get confused if you write something like { ret = ...; ret->str = "I have a } in my string"; }. Instead of writing "{" or "}" inside a string, you can use the macros LBSTR/RBSTR. Then you would have { ret = ...; ret->str = "I have a " RBSTR " in my string"; }
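
To see why a brace inside a string literal is a problem, here is a minimal sketch of a brace-counting scanner. It is not pgen's actual implementation. code_expr_end and spans_whole_text are hypothetical helpers, and the definitions of LBSTR and RBSTR shown here are an assumption about what the macros expand to.

```c
#include <stddef.h>
#include <string.h>

/* Assumed definitions: string literals holding a single brace. Adjacent
   string literals are concatenated by the C compiler, so
   "a " RBSTR " b" is the same string as "a } b". */
#define LBSTR "{"
#define RBSTR "}"

/* Given text that starts at '{', return the index just past the brace that
   a plain counter believes closes it, or 0 if the braces never balance.
   Every brace is counted, including ones inside string literals. */
size_t code_expr_end(const char *src) {
  size_t depth = 0;
  for (size_t i = 0; src[i] != '\0'; i++) {
    if (src[i] == '{') {
      depth++;
    } else if (src[i] == '}') {
      if (depth == 0) return 0;
      if (--depth == 0) return i + 1;
    }
  }
  return 0;
}

/* 1 if the counter thinks the code expression covers the whole text. */
int spans_whole_text(const char *src) {
  return code_expr_end(src) == strlen(src);
}
```

With the brace written directly in the string, the counter stops at the brace inside the literal and cuts the expression short. With RBSTR, the source text contains only the real closing brace, so the counter ends where it should.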

Generated Parser C API Example:

See examples/pl0.c for the full example put together.

1. Load your file into a cstring, then decode it with the UTF8 -> UTF32 decoder.

char *input_str = NULL;
size_t input_len = 0;
readFile("pl0.pl0", &input_str, &input_len);

codepoint_t *cps = NULL;
size_t cpslen = 0;
if (!UTF8_decode(input_str, input_len, &cps, &cpslen))
  fprintf(stderr, "Could not decode to UTF32.\n"), exit(1);

2. Initialize the tokenizer, then run the tokenizer.

You will have to create some sort of list data structure to hold the tokens. Here, we add a token to that list with add_tok. You will have to roll your own.

This is also the step where you can discard any tokens you don't want. This way, you can parse comments and whitespace as tokens, and then ignore them.

The .kind member of your token struct will contain what kind of token it is, as described by your .tok file. When there are no more tokens left to parse, .kind of the returned token will be PL0_TOK_STREAMEND. You can also create and append your own LANG_TOK_STREAMBEGIN token at the beginning, if you wish.

pl0_tokenizer tokenizer;
pl0_tokenizer_init(&tokenizer, cps, cpslen);

pl0_token tok;
do {
  tok = pl0_nextToken(&tokenizer);

  // Discard whitespace, comments, and end of stream,
  // add other tokens to the list.
  if (!(tok.kind == PL0_TOK_SLCOM || tok.kind == PL0_TOK_MLCOM ||
        tok.kind == PL0_TOK_WS || tok.kind == PL0_TOK_STREAMEND))
    add_tok(tok);

} while (tok.kind != PL0_TOK_STREAMEND);
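
Since add_tok and toklist are left for you to write, here is one possible shape: a growable array that doubles its capacity when full. The token type below (example_token) is a stand-in so the sketch compiles on its own. In real use you would store pl0_token instead. The buf and size field names are chosen to match how toklist is used in step 3; cap and the starting capacity of 64 are arbitrary choices for this example.

```c
#include <stdlib.h>

/* Stand-in for the generated token struct. */
typedef struct {
  int kind;
} example_token;

typedef struct {
  example_token *buf; /* heap-allocated array of tokens */
  size_t size;        /* number of tokens stored */
  size_t cap;         /* number of tokens buf has room for */
} token_list;

static token_list toklist = {NULL, 0, 0};

/* Append a token, doubling the buffer when it is full.
   Returns 1 on success, or 0 if out of memory (the list is left unchanged). */
int add_tok(example_token tok) {
  if (toklist.size == toklist.cap) {
    size_t new_cap = toklist.cap ? toklist.cap * 2 : 64;
    example_token *new_buf = realloc(toklist.buf, new_cap * sizeof *new_buf);
    if (!new_buf) return 0;
    toklist.buf = new_buf;
    toklist.cap = new_cap;
  }
  toklist.buf[toklist.size++] = tok;
  return 1;
}
```

The loop above ignores the return value of add_tok. If you adopt something like this, checking it lets you stop tokenizing cleanly when memory runs out.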

3. Initialize the allocator and parser.

pgen_allocator allocator = pgen_allocator_new();
pl0_parser_ctx parser;
pl0_parser_ctx_init(&parser, &allocator, toklist.buf, toklist.size);

4. Call a rule to parse an AST.

Any rule can be an entry point for your parser. The function generated for each rule has the signature:

lang_astnode_t *lang_parse_rulename(lang_parser_ctx* ctx);

For our pl0 example, program is the rule we want, and we call it to parse the abstract syntax tree like so:

pl0_astnode_t *ast = pl0_parse_program(&parser);

5. When you're done with your AST, clean up whatever memory you used.

pgen_allocator_destroy(&allocator); // The whole AST is freed with the allocator
free(toklist.buf);                  // The list of tokens (provide your own)
free(cps);                          // The file as UTF32
free(input_str);                    // The file as UTF8
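
The first line of the cleanup works because every AST node comes from the same allocator, so a single destroy call releases them all together. The sketch below shows that pattern in its simplest form: a linked list of malloc'd blocks that is freed in one pass. It is an illustration with hypothetical names (mini_arena, arena_alloc, arena_destroy), not how pgen_allocator is implemented.

```c
#include <stddef.h>
#include <stdlib.h>

/* Header placed in front of every allocation. The max_align_t member keeps
   the payload that follows the header suitably aligned for any type. */
typedef union block_header {
  union block_header *next;
  max_align_t align_;
} block_header;

typedef struct {
  block_header *head; /* most recent allocation */
  size_t live_blocks; /* number of allocations not yet freed */
} mini_arena;

/* Allocate n bytes owned by the arena. Returns NULL if out of memory. */
void *arena_alloc(mini_arena *a, size_t n) {
  block_header *h = malloc(sizeof *h + n);
  if (!h) return NULL;
  h->next = a->head;
  a->head = h;
  a->live_blocks++;
  return h + 1; /* payload starts right after the header */
}

/* Free every allocation made from the arena in one pass. */
void arena_destroy(mini_arena *a) {
  block_header *h = a->head;
  while (h) {
    block_header *next = h->next;
    free(h);
    h = next;
  }
  a->head = NULL;
  a->live_blocks = 0;
}
```

The practical consequence is the same as in the steps above: never free individual AST nodes yourself, and do not touch the AST after the allocator has been destroyed.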

More comprehensive documentation on these things will come eventually.

TODO

Design an algorithm for merging state machines
Multiple %node declarations in one
State Machine automaton state reachability analysis
Regex tokenizer rules.
Add a flag to warn on token/astnode kinds not used in the parser
Rethink Token/Node print functions
PGEN_RUNTIME_INCLUDE scope guard
Rewrite memory allocator with GC and option to leak.
Compiler option to generate runner C file
%drop tokens
%main and %input, to complement daisho-explorer.

License

The license for pgen is GPLv3. The license applies only to the files already in this repository. The code that you generate using pgen belongs to you (or whoever has the copyright to the .peg file it was generated from).

However, if you modify or distribute pgen itself, you must still follow the terms of the GPL.

If these terms are not acceptable for you, please contact me with your use case.
