An experiment with parser combinators, to replace the current lexer/parser of ArkScript.
This project has multiple goals:
- making a more extensible parser than the current one Ark has;
- removing weird edge cases the current parser has;
- reducing the number of bugs the parser has;
- easier generation of error contexts.
You need CMake >= 3.24 and a C++17-capable compiler (e.g. Clang 14).
cmake -Bbuild -DCMAKE_BUILD_TYPE=Debug
cmake --build build
build/parser <filename>
Subparsers:
- let, mut, set
- handle nodes as values
- del
- condition
- handle nodes as condition
- handle nodes as values
- loop
- handle nodes as conditions
- handle nodes as body
- import
- begin block
- function
- handle nodes as body
- macro
- handle nodes as body
- atom
- number
- floating point 1.2
- scientific numbers 12e+14, 4.5e+16
- string
- handle \uxxxxx, \Uxxxxx, \xabc in strings
- handle other escape sequences: \n, \r, \t, \a, \b, \f, \0, \\, \"
- boolean
- nil
- symbol
- comment
- comments in blocks and not only top level ones
- function calls
- anonymous calls: ((fun () (print 1)))
- identifiers
- symbol
- capture
- dot notation
- dot notation after call: (@ list 14).field
- non alnum identifiers (+, !=, >=...)
- special syntax for (list ...): [...]
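To illustrate how subparsers can be composed, here is a minimal sketch of a few combinators and a number subparser built from them. It is not taken from this repository: the names `Parser`, `satisfy`, `seq`, `opt`, `many1`, `number` and `isNumber` are hypothetical, and the accepted grammar (optional sign, digits, optional fraction, optional exponent) is an assumption based on the examples 1.2, 12e+14 and 4.5e+16 listed above.

```cpp
#include <cctype>
#include <cstddef>
#include <functional>
#include <optional>
#include <string_view>

// A parser takes the input and a start position. On success it returns the
// position just past what it consumed, on failure std::nullopt.
using Parser = std::function<std::optional<std::size_t>(std::string_view, std::size_t)>;

// Matches a single character accepted by `pred`.
Parser satisfy(std::function<bool(char)> pred)
{
    return [pred](std::string_view in, std::size_t pos) -> std::optional<std::size_t> {
        if (pos < in.size() && pred(in[pos]))
            return pos + 1;
        return std::nullopt;
    };
}

// Runs `a` then `b`; fails if either one fails.
Parser seq(Parser a, Parser b)
{
    return [a, b](std::string_view in, std::size_t pos) -> std::optional<std::size_t> {
        if (auto mid = a(in, pos))
            return b(in, *mid);
        return std::nullopt;
    };
}

// Runs `p` if it matches, otherwise succeeds without consuming anything.
Parser opt(Parser p)
{
    return [p](std::string_view in, std::size_t pos) -> std::optional<std::size_t> {
        if (auto end = p(in, pos))
            return end;
        return pos;
    };
}

// Runs `p` one or more times.
Parser many1(Parser p)
{
    return [p](std::string_view in, std::size_t pos) -> std::optional<std::size_t> {
        auto end = p(in, pos);
        if (!end)
            return std::nullopt;
        while (auto next = p(in, *end))
        {
            if (*next == *end)
                break;  // guard against parsers that succeed without consuming
            end = next;
        }
        return end;
    };
}

// Assumed grammar: [sign] digits [. digits] [(e|E) [sign] digits]
Parser number()
{
    auto digit  = satisfy([](char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; });
    auto digits = many1(digit);
    auto sign   = satisfy([](char c) { return c == '+' || c == '-'; });
    auto dot    = satisfy([](char c) { return c == '.'; });
    auto e      = satisfy([](char c) { return c == 'e' || c == 'E'; });

    auto fraction = seq(dot, digits);
    auto exponent = seq(e, seq(opt(sign), digits));
    return seq(opt(sign), seq(digits, seq(opt(fraction), opt(exponent))));
}

// True when the whole input is a number according to the grammar above.
bool isNumber(std::string_view in)
{
    auto end = number()(in, 0);
    return end && *end == in.size();
}
```

With this sketch, `isNumber("4.5e+16")` succeeds while `isNumber("1.")` and `isNumber("12e")` fail, because the optional fraction and exponent backtrack when they are incomplete and the leftover input is then rejected.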
Error context generation:
- better messages
- what went wrong at the syntax level
- what went wrong at the language level
- possible fix
- sometimes the wrong token is underlined
Example:
ERROR
Package name expected after '.'
At ' ' @ 1:12
1 | (import a. )
| ^
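A report shaped like the one above could be produced along these lines. This is only a sketch under assumptions: `makeContext` is a hypothetical helper, not the function used by this parser, it takes 1-based line and column numbers, and it omits the quoted offending character shown in the "At" line of the example.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: formats a one-line error context with a caret placed
// under the offending column. `line` and `col` are assumed to be 1-based.
std::string makeContext(const std::string& message, const std::string& source_line,
                        std::size_t line, std::size_t col)
{
    const std::string gutter = std::to_string(line);
    std::ostringstream out;
    out << "ERROR\n" << message << "\n";
    out << "At " << line << ":" << col << "\n";
    out << gutter << " | " << source_line << "\n";
    // Blank gutter of the same width, then pad up to the column before the caret.
    out << std::string(gutter.size(), ' ') << " | "
        << std::string(col > 0 ? col - 1 : 0, ' ') << "^\n";
    return out.str();
}
```

The padding counts one cell per column, so the caret only lines up if `col` is expressed in displayed characters rather than bytes; tabs and wide characters would need extra handling.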
Misc:
- handle UTF-8
- store codepoints in struct { unsigned int cp; std::string repr; };
- homemade std::is"char category"(codepoint)
- decode UTF-8 to correctly calculate the columns
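The UTF-8 items above can be sketched as follows. The struct layout comes from the list; everything else (the `Codepoint` name, `decodeUtf8`, `columnCount`, `isDigit`, and the choice to map malformed bytes to U+FFFD) is an assumption made for illustration, and the decoder does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

struct Codepoint
{
    unsigned int cp;   // decoded Unicode value
    std::string repr;  // the original UTF-8 bytes
};

// Decodes `in` into codepoints. A malformed or truncated sequence yields
// U+FFFD for its first byte (assumed policy), then decoding resumes.
std::vector<Codepoint> decodeUtf8(std::string_view in)
{
    std::vector<Codepoint> out;
    std::size_t i = 0;
    while (i < in.size())
    {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        unsigned int cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }

        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                ok = false;  // not a continuation byte
            else
                cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) { len = 1; cp = 0xFFFD; }

        out.push_back({cp, std::string(in.substr(i, len))});
        i += len;
    }
    return out;
}

// Columns are counted in codepoints, not bytes.
std::size_t columnCount(std::string_view line)
{
    return decodeUtf8(line).size();
}

// Example of a homemade character-category test working on codepoints.
bool isDigit(unsigned int cp)
{
    return cp >= 0x30 && cp <= 0x39;
}
```

Counting codepoints instead of bytes means a line such as `héllo` reports 5 columns even though it occupies 6 bytes.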
This parser is for ArkScript, but it implements the next version of the language, for which some things had to change:
- quote is no longer supported, use functions with no arguments instead
- imports do not work the same way as before: (import "path.ark") won't work; we are using a package-like syntax now:
(import a)
(import a.b) # everything is prefixed by b
(import foo.bar.egg)
(import foo:*) # everything is imported in the current scope
(import foo.bar :a :b) # we import only a and b from foo.bar, in the current scope
- fields aren't chained in the AST: (Symbol:a GetField:b GetField:c) was the old way of representing a.b.c in the AST; now we have Field(Symbol:a Symbol:b Symbol:c), the node holding the field being a list of symbols
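As an illustration of the new field representation, the sketch below turns a dotted identifier into a single Field node holding a list of symbols. The `Node` type and `parseDotted` function are hypothetical stand-ins, not the AST types used by this parser.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical, simplified AST node.
struct Node
{
    enum class Kind { Symbol, Field };
    Kind kind;
    std::string name;            // used by Symbol nodes
    std::vector<Node> children;  // used by Field nodes: a flat list of symbols
};

// Builds Symbol:a for "a" and Field(Symbol:a Symbol:b Symbol:c) for "a.b.c".
// Returns std::nullopt when a segment is empty, as in "a..b" or "a.".
std::optional<Node> parseDotted(std::string_view text)
{
    std::vector<Node> symbols;
    std::size_t start = 0;
    while (true)
    {
        const std::size_t dot = text.find('.', start);
        const std::size_t count = dot == std::string_view::npos ? std::string_view::npos : dot - start;
        const std::string_view part = text.substr(start, count);
        if (part.empty())
            return std::nullopt;
        symbols.push_back(Node{Node::Kind::Symbol, std::string(part), {}});
        if (dot == std::string_view::npos)
            break;
        start = dot + 1;
    }
    if (symbols.size() == 1)
        return symbols.front();
    return Node{Node::Kind::Field, "", std::move(symbols)};
}
```

Keeping the symbols in one flat list means a consumer can walk `children` in order instead of matching a symbol followed by a run of separate GetField nodes.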