This is a parser generator written in JavaScript that generates a JavaScript parser. It uses an intermediate language called 'H1' for describing parsers, so it can potentially be retargetable to other languages/environments.
Features include:
- Documentation.
- Portable architecture.
- Small size.
The executable portion of the code is located in the file releases/spg.js, which can be obtained by cloning the github repository located at https://github.com/raymond1/simple-parser-generator/. The installation instructions for a NodeJS installation and for a browser installation are shown below.
- Make a new package.json file. This can be done with the command
npm init
and pressing enter through all the prompts to use the default options. - Add or set the "type" attribute in the package.json file to the value "module".
- In a terminal, run the command
npm install git+https://github.com/raymond1/simple-parser-generator.git
- At this point, the simple parser generator should be installed. To use it, create a file called index.js and use an import statement. For example:
import {ParserGenerator} from 'simple-parser-generator'
- Set up a web server that can serve HTML and JavaScript pages with the correct Content-Type headers.
- Create a small website containing an index.html file and put it into the document root or public_html folder or other folder where your web server will be serving.
- Clone the https://github.com/raymond1/simple-parser-generator repository.
- Copy the file releases/spg.js into the folder that your web server is serving.
- In your index.html file, add the following just before the end of your body tag:
<script type="importmap">
{
"imports": {
"simple-parser-generator":"./spg.js"
}
}
</script>
<script type="module">
import {ParserGenerator} from 'simple-parser-generator'
</script>
The following JavaScript code (with line numbers added) shows a very minimal usage of the SPG parser generator:
Listing 1:
1 |import {ParserGenerator, TreeViewer} from 'simple-parser-generator'
2 |let parserGenerator = new ParserGenerator()
3 |let parserSpecification =
4 |`string literal
5 | Hello, world.`
6 |
7 |let parser = parserGenerator.generateParser(parserSpecification)
8 |
9 |let inputString = 'Hello, world.'
10 |let output = parser.parse(inputString)
11 |
12 |let treeViewer = new TreeViewer()
13 |treeViewer.display('text', output)
For a complete example, with actual files, see the demos/basic_usage directory.
Listing 1 shows the basic structure of a program that uses the SPG.
Line 1 is an import statement which brings the 'ParserGenerator' and 'TreeViewer' classes from the simple-parser-generator module. The 'ParserGenerator' class is used to generate parsers. The 'TreeViewer' class is used to display the output of the parser generator in a human readable form. You do not need to use the 'TreeViewer' class and can use a debugger to view the output of the parser generated by the parser generator instead.
Line 2 instantiates a 'ParserGenerator' object. This object is needed to generate parsers.
Lines 3 to 5 assign a string value to the variable 'parserSpecification'. SPG parsers are specified using the H1 language which will be explained later on. Understanding the H1 language is the core part of understanding the SPG.
Line 7 creates a new parser using the specification from lines 3 to 5 and assigns it to the 'parser' variable. The parser specification, given in lines 3 to 5, define a parsing node, which is a structure that will reside in memory after line 7 is executed. Here, the parsing node being defined is of the type 'string literal', which detects if a string is exactly equal to the first parameter of the 'string literal' parsing node, which in this case, was set to the string 'Hello, world.'
Line 9 provides an input string to the parser. Here it is the string 'Hello, world.' It is a coincidence that the input string is equal to the parameter passed in to the 'string literal' parsing node. Most of the time, the input string could potentially be any arbitrary string. The input string passed to the parser represents the test program that needs to be parsed.
Line 10 uses the generated parser to parse the input and stores a tree of output objects in the 'output' variable. Each node of this output tree is called a 'MatchNode'. These output objects can be examined either in a debugger or with the provided TreeViewer, which helps serialize MatchNode trees generated from the SPG into a somewhat human readable format.
When Line 10 is executed, the input string defined in line 9 will be passed to the root node parsing function defined by the parser specification string that was defined in lines 3 to 5. In this case, the root node parsing function that was generated was the 'string literal' function defined in the parser specification.
Line 12 instantiates a new 'TreeViewer' object for help in interpreting the output tree generated by the parser.
Line 13 produces the output shown above. That output shows several things, but are essentially a string representation of the contents of the tree produced by the parser's 'parse' function from line 10.
The most important thing to take away from the above listing is that line 10 from the program listing generates an object that is treelike in form.
After the parser generator has created a parser, and the parser's 'parse' function has been called, the input to the parse function, being a string, will be passed into the parser's root node. The root node is one of the functions that serve as building blocks for parsers in the SPG system. It will correspond to the function that was instantiated during the parser generation operation from line 7 of listing 1. This root node was specified using the H1 language in the parser specification string. In listing 1, the first line of the parser specification string (line 4 of listing 1) indicates in the H1 language that the root node is a 'string literal' type function. On line 10 of listing 1, the root node receives the input string and begins processing it. Once the root node's parsing function has completed, the parser's 'parse' function finishes, returning a tree of output values which can be saved into a JavaScript variable for further processing.
Hopefully, this section has provided an idea of what the Simple Parser Generator does. More explanations and background context will be provided in the following sections.
In the H1 language, there are a limited number of fundamental building block functions that are used for specifying parsers. These building block functions have the following names:
- character class
- string literal
- not
- entire
- sequence
- or
- and
- multiple
- optional
- split
- name
- jump
Each of these building block functions will be laid out in detail below, along with examples of the H1 programming language syntax which is used to specify parsers in the SPG system. H1 itself uses "Space Tree Notation" which is a tree based object notation that you can learn about here:https://github.com/raymond1/space-tree. Controlling the SPG is mostly a matter of understanding how to use the H1 programming language to write parsers. To do that, it is necessary to understand the API.
To help understand the API section, some concepts about the H1 language will be explained here first. Some of these concepts have are better explained in the Space Tree Notation link above.
- Each line in the H1 language is either a function name(as shown above) or an arbitrary string. Each line can be called a node.
- Child nodes of another node are indicated by adding a line underneath the parent node. If the number of leading spaces of the parent node is equal to N, then the number of leading spaces in the child node needs to be N+1. Only functions can have child nodes. String nodes cannot have child nodes.
- During parser execution, the input string will be passed to the root node, which is the first node in the parser specification string.
- All the functions are essentially string functions, and parsing itself is a kind of differentiation operation. In other words, the parsing operators can be thought of very often as filters. When a string is passed into a function, the function is said to 'match' or 'produce a match', which means that a string passes the filter of a particular function. If an input string does not match with a function, it can be thought of as having essentially been rejected by that function.
- Every function produces an output object with two primary properties: 'matchFound', and 'matchString'. matchFound refers to whether or not the input string given to a function produces a match. 'matchString' is the portion of the input string which passed the filter.
- Functions and strings are essentially types of 'expressions' in other programming languages.
Description: This function's purpose is to filter out sets that don't contain letters from a designated character class set. It matches consecutive letters of the input string, starting from the first letter, until it encounters a character that does not belong to the character class.
Match condition: At least the first letter of the input string must belong to the character class string passed in as a child parameter.
Parameters:
- The character class set.
This function takes in one parameter, called the character class set. It is a string value containing one or more characters.
Output object properties:
- matchFound: true if the input string begins with a letter from the character class set. False otherwise.
- matchString: Let N be a number equal to the number of consecutive characters in the input string, starting from the first letter, which belong to the character class set. The matchString property will be the string of length N taken from the beginning of the input string.
Example:
character class
abcde
In the above parser specification, if given the input string 'f', the output object will be {matchFound: false, matchString: ''}. If given the input string 'e', the output object will be {matchFound: true, matchString: 'e'}. If given the input string 'eef', the output object will be {matchFound: true, matchString: 'ee'}.
Description: Determines whether the input string starts with the reference string passed in as parameter 1.
Match condition: The input string must start with the reference string.
Parameters:
- The reference string.
This function takes in a reference string which will be used for comparison. If the input string starts with the reference string, 'matchFound' will be set to true in the output object.
Output object properties:
- matchFound: true if the input string begins with the reference string. False otherwise.
- matchString: If 'matchFound' is true, then matchString' will be equal to the length of the referene string. Otherwise, it will be the empty string.
Example:
string literal abcde
If given the input string 'abcdf', the output object will be {matchFound: false, matchString: ''}. If given the input string 'abcde', the output object will be {matchFound: true, matchString: 'abcde'}. If given the input string 'abcdefg', the output object will be {matchFound: true, matchString: 'abcde'}.
Description: This function takes in one parameter, which will be another function. When given an input string, this function will pass the input string to its child function and return an output object with a matchFound property which will have the opposite truth value to the one produced by its child node.
Match condition: If the child node matches, this function will not match. If the child node does not match, this function matches.
Parameters:
- A function.
Output object properties:
- matchFound: true if its child node's 'matchFound' value is false, and false if its child node's 'matchFound' value is true.
- matchString: equal to the input string if matchFound returns true. Equal to '' if matchFound returns false.
Example:
not
string literal
This is a test.
If given an input string that does not start with 'This is a test.', this function will return an output object with a 'matchFound' value of true. Its 'matchString' value will be equal to the input string. If given an input string that does start with to 'This is a test.', this function will return {matchFound: false, matchString: ''}.
Description:
This function is used to differentiate between a child function that matches with the entire input string and a child function that matches just the beginning. It takes on one parameter, a child function, and returns true only if the child function matches with the input string and the string that was matched by the child function is equal in length to the entire input string.
Match condition: The child node must match with the input string and there must be no additional letters in the input string which are not part of string matched by the child node.
Parameters:
- A function.
Output object properties:
- matchFound: true if the length of the 'matchString' property of the child node's output object is equal to the length of the input string.
- matchString: equal to the length of the input string if matchFound is true. Otherwise, it is equal to ''.
Example:
entire
string literal
Hamilton
Here, if the input string is 'Hamilton', the output object will be {matchFound: true, matchString: 'Hamilton'}. If the input string is 'Hamiltons', the output object will be {matchFound: false, matchString: ''}.
Description:
This function takes in multiple child functions as parameters. It will feed the input string to its first child node. If the first child node matches, the input string is stripped of the portion that was matched by the child and the remainder of the input string is fed to the next child node. This process repeats until the last child node obtains and processes its input string. If at any point a child node fails to match with the input string it receives, the 'sequence' function will not match and produce an output object with a 'matchFound' value set to false. If all child nodes match with the input string, the 'sequence' function will return a string equal to the concatenation of the strings matched from all of its children.
Match condition: A match is produced if all child elements match.
Parameters:
This function is variadic and can take in any number of child functions as parameters.
Output object properties:
- matchFound: true if the 'matchFound' property of each child node is also true.
- matchString: equal to the concatenation of the 'matchString' property of the output objects produced by all child nodes.
Example:
sequence
string literal
One of the Hamiltons in the world
string literal
is a city
string literal
in Ontario,
string literal
which is a province of Canada.
In this example, this parser description will match with any string that starts with 'One of the Hamiltons in the world is a city in Ontario, which is a province of Canada.'
Description:
This function will have one or more child expressions and will return a match if any of its children match with the input string.
Match condition:
One or more child functions must match against the input string.
Parameters: This is a variadic function and takes in any number of child functions as parameters.
Output object properties:
- matchFound: true if any of the child nodes match with the input string.
- matchString: equal to the matchString property of the first child node that matches with the input string.
Example:
or
string literal
a
character class
xyz
This example parser specification will match with either a string that starts with 'a', or a string that is made from only the characters 'x', 'y', and 'z'.
Description:
This function has multiple child nodes and will produce a match if all of its child nodes produce a match when passed the input string.
Match condition: All child nodes must match with the input string.
Parameters: This is a variadic function.
Output object properties:
- matchFound: true if all child nodes match with the input string. False otherwise.
- matchString: If matchFound is false, the 'matchString' property will be ''. If matchFound is true, then the 'matchString' property is calculated in the following way: Let N be the length of the shortest matchString of all of its children. matchString is equal to the first N characters of the input string.
Example:
and
character class
bc
character class
ab
In this example, the parser specification will match with any string containing one or more copies of the letter 'b'.
Description:
The 'multiple' function is meant to match against an input string multiple times. It takes one child node as a parameter and will return a match if its child node matches with the input string. The child node of the 'multiple' function will continue to match against the unmatched part of the input string repeatedly until no more matches are found. Each time a match is found, it will be will be concatenated with any previously matched strings. The final concatenated match string will be stored in the 'matchString' property of the output object.
Match condition:
The child node must produce a match.
Parameters:
- A function
Output object properties:
- matchFound: true if the child function returns a match. False otherwise.
- matchString:
Example:
multiple
string literal
ab
In this example, the parser specification will match against the strings 'ab', 'abab', 'ababab', or any string made up of multiple copies of the string 'ab'.
Description:
The 'optional' function will always produce a match. Its purpose is to handle syntax that can be either present or absent. If a match is found, the output string will be equal to the matched portion of the input string. If a match is not found, the output string will be the empty string.
Match condition: None. This function always produces a match.
Parameters:
- A function
Output object properties:
- matchFound: true
- matchString: If the child node matches with the input string, this 'matchString' property of the output object will be equal to the child node's output object's 'matchString' property. If the child node does not match with the input string, the 'matchString' property will be set to the empty string.
Example:
sequence
string literal
A
optional
character class
0123456789
In the above example, the input string will match if it is equal to the string 'A'. It will also match if it starts with an 'A' and is followed by one numeral.
Description:
The purpose of the 'split' function is primarily to partition the parser's nodes into multiple branches. The input string will be passed to the first child node. The second, third, or other child nodes will be left unused until called upon by other expressions requiring their presence, such as the 'jump' and 'name' functions.
Match condition: A match is produced if the first child node produces a match.
Parameters:
- A function. The first child node will always be executed.
- All child nodes following the first are optional. If present, these child nodes should be functions.
Output object properties:
- matchFound: If the first child node produces a match, matchFound is true. Otherwise, it is false.
- matchString: Value will be the same result as the first child node's output object's 'matchString' value.
Example:
split
string literal
'x'
string literal
'y'
The above input string will match if it is equal to the string 'A'. The second string literal node will never get used. For an example where the multiple branches of a split function are used, see the example for the 'jump' function.
Description:
This function passes in the input string to its second child node and produces a match if its second child produces a match. The first parameter of the 'name' node is used to associate a string label with it and can be used in conjunction with 'jump' nodes by serving as a target for a 'jump' function.
Match condition: True if its second child node produces a match.
Parameters
- A string.
- A function.
Output object properties:
- matchFound: True if the second child produces a match.
- matchString: Same as the matchString property of the child node's output object.
Example:
name
California
string literal
'A'
In the above example, the string associated with the 'name' node is 'California'. When given the input string 'A', the above example will produce a match. See the 'jump' example to see how the 'split', 'name' and 'jump' nodes act in combination.
Description:
The 'jump' function will cause execution to continue at the 'name' node with the same label as that specified by the first 'jump' function. When execution reaches a 'jump' function, the input string will be passed to the 'name' node that serves as the jump target.
Match condition: Will produce a match if the 'name' node with the matching label produces a match.
Parameters:
- A string. Must match with the string label of a 'name' function.
Output object properties:
- matchFound: True if the 'name' node jump target produces a match when passed the input string.
- matchString: Same as the 'matchString' property of the output object of the 'name' node serving as the jump target.
Example usage:
split
jump
California
name
California
string literal
'A'
In the above example, the input string is first passed to the 'split' node, then to the 'jump' node. At this point, the 'jump' node specifies a child node name of 'California'. During matching, execution will jump to the second child node of the split node because it is a name node with the label 'California'. The 'California' name node will then execute its second child node's matching function, which detects an 'A'. Thus, the above example essentially returns a match if the input string starts with the string literal 'A'.