mesut146/parserx

parserx

A lexer and parser generator and grammar toolkit written in Java.

Features

  • accepts regex-like grammars (EBNF)
  • epsilon removal
  • left recursion removal (direct and indirect)
  • left factoring
  • EBNF to BNF conversion
  • LR(0), LR(1), LALR(1) parser generator
  • table-based parser and State->Method based parser
  • outputs AST/CST
  • LL(1) recursive descent parser generator
  • dot graphs of NFA, DFA, LR(0), LR(1), LALR(1)
  • DFA minimization
  • lexer generator
  • precedence tool (removes any precedence conflict)

Examples are in the examples folder.

Grammar Format

comments

//this is a line comment

/* this is a
multiline comment */

top level

To include another grammar, use:

include "<grammar_name>"

e.g. include "lexer.g"

options

options{
  <option_name> = <value>
  ...
}

token definitions

token{

  <TOKEN_NAME> <separator> <regex> <SEMICOLON>
  //where separator is one of ':' , '=' , '::=' , ':=' , '->'
}

e.g

token{
  NUMBER: [0-9]+;
  IDENT: [a-zA-Z_] [a-zA-Z0-9_]*;
}

Prefixing a token name with '#' makes that token a fragment: it can be referenced from other tokens, but no DFA is generated for it.
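As a minimal sketch of how a fragment might be used (the token names DIGIT and HEX_DIGIT are hypothetical, and referencing a fragment by its name without the '#' is an assumption, not something stated above):

```
token{
  // fragments: usable as references, no DFA of their own
  #DIGIT: [0-9];
  #HEX_DIGIT: [0-9a-fA-F];

  // real tokens built from the fragments
  NUMBER: DIGIT+;
  HEX: "0x" HEX_DIGIT+;
}
```

Under that assumption, only NUMBER and HEX would be produced by the lexer. DIGIT and HEX_DIGIT exist only to avoid repeating the character classes.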

rule definitions

<RULE_NAME> <separator> <regex> <SEMICOLON>

e.g

assign: left "=" right;
left: ident;
right: ident | literal;

regex types

alternation

r1 | r2 | r3

sequence

r1 r2 r3

repetition

r* = zero or more times(kleene star)
r+ = one or more times(kleene plus)
r? = zero or one time(optional)

grouping

(r) you can group complex regexes in tokens and rules
e.g a (b | c+)

epsilon

use %empty, %epsilon or ε for epsilon
e.g rule: a (b | c | %epsilon);

ranges (token only)

Place ranges or single characters inside brackets, without quotes:
[start-end single]

e.g. id: [a-zA-Z0-9_];

Escape sequences are also supported,
e.g. ws: [\u00A0\u000A\t];

A leading ^ negates the set, e.g. lc: "//" [^\n]*;

strings

use double quotes for your strings
e.g stmt: "if" "(" expr ")" stmt;

Strings in rules are replaced with references to the matching tokens declared in the token block,
so the strings in the example above would need to be declared like this:

token{
  IF: "if";
  LP: "(";
  RP: ")";
}

start directive

For LR parsing, you have to specify the start rule with %start
e.g %start: expr;

assoc directives

%left <TOKEN_LIST>
%right <TOKEN_LIST>

precedence

Precedence is resolved in favor of the alternative declared earlier, e.g. E: E "*" E | E "+" E | NUM;
In the example above, multiplication takes precedence over addition because it is declared first.
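For comparison, here is a sketch of the layered grammar you would otherwise write by hand to get the same precedence. The helper rule name T is hypothetical, and left associativity is assumed purely for illustration:

```
// "+" is the lowest level, so it sits at the top
E: E "+" T | T;
// "*" binds tighter, so it sits one level down
T: T "*" NUM | NUM;
```

With either form, an input such as NUM "+" NUM "*" NUM groups as NUM "+" (NUM "*" NUM).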

skip block

Tokens declared in a skip block are ignored by the parser, so you can use it for comments and whitespace.

skip{
  comment: "//" [^\n]*;
}
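Putting the pieces together, a small complete grammar might look like the sketch below. It uses only the constructs described in this document. The rule and token names are hypothetical, and the ordering of the blocks is an assumption. See the examples folder for grammars known to work.

```
/* tokens referenced by the rules below */
token{
  NUMBER: [0-9]+;
  IDENT: [a-zA-Z_] [a-zA-Z0-9_]*;
  EQ: "=";
  SEMI: ";";
}

/* whitespace and line comments never reach the parser */
skip{
  ws: [\u0020\t\n]+;
  comment: "//" [^\n]*;
}

%start: program;

program: assign*;
assign: IDENT "=" value ";";
value: IDENT | NUMBER;
```

The strings "=" and ";" in the rules are covered by the EQ and SEMI declarations in the token block.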
