MiniJava is a subset of Java. It's a simple language for educational purposes.
Here is the grammar that I used to build the compiler. The Compiler in this repo is MiniJava to X64_86.
THIS COMPILER IS FULLY HAND WRITTEN. NO TOOLS WERE USED.
The compiler is still under development here is the stages:
Lexer: finished.
Parser: finished.
AST Building: finished.
Type Checker has to passes:
Symbol Table building: finished.
Checking phase by consulting symbol table: finished.
Code Generator: finished.
$ make
to build MiniJava
from the source you will get MiniJava
executable
run ./MiniJava
with -h
flag to see the help list.
You can Run with the following flags:
-h
to show the help list.-S
to see the program from theLexer
prospective (Tokens).-P
to see the program from theParser
prospective for example:
if we havex + y * z
it becomes(x + (y * z))
.-T
to run the type checker.-C
to run the Code Generator, you will get a file called demo.s
underruntime/
this has the assembly code for your program.- After Generatin the assembly code you can get the executable of your program
by executingmake run
, the executable name isout
run the executable by typing./out
This is a side-project I totally built myself. The project
is for educational purposes. I studied compiler theories, and
I wanted to implement the theories I learned.
I realy thank University of Washington for making CSE P 501 Open source. This course is a great help and helped me a lot.
To understand the code, you don't have to know Compilers' theories.
But here are some highlights, that might help:
- Regular expressions
- Compiler components: Lexer, Parser, Type checking, IR, and Code generation.
- Know the order of compiler phases, and why we use this order?
- What the AST? Why We use it?
- What is the recursive-descent parser?
- Hash tables
- Graph theory
The lexer does the Lexical Analysis. Lexer implementaion is in Lexer.h
and Lexer.cpp
Here are most important functions in the Lexer:
getToken()
: Reads the Token stream, and put each group of
charachters in the suitable group e.g, Key word, operation, ...etc.getNextToken()
: Gets the next token in the input stream.lookahead()
: Retuerns the next token in the input stream
used in Parser as it's LL(1) typeeat(TOK, string)
: Takes two arguments the token, and the string form of it.
Consumes the token if it's the expexted token, or emits an error.
The Parser is hand-crafted LL(1)
. I didn't use any tools to build it,
as I wanted to learn how Parsers are built internally.
We needed to use lookahead()
to check parse variable declarations as
they have the same start tokens as assignment statements see grammar
NOTE: The grammar is not LL but I converted it to be LL.
Files Parser.cpp
, and Parser.h
The compiler Prints Colored error messages. it continues parsing always, but
It exits if the source has 10 errors, couldn't parse the main class,
If there's an error in a class, or method declaration, the compiler skips
the class, or the method, but it emits a message of the error and a note
that it's skipping the operation of the current class or method.
If there is an error in a variable declaration, it skips till the next
semi-colon.
You can see the source program from the parser view by passing the -P
flag.
The Error message has a correct location for the parse error, but
if the error is the last charchter in the line it expects the correct token to
be in the next line for example:
1 int x
2 int z;
output:
expected `;` at line 2
it expectes ;
instead of int.
The implementation is in Error.h
and Error.cpp
The Parser has two important jobs:
- Check the syntax and make sure it's error free.
- Build the AST.
The AST is very important we use it in type-checking, and code generation.
AST implementaion is inAST.h
, andAST.cpp
.
NOTE: For Any AST traversing in the compiler I used visitor pattern: - Print the source from the parser view.
- Building The symbol table.
- Type-checking.
- Code generation.
After done with parsing we should check the semantics of the source code.
We do this by two passes:
- Build a symbol table of every symbol in the program, and store its scope.
- Perform the type-checking consulting the symbol table.
We use the builtin hashmap std::map
for the symbol table enteries.
This phase is seriously challenging assume the following problems:
- Variables and methods' Scopes, and overriding.
Assume the following case:
The compiler must handle the above case perfectly. in
class A { int a; int fii() {int a; return a;} int tii() {retuen 0;} } class B extends A { int a; int foo(int a) {return a;} int fee() {return a;} int tii() {return 6;} int a() {return a;} }
SymbolTable/SymbolTable.h
andSymbolTable/SymbolTable.cpp
, you see
the interface, and implementation of symbol table.
but the functions that interests you areenterScope(string)
, and
exitScope()
, these functions store the information of every scope it
hits. before entering a new scope,enterScope
creates a Symbol table
entry, store the parent scope for the current scope and start storing it's
information.
After done with the current scopeexitScope
return the control to the
parent scope. - Cyclic inheritance.
Assume the following case:
How can we know this? the solution is simple, just build a graph using the class relations. every node is a class, and every directed edge represents
class A extends B {} class B extends C {} class C extends A {}
extends
, and check for the cycles using simpledfs
.- Class appears before it's used. This problem doesn't exist in C / C++, for example:
orclass A extends B {} class B {}
How can we know this? answer:Class A {B b;} Class B {A a;}
- For the first case, class
B
must be in the symbol table before class
A
, so we usetopological sort
to know the order. - For the second case, If class
B
doesn't use classA
object,
we could usetopological sort
, but there's a cycle in declaration.
how could we solve this? answer, just use amap
e.g,map<string, bool>
and before building the symbol table just store all classes.
If we find the class in map then we don't emit an error. For example:
the map has two values: {{A, true}, {B, true}}, the we start building the
symbol table.
Merging the above two solutions, sort, and store, we can build the symbol table
successfully.
After building the Symbol table this phase becomes easy. Just use
getScope
, and exiteScope
for every scope you hit. The type-checker
consults the symbol table about variables and classes, e.g:
- Return type mismatch.
- Use of undefined variable.
- Check if the object has the method e.g,
a.foo()
. - Declare variable of undefined types e.g:
class A {B b;}
- non-integer array index.
- non-boolean
if
expression. - and many more...
You can find The Symentic Phase code in:
SymbolTable/
, TypeCheckerVisitor.h
, TypeCheckerVisitor.cpp
,
and TypeVisitor.h
This phase was the most difficult although it seems easy, but the generated
code had many Segmentaion fault
errors most of them was because of stack alignment
.
The technique used here is so simple traverse the AST in execution order
and emit the code in visitor methods.
One strategy that made the implementation easy is to treat X86 as a 1-register
language with a stack for intermediate values all operations writes %rax
register
The code however uses other register for some operations but not much.
function calls was a little bit trickt to implement but the code uses O(1) dynamic dispatch.
you can find the code in CodeGeneratorVistor.h
and CodeGeneratorVisitor.cpp
The MiniJava program should have runtime environment e.g space for heap, and stack,
%rsp
and the other registers should have correct initial values, and a way
to allocate memory for objects, and arrays.
We use the existing C
runtime, under directory runtime/
you will find
a C
program boot.c
. This program calls the MiniJava main function
as if it were a C
function. By using this technique, it's possible
to call standard C
functions, and the program will now use the C
ennvironment.
I wrote All files in this project except runtime
, and SamplePrograms
they are from CSE P 501
starter code, however you can use this code freely.