A lexer, parser, assembly generator, and code emitter for a multi-language, multi-stage compiler that transforms a minimal C-like language into platform-aware x64 assembly.
This project demonstrates tokenization of a C program, parsing of the tokens, assembly generation, and x64 assembly code emission: a compiler pipeline that starts from a C source file and ends with a runnable assembly file. It is composed of two main repositories working in concert:
- obvcc-lexer (Rust): A lexical analyzer (lexer or tokenizer) for a simplified C-like programming language, implemented in Rust. It reads source code from a file (or a default string) and produces a stream of tokens, which are output as a JSON array to standard output. If errors are encountered during lexing, a JSON object describing the error is produced.
- obvcc-toolchain (OCaml): This repository consumes the lexer's output and performs all subsequent stages: parsing, semantic analysis, and assembly code generation.
The primary goal is to build and understand a complete, working toolchain, leveraging the strengths of both Rust (for performance-critical I/O) and OCaml (for its powerful type system and elegance in Abstract Syntax Tree transformations).
- The Compiler Pipeline
- The Compilation Stages
- Current Compiler Capabilities
- Project Architecture
- Getting Started
- Error Handling
- Potential Extensions
- License
This project implements a classic sequential pipeline, where the output of each stage becomes the input for the next, transforming the code from a high-level language into low-level machine instructions.
```mermaid
graph TD;
    A["C Source File (.c)"] -->|"Lexing (Rust)"| B("Token Stream JSON");
    B -->|"Parsing (OCaml)"| C{"C Language AST"};
    C -->|"Assembly Generation (OCaml)"| D{"x64 Assembly AST"};
    D -->|"Code Emission (OCaml)"| E["Formatted Assembly File (.s)"];
    E -->|"Assembling (GCC/Clang)"| F("Executable");
```
Each stage performs a distinct transformation on the program's representation.
The process begins with the obvcc-lexer. Its job is to scan the raw source text and break it into a stream of atomic units called tokens.
- Input: C source code (e.g., `int main() { ... }`).
- Process: Identifies keywords (`int`), identifiers (`main`), punctuation (`{`, `}`), and literals (`42`).
- Output: A JSON array of these tokens, ready for the next stage.
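For illustration, a serialized token stream for a small program might take a shape like the one below. The actual variant and field names are defined by the lexer's `Token` enum and its JSON serializer, so treat this as a hypothetical sketch rather than the exact format:

```json
[
  { "Keyword": "Int" },
  { "Identifier": "main" },
  "OpenParen",
  { "Keyword": "Void" },
  "CloseParen",
  "OpenBrace",
  { "Keyword": "Return" },
  { "Constant": 2 },
  "Semicolon",
  "CloseBrace"
]
```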
- Tokenizes basic C-like constructs:
  - Keywords: `int`, `void`, `return`
  - Identifiers (e.g., `main`, `variableName`, `_foo123`)
  - Integer Constants (e.g., `0`, `123`, `42`)
  - Punctuation: `(`, `)`, `{`, `}`, `;`
- Skips Whitespace: Ignores spaces, tabs, and newlines between tokens.
- Handles Comments:
  - Single-line comments: `// ...` until end of line
  - Multi-line comments: `/* ... can span multiple lines ... */` (non-nested)
- Error Reporting: Produces structured JSON output for lexical errors, including:
  - `UnexpectedCharacter`: when a character is found that cannot start any known token.
  - `InvalidInteger`: when a numeric literal is malformed or out of range (for `i32`).
  - `NoMatch`: a fallback for when no token rule applies at a position.
- JSON Output: Outputs the token stream or error information in JSON format for easy interoperability with other tools or compiler stages written in different languages.
- Modular Design: The lexer logic is organized into sub-modules for clarity:
  - `token.rs`: defines the `Token` enum.
  - `error.rs`: defines the `LexerError` enum.
  - `core.rs`: contains the `Lexer` struct and core tokenization logic.
  - `mod.rs`: aggregates the lexer module and provides its public API.
The parser gives structure to the flat stream of tokens. It verifies that the tokens conform to the language's formal grammar and builds a hierarchical representation of the code.
The parser's primary responsibilities are:
- Input: To consume a linear sequence of tokens produced by the lexer.
- Syntactic Analysis: To verify that this sequence of tokens conforms to the formal grammar of the source language. If it does not, the parser reports a syntax error.
- Output: To produce a tree-like data structure called an Abstract Syntax Tree (AST) that represents the hierarchical structure and meaning of the source code.
This project implements a recursive descent parser, which is a top-down parsing strategy. The core idea is to create a set of mutually recursive functions, where each function is responsible for parsing one non-terminal symbol from the language's grammar. This approach makes the parser code a direct, readable reflection of the formal grammar.
The construction of this parser relies on three distinct but related artifacts:
1. The AST Definition (ASDL)
The Abstract Syntax Tree (AST) is the output of the parser. It's a hierarchical representation of the program's meaning, stripped of syntactic details like semicolons and braces.
ASDL for this project:

```
program = Program(function_definition)
function_definition = Function(identifier name, statement body)
statement = Return(exp)
exp = Constant(int)
```

In our OCaml project, this is implemented using algebraic data types in `lib/ast.ml`.
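A minimal sketch of what those algebraic data types might look like; the constructor names follow the `S_`/`E_` prefixes used later in this README, but the real `lib/ast.ml` may differ in detail:

```ocaml
(* Sketch of the C AST as OCaml algebraic data types, one per ASDL rule. *)
type exp = E_Constant of int
type statement = S_Return of exp
type function_definition = Function of string * statement  (* name, body *)
type program = Program of function_definition
```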
2. The Formal Grammar (EBNF)
The formal grammar defines the concrete syntax of the language. It guides the parser's logic by specifying exactly which tokens are required and in what order.
EBNF for this project:
```
<program> ::= <function>
<function> ::= "int" <identifier> "(" [ "void" ] ")" "{" <statement> "}"
<statement> ::= "return" <exp> ";"
<exp> ::= <integer_literal>
```
3. The Parser Implementation
The parser's code is the bridge between the formal grammar and the AST. It consumes tokens according to the grammar rules and constructs the corresponding AST nodes.
Pseudocode for `parse_statement`:

```
parse_statement(tokens):
    expect("return", tokens)        // Consume a terminal from the grammar
    return_val = parse_exp(tokens)  // Call a function for a non-terminal
    expect(";", tokens)             // Consume another terminal
    return Return(return_val)      // Construct and return an AST node
```
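In OCaml this translates almost line for line. The sketch below reuses the AST types sketched earlier and invents a small `token` type, an `expect` helper, and a local `SyntaxError` exception for self-containment; the real `lib/parser.ml` (which raises the `SyntaxError` defined in `Ast`) may use different names:

```ocaml
(* Hypothetical token constructors; the real Token type is richer. *)
type token = KwReturn | Semicolon | IntLiteral of int

exception SyntaxError of string

(* Consume one expected terminal or fail. *)
let expect (expected : token) (tokens : token list) : token list =
  match tokens with
  | t :: rest when t = expected -> rest
  | _ -> raise (SyntaxError "unexpected token")

(* <exp> ::= <integer_literal> *)
let parse_exp (tokens : token list) : exp * token list =
  match tokens with
  | IntLiteral n :: rest -> (E_Constant n, rest)
  | _ -> raise (SyntaxError "expected an integer literal")

(* <statement> ::= "return" <exp> ";" *)
let parse_statement (tokens : token list) : statement * token list =
  let tokens = expect KwReturn tokens in
  let return_val, tokens = parse_exp tokens in
  let tokens = expect Semicolon tokens in
  (S_Return return_val, tokens)
```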
This is the first step of translation towards the target machine. The assembly generator walks the high-level C AST and translates it into a lower-level, structured representation of assembly code.
This stage is another transformation from one tree structure to another. It is defined by two key artifacts: the target Assembly AST and the translation logic.
1. The Assembly AST Definition (ASDL)
This defines the structure of our assembly language representation. Keeping assembly in this structured form (instead of printing strings directly) allows for easier manipulation and platform-specific logic.
ASDL for the Assembly AST:

```
program = Program(function_definition)
function_definition = Function(identifier name, instruction* instructions)
instruction = Mov(operand src, operand dst) | Ret
operand = Imm(int) | Register
```

In our OCaml project, this is implemented in `lib/assembly_ast.ml`.
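A sketch of the corresponding OCaml types; the `A_`/`I_` constructor names are taken from the translation example below, while the rest is an assumption about how `lib/assembly_ast.ml` is laid out:

```ocaml
(* Sketch of the Assembly AST, mirroring the ASDL above. *)
type operand = A_Imm of int | A_Register
type instruction = I_Mov of operand * operand | I_Ret
type function_definition = Function of string * instruction list  (* name, instructions *)
type program = Program of function_definition
```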
2. The Translation Logic
The assembly_generator.ml module contains the functions that traverse the C AST and build the Assembly AST. A key concept is that a single, high-level C node may expand into multiple low-level assembly instructions.
Example Translation: A `Return` node from the C AST...

```ocaml
(* C AST Node *)
S_Return (E_Constant 42)
```

...is translated into a list of assembly instructions:

```ocaml
(* Assembly AST Nodes *)
[ I_Mov (A_Imm 42, A_Register); I_Ret ]
```

This corresponds to `movl $42, %eax` followed by `ret`.
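A minimal sketch of that translation step, assuming the C and Assembly AST types sketched above live in `Ast` and `Assembly_ast` modules (the real `assembly_generator.ml` may be structured differently):

```ocaml
open Assembly_ast

(* Translate one C statement into a list of assembly instructions;
   a single high-level C node expands into multiple low-level ones. *)
let translate_statement (stmt : Ast.statement) : instruction list =
  match stmt with
  | Ast.S_Return (Ast.E_Constant n) -> [ I_Mov (A_Imm n, A_Register); I_Ret ]
```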
The code emitter performs the final translation. It traverses the structured Assembly AST and prints the final, formatted text.
- Input: The Assembly AST.
- Process: Each node of the Assembly AST is converted into its textual x64 assembly equivalent. This stage handles platform-specific syntax like `_main` for macOS and `.note.GNU-stack` for Linux.
- Output: A complete, formatted assembly file (`.s`) ready to be assembled.
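A sketch of how `code_emitter.ml` might render individual nodes, assuming the Assembly AST types sketched above are in scope; the platform-specific label and section handling is omitted, and the exact formatting is an assumption:

```ocaml
(* Render one operand of the Assembly AST as AT&T-syntax text. *)
let emit_operand = function
  | A_Imm n -> Printf.sprintf "$%d" n
  | A_Register -> "%eax"

(* Render one instruction as a line of assembly text. *)
let emit_instruction = function
  | I_Mov (src, dst) ->
      Printf.sprintf "\tmovl %s, %s" (emit_operand src) (emit_operand dst)
  | I_Ret -> "\tret"
```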
The compiler can successfully process programs that have the following structure:
- A single `int` function with an identifier (e.g., `main`).
- The function's parameter list can be `()` or `(void)`.
- The function body must contain exactly one `return` statement.
- The `return` statement must return a single integer literal.
```c
int main(void) {
    return 42;
}
```

The output for this code will be a runnable assembly file that, when executed, produces an exit code of 42.
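On Linux, the emitted `.s` file would look roughly like the following; the exact directives depend on the emitter, but the `movl`/`ret` body matches the translation shown earlier, and the `.note.GNU-stack` section is the Linux-specific detail mentioned above:

```asm
	.globl main
main:
	movl $42, %eax
	ret
	.section .note.GNU-stack,"",@progbits
```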
This Rust + OCaml project is structured as a single directory containing two separate subdirectories, obvcc-lexer and obvcc-toolchain. The obvcc-lexer directory holds the Rust lexer, which uses Cargo as its package manager and build system; the obvcc-toolchain directory holds the OCaml toolchain, which uses the Dune build system.
```
obvcc/
├── obvcc-lexer/
│   ├── Cargo.toml                # Rust project configuration, dependencies
│   ├── src/                      # Source code directory
│   │   ├── main.rs               # Application entry point, handles I/O and orchestrates the lexer
│   │   └── lexer/                # Lexer module directory
│   │       ├── mod.rs            # Lexer module entry point, re-exports public items, tests
│   │       ├── token.rs          # Token enum definition
│   │       ├── error.rs          # LexerError enum definition
│   │       └── core.rs           # Lexer struct and core tokenization logic
│   └── target/                   # Build artifacts (generated by `cargo build`)
└── obvcc-toolchain/
    ├── bin/                      # Executable code
    │   ├── dune
    │   └── main.ml               # Entry point: orchestrates the full pipeline
    ├── lib/                      # Library code (reusable)
    │   ├── ast.ml                # Defines C AST, token types, and exceptions
    │   ├── ast.mli               # Public interface for the Ast module
    │   ├── assembly_ast.ml       # Defines the x64 Assembly AST types
    │   ├── assembly_generator.ml # Logic to convert C AST -> Assembly AST
    │   ├── code_emitter.ml       # Logic to convert Assembly AST -> Assembly String
    │   ├── dune
    │   ├── parser.ml             # The recursive descent parsing logic
    │   ├── parser.mli            # Public interface for the Parser
    │   ├── token_stream.ml       # Deserializes JSON into a token list
    │   └── token_stream.mli      # Public interface for the Token_stream
    ├── dune-project
    └── obvcc_toolchain.opam      # Project metadata and dependencies
```
Run the project on Gitpod with instant setup, or set it up locally. You will need:
- Rust and Cargo (for the lexer).
- OCaml (version 5.0+) and OPAM (the OCaml Package Manager).
- The Dune build system.
- A C compiler like `gcc` or `clang` (for assembling the final output).
1. Clone this repository:

   ```sh
   git clone https://github.com/0bVdnt/obvcc.git
   cd obvcc
   ```

2. Build the Rust Lexer:

   ```sh
   cd obvcc-lexer
   cargo build --release
   cd ..
   ```

   This command will create an `obvcc-lexer` executable in the `target/release` directory.

3. Install OCaml Dependencies and Build the Compiler:

   ```sh
   cd obvcc-toolchain
   opam install dune yojson
   dune build
   ```

   This command will create the `main.exe` executable inside the `_build` directory.
The full pipeline involves two main steps: running the lexer, then running the OCaml toolchain on its output.
1. Run Lexer: Create a source file `test.c` and run the lexer, saving the output to a JSON file.

   ```sh
   # From the parent directory of both repos
   ./obvcc-lexer/target/release/obvcc-lexer ./obvcc-toolchain/test.c > ./obvcc-toolchain/test.json
   ```

2. Run OCaml Compiler: Navigate into the `obvcc-toolchain` repo and run the executable on the generated JSON file.

   ```sh
   cd obvcc-toolchain
   dune exec -- ./bin/main.exe test.json
   ```

   This will generate a `test.s` assembly file.

3. Assemble and Run: Use `gcc` or `clang` to create a final executable from the assembly file.

   ```sh
   gcc -o test_executable test.s
   ./test_executable
   echo $? # Should print the value from your return statement
   ```
The compiler is designed to fail gracefully at each stage:
- Lexer: Reports errors if the source code contains invalid characters.
- Parser:
  - Raises a `DeserializationError` if the input is not valid JSON or if the lexer reported an error.
  - Raises a `SyntaxError` if the token stream violates the language grammar.
- All errors are printed to standard error (`stderr`), and the program terminates with a non-zero exit code.
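A sketch of how `bin/main.ml` might wire this up. Here `compile_file` is a hypothetical stand-in for the real pipeline driver, and the exceptions are assumed to be declared in the `Ast` module with string payloads; the actual names and payloads may differ:

```ocaml
(* Run the pipeline and map known failures to stderr plus exit code 1. *)
let () =
  try compile_file Sys.argv.(1) (* hypothetical driver function *) with
  | Ast.DeserializationError msg ->
      Printf.eprintf "Deserialization error: %s\n" msg;
      exit 1
  | Ast.SyntaxError msg ->
      Printf.eprintf "Syntax error: %s\n" msg;
      exit 1
```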
This compiler is a solid foundation that can be extended to support a more complex language. Future work could include:
- Parsing and generating code for complex expressions (binary/unary operators).
- Adding support for variables, scope, and assignment.
- Implementing control flow statements like `if`/`else` and loops.
- Parsing function arguments and multiple functions.
This project is licensed under the MIT License.