A minimal pseudo-language compiler intended purely for teaching and demonstration.
Note: You can consult zread for a quick start guide (available in English or Simplified Chinese), but please avoid over-reliance on zread. AI-generated summaries may occasionally be affected by hallucinations and deviate from actual content. Refer to the documentation and doxygen comments for the most accurate information.
This project is derived from a distributed educational compiler framework, but has since been fully refactored to demonstrate a modern C++20-style compiler pipeline with value semantics, closed ADTs, and explicit memory management.
⚠️ Educational use only
This repository is not intended for production, distribution, or real-world language implementation.
If you use AI to analyze this repository, emphasize the Doxygen comments and the documentation in your prompt so that the AI does not over-interpret implementation code. Reasoning from implementation code instead of the documented API is an antipattern in modern C++ (mcpp) and is misleading: in C++, the API defines the semantic boundaries and the documentation describes the behavior. Implementation details offer no practical benefit to users of the code. They matter to someone rewriting it, and even maintainers should not obsess over them.
PseudoCompiler demonstrates a complete but minimal compilation pipeline:
- Lexical analysis (Flex)
- Parsing (Bison)
- AST construction
- IR (Intermediate Representation) generation
- Assembly code generation (x86-64 NASM-style)
The language and implementation are intentionally constrained to keep the design fully understandable and mechanically transparent.
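To make the last pipeline stage concrete, here is a minimal sketch of turning one IR-like instruction into NASM-style text. Everything in it is an assumption made for illustration: the `AddInstr` struct, the `emit_add` function, the use of `rax`, the `rel` addressing form, and the idea that an `int` occupies a 64-bit global slot are not taken from the repository.

```cpp
#include <string>

// Hypothetical three-address instruction: dst = lhs + rhs.
// All three operands are assumed to be names of global variables.
struct AddInstr {
    std::string dst;
    std::string lhs;
    std::string rhs;
};

// Emits NASM-style x86-64 text for one addition.
// Assumption: each variable is a 64-bit global addressed RIP-relative.
std::string emit_add(const AddInstr& i) {
    return "    mov rax, [rel " + i.lhs + "]\n"
           "    add rax, [rel " + i.rhs + "]\n"
           "    mov [rel " + i.dst + "], rax\n";
}
```

Calling `emit_add({"t0", "a", "b"})` yields three lines of assembly that load `a`, add `b`, and store the result in `t0`. The real code generator may choose different registers and addressing modes.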
This compiler implements a very small pseudo language with deliberately limited features.
See Grammar for details.
Only two built-in types exist:
- int
- string
There are no user-defined types, no arrays, no structs, and no pointers.
int a;
int b = 3;
string s = "hello";
- Declarations may list multiple identifiers
- Initialization is allowed only for single-variable declarations
- All variables are global in scope
Supported expressions include:
- Integer literals
- Identifiers
- Binary arithmetic: +, -, *, /
- Simple comparisons: ==, !=, <, <=, >, >=
if (a < b) {
print(a);
} else {
print(b);
}
while (a < 10) {
a = a + 1;
}
Supported constructs:
- if / else
- while
No for, no break, no continue.
print(123);
prints("hello");
- print(int)
- prints(string_variable) or prints(string_literal)
Internally, printing is modeled as a closed two-state operation rather than a string-dispatched command. Since only two print modes exist (integer vs string), the compiler represents the print kind using a small enum instead of string tags, avoiding unnecessary string comparisons during IR generation and code emission.
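As a rough illustration of the enum-based print kind, consider the sketch below. The names `PrintKind`, `PrintInstr`, and `runtime_routine`, as well as the routine labels it returns, are hypothetical and may not match the identifiers used in the repository.

```cpp
#include <string>
#include <string_view>

// Hypothetical names: the repository's actual identifiers may differ.
enum class PrintKind { Int, Str };

struct PrintInstr {
    PrintKind kind;       // which of the two print modes applies
    std::string operand;  // variable name or literal text
};

// Picks a (hypothetical) runtime routine label by switching on the enum.
// No string comparison is needed to decide the print mode.
constexpr std::string_view runtime_routine(PrintKind kind) {
    switch (kind) {
        case PrintKind::Int: return "print_int";
        case PrintKind::Str: return "print_str";
    }
    return "";  // not reached for valid enumerators
}
```

Because the set of kinds is closed, a `switch` over the enum covers every case, and most compilers can warn when a new enumerator is added without a matching branch.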
String literals are extremely restricted:
- Only raw characters are supported
- No escape sequences
  - ❌ \n
  - ❌ \t
  - ❌ \xNN
  - ❌ \u{...}
Strings are treated as verbatim byte sequences terminated by '\0'.
This is intentional: the goal is to demonstrate compiler structure, not string parsing complexity.
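The "verbatim bytes" rule can be pictured with a small sketch. The helper `literal_body` is hypothetical and is not the project's scanner code: it only shows that the text between the quotes is copied unchanged, so a backslash followed by `n` stays two separate bytes.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: strips the surrounding double quotes from a lexeme
// and returns the bytes in between without any escape processing.
// Returns std::nullopt if the lexeme is not wrapped in double quotes.
std::optional<std::string> literal_body(std::string_view lexeme) {
    if (lexeme.size() < 2 || lexeme.front() != '"' || lexeme.back() != '"') {
        return std::nullopt;
    }
    return std::string(lexeme.substr(1, lexeme.size() - 2));
}
```

With this sketch, the lexeme `"a\nb"` produces a four-byte body (`a`, `\`, `n`, `b`) rather than three bytes with an embedded newline.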
The original teaching framework relied on:
- Open inheritance hierarchies
- Virtual dispatch
- RTTI (dynamic_cast, typeid)
- Heap-heavy object graphs
This repository no longer follows that model.
All major language constructs are represented using closed algebraic data types:
- ASTNode → std::variant<...>
- IRInstr → std::variant<...>
Dispatch is performed using:
- std::visit
- if constexpr
- Compile-time type resolution
As a result:
- RTTI is completely removed
- The compiler is built with -fno-rtti
- No virtual functions or vtables exist in AST or IR
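The closed-ADT style can be sketched in a few lines. The node types `VarDecl`, `Assign`, and `Print` and the function `describe` are invented for this example and are much simpler than the real `ASTNode` alternatives. The point is only the dispatch mechanism: `std::visit` plus `if constexpr`, with no virtual functions and no RTTI.

```cpp
#include <string>
#include <type_traits>
#include <variant>

// Hypothetical node types, for illustration only.
struct VarDecl { std::string type; std::string name; };
struct Assign  { std::string name; std::string expr; };
struct Print   { std::string arg; };

// A closed set of alternatives: adding a node type means editing this alias.
using Node = std::variant<VarDecl, Assign, Print>;

// Dispatches on the active alternative at compile time.
std::string describe(const Node& node) {
    return std::visit([](const auto& n) -> std::string {
        using T = std::decay_t<decltype(n)>;
        if constexpr (std::is_same_v<T, VarDecl>) {
            return "decl " + n.type + " " + n.name;
        } else if constexpr (std::is_same_v<T, Assign>) {
            return "assign " + n.name;
        } else {
            // Fails to compile if a new alternative is added but not handled.
            static_assert(std::is_same_v<T, Print>, "unhandled node type");
            return "print " + n.arg;
        }
    }, node);
}
```

A forgotten case becomes a compile error here, whereas a failed `dynamic_cast` is only detected at run time.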
The IR layer is designed as pure data:
- IR instructions are immutable value objects
- Equality and hashing are explicitly defined
- No hidden ownership or polymorphic behavior
This makes the IR:
- Deduplicable
- Cache-friendly
- Deterministic
To address allocation pressure and fragmentation, the compiler integrates JH-Toolkit:
- IR instructions are stored in jh::conc::flat_pool
- This provides:
  - Arena-like contiguous storage
  - Key-based deduplication
  - GC-like reuse semantics
- No explicit pool shrinking is performed:
  - Compiler processes typically run once and exit
  - Capacity growth reflects legitimate workload demand
This eliminates IR-level heap fragmentation while keeping semantics simple.
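The idea of contiguous storage with key-based deduplication can be illustrated with standard containers. The `DedupPool` class below is a simplified, single-threaded stand-in written for this README. It is not the `jh::conc::flat_pool` API, and it does not model the pool's reuse semantics or concurrency behavior.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// NOT the jh::conc::flat_pool API: a hypothetical stand-in that only shows
// contiguous storage combined with key-based deduplication.
class DedupPool {
public:
    // Returns the slot index of an equal existing entry, or appends a new one.
    std::size_t intern(const std::string& key) {
        auto [it, inserted] = index_.try_emplace(key, slots_.size());
        if (inserted) {
            slots_.push_back(key);
        }
        return it->second;
    }

    const std::string& at(std::size_t i) const { return slots_.at(i); }
    std::size_t size() const { return slots_.size(); }

private:
    std::vector<std::string> slots_;                       // contiguous storage
    std::unordered_map<std::string, std::size_t> index_;   // key -> slot index
};
```

Interning the same key twice returns the same index and leaves the pool size unchanged, so repeated IR fragments cost one slot. Like the pool described above, this sketch never shrinks.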
- GNU C++20 (-std=gnu++20)
- Apple Clang ≥ 15
- LLVM Clang 20
- GCC ≥ 13 (GCC 14.3+ recommended)
- MinGW-based versions of the above (UCRT recommended for Windows)
Note: The assembly emitted by the pseudo compiler (the executable build target of this project) uses x86-64 NASM-style syntax and may not be compatible with non-x86-64 targets or with assemblers that expect a different syntax.
If your host environment is incompatible, use an x86-64 Linux environment in Docker or a VM.
- ❌ MSVC
MSVC is not supported due to incomplete or divergent behavior in modern C++20 features and ABI expectations required by this project.
- Flex
- Bison
- CMake ≥ 3.16
- Ninja (recommended)
- JH-Toolkit
  - License: Apache License 2.0
  - Repository: https://github.com/JeongHan-Bae/JH-Toolkit
JH-Toolkit is used only as a low-level utility library and does not impose runtime dependencies.
Note: the modules used here currently live only on the dev branch (1.4.0-dev) of JH-Toolkit.
That branch is almost stable but may see minor changes before the official release.
Once the release is published, the dependency will be updated accordingly.
See NOTICE for details.
mkdir -p build
cd build
cmake -G Ninja ..
ninja
./compiler [options]
- -src <path>: Source file path. If specified, a path argument is required.
- -target <path>: Output assembly file path. If specified, a path argument is required.
- --ast: Print AST.
- --ir: Print IR.
If not specified:
-src read.txt
-target out.asm
Paths are resolved relative to the execution directory.
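The option handling described above can be sketched as follows. The `Options` struct and `parse_args` function are hypothetical and do not come from `main.cpp`. In particular, returning `std::nullopt` for an unknown flag is an assumption, since the behavior for unrecognized options is not documented here.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical option record carrying the documented defaults.
struct Options {
    std::string src = "read.txt";
    std::string target = "out.asm";
    bool print_ast = false;
    bool print_ir = false;
};

// Parses the arguments that follow the program name.
// Returns std::nullopt when -src or -target lacks its path argument,
// or (as an assumption) when an unknown flag is seen.
std::optional<Options> parse_args(const std::vector<std::string>& args) {
    Options opts;
    for (std::size_t i = 0; i < args.size(); ++i) {
        const std::string& flag = args[i];
        if (flag == "--ast") {
            opts.print_ast = true;
        } else if (flag == "--ir") {
            opts.print_ir = true;
        } else if (flag == "-src" || flag == "-target") {
            if (i + 1 >= args.size()) {
                return std::nullopt;  // the required path is missing
            }
            const std::string& path = args[i + 1];
            if (flag == "-src") {
                opts.src = path;
            } else {
                opts.target = path;
            }
            ++i;  // skip the path that was just consumed
        } else {
            return std::nullopt;
        }
    }
    return opts;
}
```

With no arguments, the result carries the defaults `read.txt` and `out.asm`. With `-src a.txt --ir`, only the source path and the IR flag change.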
The compiler:
- Reads input from source file
- Writes assembly output to target file
- Prints AST and/or IR if requested
- Waits for user input after each run
  - Enter q; to exit
  - Any other input recompiles read.txt
PseudoCompiler/
├── .github/
│ └── workflows/
│ └── ci.yml # GitHub Actions CI configuration
├── include/
│ ├── ast.hpp # AST definitions (variant-based)
│ ├── codegen.hpp # Assembly code generator
│ ├── ir.hpp # IR definitions + flat_pool integration
│ └── tokens.hpp # Lexer token definitions
├── src/
│ ├── codegen.cpp
│ ├── ir.cpp
│ ├── main.cpp
│ ├── parser.yy
│ └── scanner.l
├── CMakeLists.txt
├── read.txt
├── expected.txt
├── LICENSE
├── NOTICE.md
└── README.md
This repository is licensed under the MIT License.
Third-party dependency JH-Toolkit is licensed
under Apache License 2.0.
See the NOTICE file for details.
This project exists to demonstrate:
- A minimal but complete compiler pipeline
- Modern C++20 design using value semantics
- Why closed ADTs outperform open OOP hierarchies for compilers
- Practical memory management without GC or RTTI
It is intentionally small, strict, and limited by design.