This project is my own implementation of a regular expression engine. It's built from scratch, using only theory as reference.
The following are the things I want to learn from doing this project:
- How regular expressions are converted to state machines
- How to convert non-deterministic state machines into deterministic ones
- Figure out how to adapt the ASCII-only ideas in the textbook I read to work with the full Unicode range1
And the following are goals I have for my implementation:
- Write everything by hand and from scratch
- Handle Unicode, not just ASCII1
- Handle the following regex operators:
-
()- grouping -
[]- character sets -
|- alternation -
?- zero or one -
*- zero or more -
+- one or more
-
- Output the finite state machines in GraphViz DOT format for visualization (see Screenshots)
- Track the text matched by capture groups
Deterministic finite state machine (DFA) generated for the regular expression pattern a(x|y)*c:
Conversion of a non-deterministic finite state machine (NFA) to a deterministic
one (DFA):
These are the resources I used to learn about regular expression matching and finite state machines.
- Compilers: Principles, Practice, and Tools
- Sections 3.6 and 3.7
- Regular Expression Matching Can Be Simple And Fast
Footnotes
-
Many textbooks and learning resources for regular expressions and state machines assume that the input is ASCII. This greatly simplifies some of the logic, as you can create edges/transitions in the state machines for each individual character since there are only 127 possibilities. Modern software uses Unicode, not just ASCII, so I wanted to figure out on my own how I could make a state machine work with Unicode. Instead of creating an edge/transition for each individual character (since there are up to 1,111,998 possibilities), I created transitions for ranges, and broke these ranges into several disjoint ranges as part of the NFA-to-DFA conversion. ↩ ↩2