Skip to content

kdkasad/regex

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

34 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Regular expressions from scratch

This project is my own implementation of a regular expression engine. It's built from scratch, using only theory as reference.

Goals

The following are the things I want to learn from doing this project:

  • How regular expressions are converted to state machines
  • How to convert non-deterministic state machines into deterministic ones
  • Figure out how to adapt the ASCII-only ideas in the textbook I read to work with the full Unicode range1

And the following are goals I have for my implementation:

  • Write everything by hand and from scratch
  • Handle Unicode, not just ASCII1
  • Handle the following regex operators:
    • () - grouping
    • [] - character sets
    • | - alternation
    • ? - zero or one
    • * - zero or more
    • + - one or more
  • Output the finite state machines in GraphViz DOT format for visualization (see Screenshots)
  • Track the text matched by capture groups

Screenshots

Deterministic finite state machine (DFA) generated for the regular expression pattern a(x|y)*c: a(x|y)*c

Conversion of a non-deterministic finite state machine (NFA) to a deterministic one (DFA): non-deterministic state machine deterministic state machine

References

These are the resources I used to learn about regular expression matching and finite state machines.

Footnotes

  1. Many textbooks and learning resources for regular expressions and state machines assume that the input is ASCII. This greatly simplifies some of the logic, as you can create edges/transitions in the state machines for each individual character since there are only 127 possibilities. Modern software uses Unicode, not just ASCII, so I wanted to figure out on my own how I could make a state machine work with Unicode. Instead of creating an edge/transition for each individual character (since there are up to 1,111,998 possibilities), I created transitions for ranges, and broke these ranges into several disjoint ranges as part of the NFA-to-DFA conversion. 2

About

A from-scratch regular expression engine

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages