Rigged Regular Expressions

The two main code branches, the rust and haskell branches, contain versions of Weighted Regular Expression as described in the paper A Play on Regular Expressions: A Functional Pearl (paper, slides), by Sebastian Fischer, Frank Huch, and Thomas Wilke. The early Haskell versions are faithful to the paper, and are more or less what's presented in the paper, entered by hand (as that's pretty much the only way anything gets into my brain these days). The Rust versions are a ports of the Haskell version, inefficiencies and all, in order to see what it takes to take academic results and reimplement them in what is presented as a "systems" language.

The paper uses the term "weighted" to describe the post-processing they do with an algebraic construct known as a semiring, which exploits categorical similarities between the set of booleans, the set of ordinary numbers, and other data structures to turn analyzing the return value of a successful recognition into a parseable dataset all its own. I'm using the word "rig" as it's shorter; a semiring is related to the algebraic construct called a "ring," but cannot handle negative elements, and so... a "rig."

I'm interested in the ideas of Fritz Henglein's dream of extracting all possible parsing information via a catamorphism over the parse tree, and that's exactly what semiring does. I'm evaluating the use of the semiring over Might's parser combinator approach to see which is easier to implement and provides the most bang for the buck. So having regular expressions "rigged" just seemed like the right term.

Later implementations in both branches apply the Play strategies to Brzozowski's "derivatives of regular expressions" algorithm, and demonstrates that the results hold.

This proves what I set out to show, that Goodman's 1979 paper on Semiring Parsing and Brzozowski's Parsing With Derivatives are compatible, and that these results are expressible in a strongly typed systems language like Rust.

Primary mission accomplished.

Secondary mission

While the primary goal is complete, the entirety of the project is not. There are several things left until I consider this project done. They are:

Implement recursive regular expressions
Apply the equational optimizations for efficiency.
Supply an aggregating semiring so that multiple and correlative semantic extractions can happen at the same time.
Supply Adams's LoL: Language of Languages as an interface layer to a richer parsing experience.

These are probably separate projects, as they're specific meta-syntactic wrappers around the final version:

Use the semiring implementation to provide a PEG engine
Use the PEG to provide Rosie in Rust

And then there's a reality that Russ Cox and Geoff Langdale are so far ahead of me that it's not even funny.

Status

The paper describes itself as "a play," and is presented in three acts; the first act has two scenes, the second and third have three scenes. The end of each scene describes a regular expression engine of some efficiency and capability. In total there are eight different regular expression engines implemented.

Currently written:

Kleene Booleann regular expressions in Haskell (Section I.1)
Kleene Boolean regular expressions in Rust (Section I.1)
Kleene Boolean regular expressions in Typescript (Section I.1)
Rigged Kleene regular expressions in Haskell (Section I.2)
Rigged Kleene regular expressions in Rust (Section I.2)
Glushkov construction of efficient Boolean regular expressions in Haskell (Section II.1a)
Glushkov construction of efficient Boolean regular expressions in Rust, (Section II.1a)
Brzozowski's Boolean algorithm for regular expressions in Haskell
Brzozowski's Boolean algorithm for regular expressions in Rust, with the main algorithm in a central function.
Brzozowski's Boolean algorithm for regular expressions, in Rust, with the main algorithm divvied among the implementation sub-handlers.
Rigged Kleene regular expressions in Haskell with Parse Forests
Rigged Glushkov Regular Expressions in Haskell (Section II.1b)
Rigged Glushkov Regular Expressions in Rust (Section II.1b)
Rigged Brzozowski's Regular Expressions in Haskell
Rigged Brzozowski's Regular Expressions in Python
The Heavyweight Experiments: Glushkov Regex in Haskell with Disambiguation (Section II.2)
The Heavyweight Experiments: Glushkov Regex in Rust with Disambiguation (Section II.2)
The Final Experiment: Rigged Brzozowski's Regular Expressions in Rust
Mission accomplished: Rigged Brzozowski μ-Regular Expressions in Rust

Todo

An exploration of whether or not extended regular expressions (regular expressions supporting the Intersection, Negation, and Interleaf operations) is theoretically sound.
An implementation of a Rigged Brzozowski Algorithm with Adams's and Darais's optimizations.
An implementation of a Rigged Brzozowski Algorithm with Might's recursion enabled, using Adams's laziness as its basis. =======

Caveat

This codebase is a testbed for ideas to be incorporated into the Barre, Brzozowski's Algorithm for Rust Regular Expressions. It is now somewhat idle, as most development will be happening in that project, rather than this one.

The last experiment, Rust 9, represents the culmination of the work. It demonstrates the fundamentals of recursive regular expressions with a mathematically sound foundation for "lifting" the processing of a regular expression into a resultant dataset. Recursive regular expressions with variables form the basis of an extremely powerful context-free grammar parser.

Lessons learned

A major goal for this exercise is to illustrate the challenges, and perhaps impossibilities, of porting idiomatic Haskell to idiomatic Rust. The regular expressions being built are static and compiled at compile-time with the rest of the code.

The best thing I've learned is that many Haskell idioms do port over easily to Rust. Looking at the Rust version of accept in the first two examples and comparing them to the Haskell, you can readily see just how straightforward it translates.

On the other hand, Haskell's ability to define list processing recursively really shines in the definitions of split and parts, which are both three lines long in Haskell, but 21 and 29 lines long respectively, in Rust.

~~The other thing is that the "Rigged Kleene regular expressions in Haskell with Parse Forests" makes me unhappy; I don't have a good theoretical model for the change I made.~~ It turns out my theoretical model was just fine; I just didn't know or understand it sufficiently at the time. The expectation is that yes, you can return an annihilator, or you can return a value. The default value is "one," but the meaning of "one" can be whatever you want it to mean, so long as the Semiring laws hold for the definitions you invent. In short, the value treats the zero and one as "annihilator" and "identity" operations against your return value. When the value is primitive, such as in the Boolean or Integer case, all you need are the zero and one; when the value is compositional, you need the mul operator to do the dirty work for you.

And then you remove the 'identity' pass because, well, the CPU doesn't need to waste time on that.

LICENSE

The Haskell code in Haskell Experiments 01, 02, and 04 is straight from the Play paper and therefor belongs to Sebastian Fischer, Frank Huch, and Thomas Wilke. The code in Haskell Experiments 07 and 08 take the fragments offered in the same paper and realize them as full-blown recognizers and parsers.

Name		Name	Last commit message	Last commit date
Latest commit History 64 Commits
docs		docs
haskell		haskell
node		node
python		python
rust		rust
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE.md		LICENSE.md
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Rigged Regular Expressions

Secondary mission

Status

Todo

Caveat

Lessons learned

LICENSE

About

Releases

Packages

Languages

License

elfsternberg/riggedregex

Folders and files

Latest commit

History

Repository files navigation

Rigged Regular Expressions

Secondary mission

Status

Todo

Caveat

Lessons learned

LICENSE

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages