The two main code branches, the rust
and haskell
branches, contain
versions of Weighted Regular Expression as described in the paper A
Play on Regular Expressions: A Functional Pearl
(paper,
slides), by
Sebastian Fischer, Frank Huch, and Thomas Wilke. The early Haskell
versions are faithful to the paper, and are more or less what's
presented in the paper, entered by hand (as that's pretty much the only
way anything gets into my brain these days). The Rust versions are a
ports of the Haskell version, inefficiencies and all, in order to see
what it takes to take academic results and reimplement them in what is
presented as a "systems" language.
The paper uses the term "weighted" to describe the post-processing they do with an algebraic construct known as a semiring, which exploits categorical similarities between the set of booleans, the set of ordinary numbers, and other data structures to turn analyzing the return value of a successful recognition into a parseable dataset all its own. I'm using the word "rig" as it's shorter; a semiring is related to the algebraic construct called a "ring," but cannot handle negative elements, and so... a "rig."
I'm interested in the ideas of Fritz Henglein's dream of extracting all possible parsing information via a catamorphism over the parse tree, and that's exactly what semiring does. I'm evaluating the use of the semiring over Might's parser combinator approach to see which is easier to implement and provides the most bang for the buck. So having regular expressions "rigged" just seemed like the right term.
Later implementations in both branches apply the Play strategies to Brzozowski's "derivatives of regular expressions" algorithm, and demonstrates that the results hold.
This proves what I set out to show, that Goodman's 1979 paper on Semiring Parsing and Brzozowski's Parsing With Derivatives are compatible, and that these results are expressible in a strongly typed systems language like Rust.
Primary mission accomplished.
While the primary goal is complete, the entirety of the project is not. There are several things left until I consider this project done. They are:
- Implement recursive regular expressions
- Apply the equational optimizations for efficiency.
- Supply an aggregating semiring so that multiple and correlative semantic extractions can happen at the same time.
- Supply Adams's LoL: Language of Languages as an interface layer to a richer parsing experience.
These are probably separate projects, as they're specific meta-syntactic wrappers around the final version:
- Use the semiring implementation to provide a PEG engine
- Use the PEG to provide Rosie in Rust
And then there's a reality that Russ Cox and Geoff Langdale are so far ahead of me that it's not even funny.
The paper describes itself as "a play," and is presented in three acts; the first act has two scenes, the second and third have three scenes. The end of each scene describes a regular expression engine of some efficiency and capability. In total there are eight different regular expression engines implemented.
Currently written:
- Kleene Booleann regular expressions in Haskell (Section I.1)
- Kleene Boolean regular expressions in Rust (Section I.1)
- Kleene Boolean regular expressions in Typescript (Section I.1)
- Rigged Kleene regular expressions in Haskell (Section I.2)
- Rigged Kleene regular expressions in Rust (Section I.2)
- Glushkov construction of efficient Boolean regular expressions in Haskell (Section II.1a)
- Glushkov construction of efficient Boolean regular expressions in Rust, (Section II.1a)
- Brzozowski's Boolean algorithm for regular expressions in Haskell
- Brzozowski's Boolean algorithm for regular expressions in Rust, with the main algorithm in a central function.
- Brzozowski's Boolean algorithm for regular expressions, in Rust, with the main algorithm divvied among the implementation sub-handlers.
- Rigged Kleene regular expressions in Haskell with Parse Forests
- Rigged Glushkov Regular Expressions in Haskell (Section II.1b)
- Rigged Glushkov Regular Expressions in Rust (Section II.1b)
- Rigged Brzozowski's Regular Expressions in Haskell
- Rigged Brzozowski's Regular Expressions in Python
- The Heavyweight Experiments: Glushkov Regex in Haskell with Disambiguation (Section II.2)
- The Heavyweight Experiments: Glushkov Regex in Rust with Disambiguation (Section II.2)
- The Final Experiment: Rigged Brzozowski's Regular Expressions in Rust
- Mission accomplished: Rigged Brzozowski μ-Regular Expressions in Rust
- An exploration of whether or not extended regular expressions (regular expressions supporting the Intersection, Negation, and Interleaf operations) is theoretically sound.
- An implementation of a Rigged Brzozowski Algorithm with Adams's and Darais's optimizations.
- An implementation of a Rigged Brzozowski Algorithm with Might's recursion enabled, using Adams's laziness as its basis. =======
This codebase is a testbed for ideas to be incorporated into the Barre, Brzozowski's Algorithm for Rust Regular Expressions. It is now somewhat idle, as most development will be happening in that project, rather than this one.
The last experiment, Rust 9, represents the culmination of the work. It demonstrates the fundamentals of recursive regular expressions with a mathematically sound foundation for "lifting" the processing of a regular expression into a resultant dataset. Recursive regular expressions with variables form the basis of an extremely powerful context-free grammar parser.
A major goal for this exercise is to illustrate the challenges, and perhaps impossibilities, of porting idiomatic Haskell to idiomatic Rust. The regular expressions being built are static and compiled at compile-time with the rest of the code.
The best thing I've learned is that many Haskell idioms do port over
easily to Rust. Looking at the Rust version of accept
in the first
two examples and comparing them to the Haskell, you can readily see just
how straightforward it translates.
On the other hand, Haskell's ability to define list processing
recursively really shines in the definitions of split
and parts
,
which are both three lines long in Haskell, but 21 and 29 lines long
respectively, in Rust.
The other thing is that the "Rigged Kleene regular expressions in
Haskell with Parse Forests" makes me unhappy; I don't have a good
theoretical model for the change I made. It turns out my theoretical
model was just fine; I just didn't know or understand it sufficiently at
the time. The expectation is that yes, you can return an annihilator,
or you can return a value. The default value is "one," but the meaning
of "one" can be whatever you want it to mean, so long as the Semiring
laws hold for the definitions you invent. In short, the value treats
the zero and one as "annihilator" and "identity" operations against your
return value. When the value is primitive, such as in the Boolean or
Integer case, all you need are the zero and one; when the value is
compositional, you need the mul
operator to do the dirty work for you.
And then you remove the 'identity' pass because, well, the CPU doesn't need to waste time on that.
The Haskell code in Haskell Experiments 01, 02, and 04 is straight from the Play paper and therefor belongs to Sebastian Fischer, Frank Huch, and Thomas Wilke. The code in Haskell Experiments 07 and 08 take the fragments offered in the same paper and realize them as full-blown recognizers and parsers.
All other code is Copyright Elf M. Sternberg (c) 2019, and licensed with the Mozilla Public License vers. 2.0. A copy of the license file is included in the root folder.