Weighted Regular Expressions, an experiment in porting an academic Haskell library to Rust

Language: Haskell Language: Rust Topic: Academic Topic: Regular Expressions Status: In Progress

Rigged Regular Expressions

The two main code branches, rust and haskell, contain versions of Weighted Regular Expressions as described in the paper A Play on Regular Expressions: A Functional Pearl (paper, slides) by Sebastian Fischer, Frank Huch, and Thomas Wilke. The early Haskell versions are faithful to the paper, and are more or less what's presented there, entered by hand (as that's pretty much the only way anything gets into my brain these days). The Rust versions are ports of the Haskell versions, inefficiencies and all, in order to see what it takes to reimplement academic results in what is presented as a "systems" language.

The paper uses the term "weighted" to describe the post-processing they do with an algebraic construct known as a semiring, which exploits categorical similarities between the set of booleans, the set of ordinary numbers, and other data structures to turn analyzing the return value of a successful recognition into a parseable dataset all its own. I'm using the word "rig" as it's shorter; a semiring is related to the algebraic construct called a "ring," but cannot handle negative elements, and so... a "rig."
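
As a rough illustration of the idea, a semiring can be sketched in Rust as a trait with `zero`, `one`, `add`, and `mul` (the names here are mine, not necessarily the repository's); instantiated at `bool` it recognizes, and at `usize` it counts the number of ways a match can succeed:

```rust
// Illustrative sketch of a semiring interface; names are assumptions,
// not necessarily those used in this repository.
trait Semiring {
    fn zero() -> Self; // additive identity; annihilates under mul
    fn one() -> Self;  // multiplicative identity
    fn add(self, rhs: Self) -> Self;
    fn mul(self, rhs: Self) -> Self;
}

// Booleans under (||, &&): a plain recognizer.
impl Semiring for bool {
    fn zero() -> Self { false }
    fn one() -> Self { true }
    fn add(self, rhs: Self) -> Self { self || rhs }
    fn mul(self, rhs: Self) -> Self { self && rhs }
}

// Natural numbers under (+, *): counts distinct successful matches.
impl Semiring for usize {
    fn zero() -> Self { 0 }
    fn one() -> Self { 1 }
    fn add(self, rhs: Self) -> Self { self + rhs }
    fn mul(self, rhs: Self) -> Self { self * rhs }
}
```

Swapping the instantiation swaps the analysis without touching the matching code, which is the "weighting" trick the paper builds on.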

I'm interested in Fritz Henglein's dream of extracting all possible parsing information via a catamorphism over the parse tree, and that's exactly what the semiring does. I'm evaluating the use of the semiring against Might's parser-combinator approach to see which is easier to implement and provides the most bang for the buck. So having regular expressions "rigged" just seemed like the right term.

Later implementations in both branches apply the Play strategies to Brzozowski's "derivatives of regular expressions" algorithm, and demonstrate that the results hold.
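
For orientation, the unweighted core of Brzozowski's algorithm is small. This is a minimal sketch of my own, not one of the rigged versions in this repository (which, roughly speaking, replace the booleans below with semiring values):

```rust
// Minimal, unweighted sketch of Brzozowski's derivative matcher.
#[derive(Clone)]
enum Re {
    Empty,                  // rejects everything
    Eps,                    // matches only the empty string
    Sym(char),              // matches exactly one character
    Alt(Box<Re>, Box<Re>),  // alternation
    Seq(Box<Re>, Box<Re>),  // concatenation
    Star(Box<Re>),          // Kleene star
}

// Can the expression match the empty string?
fn nullable(r: &Re) -> bool {
    match r {
        Re::Empty | Re::Sym(_) => false,
        Re::Eps | Re::Star(_) => true,
        Re::Alt(a, b) => nullable(a) || nullable(b),
        Re::Seq(a, b) => nullable(a) && nullable(b),
    }
}

// The derivative of r with respect to c: the language of strings s
// such that c followed by s is matched by r.
fn deriv(r: &Re, c: char) -> Re {
    match r {
        Re::Empty | Re::Eps => Re::Empty,
        Re::Sym(s) => if *s == c { Re::Eps } else { Re::Empty },
        Re::Alt(a, b) => Re::Alt(Box::new(deriv(a, c)), Box::new(deriv(b, c))),
        Re::Seq(a, b) => {
            let left = Re::Seq(Box::new(deriv(a, c)), b.clone());
            if nullable(a) {
                Re::Alt(Box::new(left), Box::new(deriv(b, c)))
            } else {
                left
            }
        }
        Re::Star(a) => Re::Seq(Box::new(deriv(a, c)), Box::new(Re::Star(a.clone()))),
    }
}

// Match by deriving once per input character, then checking nullability.
fn matches(r: &Re, s: &str) -> bool {
    nullable(&s.chars().fold(r.clone(), |r, c| deriv(&r, c)))
}
```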

This proves what I set out to show: that the semiring parsing of Goodman's 1999 paper and Brzozowski's derivatives of regular expressions are compatible, and that these results are expressible in a strongly typed systems language like Rust.

Primary mission accomplished.

Secondary mission

While the primary goal is complete, the entirety of the project is not. There are several things left until I consider this project done. They are:

  • Implement recursive regular expressions
  • Apply the equational optimizations for efficiency
  • Supply an aggregating semiring so that multiple and correlative semantic extractions can happen at the same time
  • Supply Adams's LoL: Language of Languages as an interface layer to a richer parsing experience
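
For the aggregating-semiring item, one plausible shape (an assumption of mine, not this repository's code) is the product construction: the componentwise pairing of two semirings is again a semiring, so two extractions can run over the input in a single pass:

```rust
// Hedged sketch: a product of semirings. `Semiring` is an assumed trait;
// none of these names are guaranteed to match the repository's code.
trait Semiring: Sized {
    fn zero() -> Self;
    fn one() -> Self;
    fn add(self, rhs: Self) -> Self;
    fn mul(self, rhs: Self) -> Self;
}

// Recognition: does the input match at all?
impl Semiring for bool {
    fn zero() -> Self { false }
    fn one() -> Self { true }
    fn add(self, rhs: Self) -> Self { self || rhs }
    fn mul(self, rhs: Self) -> Self { self && rhs }
}

// Counting: how many distinct parses are there?
impl Semiring for usize {
    fn zero() -> Self { 0 }
    fn one() -> Self { 1 }
    fn add(self, rhs: Self) -> Self { self + rhs }
    fn mul(self, rhs: Self) -> Self { self * rhs }
}

// The product of two semirings, componentwise: one pass over the input
// computes both analyses at once.
impl<A: Semiring, B: Semiring> Semiring for (A, B) {
    fn zero() -> Self { (A::zero(), B::zero()) }
    fn one() -> Self { (A::one(), B::one()) }
    fn add(self, rhs: Self) -> Self { (self.0.add(rhs.0), self.1.add(rhs.1)) }
    fn mul(self, rhs: Self) -> Self { (self.0.mul(rhs.0), self.1.mul(rhs.1)) }
}
```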

These are probably separate projects, as they're specific meta-syntactic wrappers around the final version:

  • Use the semiring implementation to provide a PEG engine
  • Use the PEG to provide Rosie in Rust

And then there's a reality that Russ Cox and Geoff Langdale are so far ahead of me that it's not even funny.

Status

The paper describes itself as "a play," and is presented in three acts; the first act has two scenes, and the second and third acts have three scenes each. The end of each scene describes a regular expression engine of some efficiency and capability. In total, eight different regular expression engines are implemented.

Currently written:

Todo

  • An exploration of whether or not extended regular expressions (regular expressions supporting the Intersection, Negation, and Interleaf operations) are theoretically sound.
  • An implementation of a Rigged Brzozowski Algorithm with Adams's and Darais's optimizations.
  • An implementation of a Rigged Brzozowski Algorithm with Might's recursion enabled, using Adams's laziness as its basis.

Caveat

This codebase is a testbed for ideas to be incorporated into Barre, Brzozowski's Algorithm for Rust Regular Expressions. It is now somewhat idle, as most development will be happening in that project rather than this one.

The last experiment, Rust 9, represents the culmination of the work. It demonstrates the fundamentals of recursive regular expressions with a mathematically sound foundation for "lifting" the processing of a regular expression into a resultant dataset. Recursive regular expressions with variables form the basis of an extremely powerful context-free grammar parser.

Lessons learned

A major goal of this exercise is to illustrate the challenges, and perhaps impossibilities, of porting idiomatic Haskell to idiomatic Rust. The regular expressions being built are static, compiled along with the rest of the code rather than constructed at runtime.

The best thing I've learned is that many Haskell idioms do port over easily to Rust. Looking at the Rust version of accept in the first two examples and comparing it to the Haskell, you can readily see just how straightforwardly it translates.
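
For flavor, here is a condensed sketch of how the paper's naive accept reads once ported to Rust. This is illustrative only; the repository's versions differ in detail, and the repetition case is omitted for brevity:

```rust
// Condensed port sketch of the paper's naive accept; illustrative only.
enum Reg {
    Eps,                      // the empty string
    Sym(char),                // a single character
    Alt(Box<Reg>, Box<Reg>),  // alternation
    Seq(Box<Reg>, Box<Reg>),  // concatenation
}

fn accept(r: &Reg, u: &[char]) -> bool {
    match r {
        Reg::Eps => u.is_empty(),
        Reg::Sym(c) => u.len() == 1 && u[0] == *c,
        Reg::Alt(p, q) => accept(p, u) || accept(q, u),
        // Try every way of cutting u into a prefix for p and a suffix for q.
        Reg::Seq(p, q) => {
            (0..=u.len()).any(|i| accept(p, &u[..i]) && accept(q, &u[i..]))
        }
    }
}
```

The match arms line up almost one-for-one with the Haskell clauses.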

On the other hand, Haskell's ability to define list processing recursively really shines in the definitions of split and parts, each of which is three lines long in Haskell, but 21 and 29 lines long, respectively, in Rust.
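
For a sense of what split computes, here is a condensed iterator-based sketch of my own (not the repository's 21-line version): all the ways of cutting an input into a prefix/suffix pair.

```rust
// All ways to split a sequence of characters into a prefix/suffix pair,
// in input order; a condensed sketch, not the repository's version.
fn split(s: &[char]) -> Vec<(Vec<char>, Vec<char>)> {
    (0..=s.len())
        .map(|i| (s[..i].to_vec(), s[i..].to_vec()))
        .collect()
}
```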

The other thing is that the "Rigged Kleene regular expressions in Haskell with Parse Forests" implementation originally made me unhappy: at the time, I didn't have a good theoretical model for the change I made. It turns out my theoretical model was just fine; I just didn't know or understand it sufficiently. The expectation is that you can return either an annihilator or a value. The default value is "one," but the meaning of "one" can be whatever you want it to mean, so long as the semiring laws hold for the definitions you invent. In short, the semiring treats zero and one as "annihilator" and "identity" operations on your return value. When the value is primitive, such as in the Boolean or Integer case, all you need are the zero and one; when the value is compositional, you need the mul operator to do the dirty work for you.
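
To make the compositional case concrete, here is a hedged sketch (invented names, not the repository's code) of a semiring whose values are sets of matched strings: zero is the empty set (the annihilator), one is the set containing only the empty string (the identity), add unions alternatives, and mul concatenates pairwise:

```rust
use std::collections::BTreeSet;

// Hedged sketch: a compositional semiring over sets of strings.
#[derive(Clone, PartialEq, Debug)]
struct Langs(BTreeSet<String>);

impl Langs {
    // The empty set: annihilates under mul, is the identity for add.
    fn zero() -> Self { Langs(BTreeSet::new()) }
    // The set containing only "": the identity for mul.
    fn one() -> Self { Langs(std::iter::once(String::new()).collect()) }
    // Union of alternatives.
    fn add(self, rhs: Self) -> Self {
        Langs(self.0.union(&rhs.0).cloned().collect())
    }
    // Pairwise concatenation: every left string joined to every right one.
    fn mul(self, rhs: Self) -> Self {
        let mut out = BTreeSet::new();
        for a in &self.0 {
            for b in &rhs.0 {
                out.insert(format!("{}{}", a, b));
            }
        }
        Langs(out)
    }
}
```

Here "one" really is an identity: concatenating every element with "" returns the set unchanged, which is exactly the semiring-law sense in which "one" can mean whatever you want.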

And then you remove the 'identity' pass because, well, the CPU doesn't need to waste time on that.

LICENSE

The Haskell code in Haskell Experiments 01, 02, and 04 is straight from the Play paper and therefore belongs to Sebastian Fischer, Frank Huch, and Thomas Wilke. The code in Haskell Experiments 07 and 08 takes the fragments offered in the same paper and realizes them as full-blown recognizers and parsers.

All other code is Copyright Elf M. Sternberg (c) 2019, and licensed under the Mozilla Public License, version 2.0. A copy of the license file is included in the root folder.
