regexc is a regular expression compiler. It takes a regular expression as input and creates a binary, that matches that regular expression, as output.
- Introduction
- How To Build
- How To Use Regexc
- Regular Expression Syntax
- For Developers
- Future Work
- Helpful Links
My goal for this project was to explore using prolog in a practical setting. That means understanding things like developer workflow, project layout, and argument parsing. Although the use of the this project is a bit contrived, I hope that it can serve as an accessable example for what a practical prolog application might look like.
You will need to install SWI Prolog. I recomend using your package manager of choice. I'm on macOS so I used brew.
brew install swi-prolog
From there just run make
. This will create a distributable in build
.
$ make
$ file build/regexc
This project is still under development. At this stage the compiler pipeline is not complete.
At the moment regexc can parse regular expressions, and covert the abstract syntax tree into a Non-Deterministic Automata. Regexc can dump its intermediate representations as dot files. To use dot files I would recomend installing graphviz.
Here is an example of how to use regexc in its current form.
regexc -r "\d{2,3}-\d{4,7}-[a-zA-Z_]+" --ast-dot /tmp/ast.dot --nfa-dot /tmp/nfa.dot &&
dot -Tsvg -o /tmp/ast.svg ast.dot &&
dot -Tsvg -o /tmp/nfa.svg nfa.dot &&
open /tmp/ast.svg /tmp/nfa.svg
Regexc should provide a useful error message if it fails to parse a regular expression.
$ regexc -r "what is (this|that"
ERROR: No closing parenthesis at 8
what is (this|that
^
ERROR: No strings were parsed successfully
Exiting due to above Errors...
A generic error is provided, but I am working towards eliminating the case where it is seen.
$ regexc -r "\d{2-3}"
ERROR: Could not parse string at 0
\d{2-3}
^
ERROR: No strings were parsed successfully
Exiting due to above Errors...
Regexc's supported syntax is meant to be similiar to existing implementations. Broadly speaking here are the supported operations.
Expression Grammer (Ignore spaces)
Expr -> Expr Expr # Concatenate two expressions
Expr | Expr # Logical OR of two expresions
Expr * # Expression can occur any number of times
Expr ? # Expression can occur once or not at all
Expr + # Expression must occur at least once
( Expr ) # Group the expression
Expr {Number} # Expression must occur Integer number of times
Expr { Maybe_Number , Maybe_Number } # Expression Must occur at least the first field
# Expression May not occur more than second field
# Where Maybe_Number is:
# Not defined and thus no bound
# OR
# Some number bound
Expr -> Single_Char # Matches a single character
The syntax supports a variety of ways of matching a single character. There is the wildcard, .
, that can match
any character.
In general, a character in the expression will be taken to literally match that character. Control and Operator characters must be backslash escaped to be taken literally.
Control Symbols = ['\\', '(', ')', '[', ']', '-']
Operator Symbols = ['+', '*', '?', '{', '}', '|', '.']
For example, "\(" Would then match the '(' instead of being interpreted as starting a group.
There are also classes that can match one of a set of characters. These are specified with [ Members ]
notations.
Class_Definition -> [ Member Members ] # Classes must contain at least one member
Members -> Member Members #
[]
Member -> Single_Char # A single char to include in the class
-> Single_Char - Single_Char # A range of characters to include in the class
For example, "[a-z_]" would match any lowercase letter or '_'.
We also provide some shortcuts to commonly used classes. These class shortcuts can be used freely outside of class defintions, but can only be used in place of a range when in a class defintions. That is, one cannot use a class shortcut as a bound in a range.
\d -> "[0-9]" # Any Digit
\D -> "[^0-9]" # Any non-Digit
\w -> "[a-zA-Z0-9_-]" # Any "word" character
\W -> "[^a-zA-Z0-9_-]" # Any non-word character
\s -> "[ \t]" # Whitespace character
\S -> "[^ \t]" # Not a whitespace character
You will need to install SWI Prolog. I recomend using your package manager of choice. I'm on macOS so I used brew.
brew install swi-prolog
To run the documenation server, use the doc_server
target in the Makefile.
$ make doc_server
This will launch a swipl instance. Use the halt/0
predicate when you want to stop the server.
To create static documentation use the static_docs
target int the Makefile.
$ make static_docs
You can open the resulting html document at ./build/static_documentation/index.html
. However, when I make
the static_docs this way, the content does not match what I see when running the doc_server. The static_docs
seem to show that only the first predicate in each file is properly documented. I plan on fixing this.
While working on the source code, I recommend starting a SWI-Prolog session that initializes with the load module. Whenever you want to consult your changes, run make/0.
$ swipl -f src/load.pl
<blah blah>
?-
...
?- make.
% <blah blah>
true.
I recommend taking advantage of the top-level (REPL if you prefer), prolog feels really natural in this environment. Here is an example where we parse a string and write the AST's dot representation to a file.
?- S = "0x[a-f0-9]+".
S = "0x[a-f0-9]+".
?- string_ast($S, AST, Errors).
AST = ast_concat(ast_range(48, 48), ast_concat(ast_range(120, 120), ast_occurance(ast_or(ast_range(97, 102), ast_
range(48, 57)), some(1), none))),
Errors = [],
S = "1x[a-f0-9]+".
?- write_to_file_once(ast_to_dot($AST), "/tmp/ast.dot", _).
AST = ast_concat(ast_range(48, 48), ast_concat(ast_range(120, 120), ast_occurance(ast_or(ast_range(97, 102), ast_range(48, 57)), some(1), none))).
The Makefile has a test
target, but that includes a code coverage report as well. You can use
the run_tests/0
goal to run the tests without the extra machinery to find code coverage.
# Run Tests with the Makefile
$ make test
# Run Coverage tests without coverage results
$ swipl -f src/load.pl -g run_tests
In addition, the project is currently configured to use TravisCI.
To run the main interface without packaging it you can simply load src/interface.pl
. Arguments meant
for for the optparse application can be passed after a --
. For example:
swipl -f src/interface.pl -- -r "\d{2,3}"
- No trailing whitespace.
- Please keep one letter variable names to the absolute minimum. I will accept 'C' as a stand-in for 'Character' for example. 'C1' and 'C2' are probably not a good idea though. Please strive for descriptive variable names.
- Functor and variable names should be composed with '_'. No Camelcase.
- To clarify, Variables will be a bunch of captilized words,
Variable_Name
, and functors will all be lowercase,functor_name
.
- To clarify, Variables will be a bunch of captilized words,
- If you are adding a new module, please add it to
src/load.pl
. This will, among other things, ensure it remains tested.
I would like to have an automatic source formatter in the future.
I am not concerned with compatability for other prolog distributions. While I am always willing to hear other opinions, for now it seems to me that SWI-Prolog is the only prolog distribtion with a solid and public ecosystem.
SWI Prolog provides an optparse module that is modeled after python's own optparse. This module provides the minimum of what I would expect from an argument parsing library in any language: one specification can be used to generate both a help message and parse the arguments.
SWI Prolog includes a unit test package, plunit.
There are unit tests in every module. In addition, there will be a module for defining integration tests.
-
I have not found a good solution for source formatting. There are some promising avenues however.
-
package-indent is written by Jan Wielemaker himself. Initially, I found that it did not produce the expected results (removed lines for example). However I was likely holding it wrong, did not have well made code.
-
vcs-prolog is a VS Code plugin that provides, among other things, prolog source formatting. It is much newer than package-indent, and was also a supervised by Jan Wielmaker.
-
That being said, I have started to investigate what it would take to create a formatting tool. Prolog has several useful facilities for parsing and formatting code. However these utilities quickly drop comments, which means that a formatter that takes comments within the predicate defintion would require far more work. This might be worth it, or it may be wise to avoid this situation.
-
I had trouble with the optparse module. The trouble was not unique to that module, however. The issue I faced was the the app specification is a very specific kind of structure. Any goals that expected that structure would simply fail if it did not meet match the expectations. In this case I think it would be wise to have some goal that can type-check or parse the application structure and give detailed feedback on why it does not meet expections. Adding such a goal to SWI Prolog might be worth opening an issue / making a pull request.
opt_check_spec(+OptsSpec, -Errors)
I found this project to be a useful example of many prolog techniques. Alan Compiler by thanosqr
I found this SO question useful when investigating how to handle input. String Stream in Prolog?
I found this article useful for adding travis support. Adding Travis to SWI-Prolog packs
The regex crate for Rust has a useful set of tools a good documentation for their syntax. I would like to someday be fully compatable with this specification. Transitivley, this would make regexc compatable with ripgrep. regex-rs syntax documentation
The author of ripgrep wrote a blog post about text search that is a must-read if you are interested in that problem space. ripgrep is faster than {grep, ag, git grep, ucg, pt, sift}
This stackoverflow question got me on the right track for handling character ranges. How to implement regular expression NFA with character ranges?
This helped me understand unicode in a way I hadn't before. What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text