feat: changed all regex function to proc-macros
Showing 18 changed files with 319 additions and 140 deletions.
@@ -1,23 +1,5 @@
# Gregex ![crates.io](https://img.shields.io/crates/v/gregex.svg) ![Build Passing](https://github.com/Saphereye/gregex/actions/workflows/ci.yml/badge.svg)

Gregex is a regular expression solver which utilizes Non-deterministic Finite Automata (NFA) to simulate the input strings.
![](https://github.com/Saphereye/gregex/raw/master/assets/gregex_workflow.excalidraw.svg)

## Usage

```rust
extern crate gregex;
use gregex::*;
fn main() {
    let tree = dot!(star!('a'), 'b', 'c');
    let regex = regex(&tree);
    assert!(regex.run("abc"));
    assert!(!regex.run("a"));
    assert!(regex.run("aaabc"));
}
```

## Theory
The project uses [Glushkov's construction algorithm](https://en.wikipedia.org/wiki/Glushkov%27s_construction_algorithm) for creating the NFA.

The pipeline can be summarised as below
![](https://github.com/Saphereye/gregex/blob/master/assets/gregex_workflow.excalidraw.svg)
Gregex is a regular expression solver which utilizes Non-deterministic Finite Automata (NFA) to simulate the input strings.
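To make "utilizes an NFA to simulate the input strings" concrete, here is a minimal sketch of set-based NFA simulation. The `Nfa` struct, its fields, and the hand-built automaton for `a*bc` are assumptions made for illustration; they are not gregex's actual `NFA` type, whose internals are not shown in this commit.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical NFA type for illustration only (not gregex's `NFA`).
pub struct Nfa {
    /// Maps (state, input symbol) to the set of reachable next states.
    transitions: HashMap<(u32, char), HashSet<u32>>,
    /// States in which the input is accepted.
    accepting: HashSet<u32>,
}

impl Nfa {
    /// Hand-built, epsilon-free NFA for `a*bc`; state 0 is the start state.
    pub fn a_star_bc() -> Self {
        let mut transitions: HashMap<(u32, char), HashSet<u32>> = HashMap::new();
        let mut add = |from: u32, sym: char, to: u32| {
            transitions.entry((from, sym)).or_default().insert(to);
        };
        add(0, 'a', 1);
        add(0, 'b', 2);
        add(1, 'a', 1);
        add(1, 'b', 2);
        add(2, 'c', 3);
        Nfa {
            transitions,
            accepting: HashSet::from([3]),
        }
    }

    /// Tracks every state the NFA could be in after each input symbol.
    pub fn run(&self, input: &str) -> bool {
        let mut current: HashSet<u32> = HashSet::from([0]);
        for sym in input.chars() {
            let mut next = HashSet::new();
            for state in &current {
                if let Some(targets) = self.transitions.get(&(*state, sym)) {
                    next.extend(targets.iter().copied());
                }
            }
            if next.is_empty() {
                return false; // no state can consume this symbol
            }
            current = next;
        }
        // Accept only if the whole input was consumed and an accepting state is active.
        current.iter().any(|s| self.accepting.contains(s))
    }
}
```

With this sketch, `Nfa::a_star_bc().run("aaabc")` accepts while `run("a")` rejects, mirroring the assertions in the usage example above.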
@@ -0,0 +1,9 @@
extern crate gregex;
use gregex::*;

fn main() {
    let runner = regex!(dot!('a', 'b', 'c'));
    assert_eq!(runner.run("abc"), true);
    assert_eq!(runner.run("ab"), false);
    assert_eq!(runner.run("abcd"), false);
}
@@ -0,0 +1,9 @@
extern crate gregex;
use gregex::*;

fn main() {
    let runner = regex!(or!('a', 'b', 'c'));
    assert_eq!(runner.run("a"), true);
    assert_eq!(runner.run("b"), true);
    assert_eq!(runner.run("c"), true);
}
@@ -0,0 +1,9 @@
extern crate gregex;
use gregex::*;

fn main() {
    let runner = regex!(star!('a'));
    assert_eq!(runner.run("a"), true);
    assert_eq!(runner.run("aa"), true);
    assert_eq!(runner.run(""), true);
}
@@ -0,0 +1,6 @@
[package]
name = "gregex-logic"
version = "0.1.0"
edition = "2021"

[dependencies]
@@ -0,0 +1,6 @@
# Gregex Logic
Contains the underlying logic of the Gregex crate. This crate is responsible for converting the Node tree to the NFA. The NFA is then used to match the input string.

The crate uses [Glushkov's construction algorithm](https://en.wikipedia.org/wiki/Glushkov%27s_construction_algorithm) to convert the Node tree to the NFA. Its advantage over Thompson's construction algorithm is that the generated NFA has no epsilon transitions and exactly one state per terminal plus one initial state. That said, the NFA generated by Thompson's construction can be converted to the Glushkov form by removing its epsilon transitions.

The `translation` module contains the code to convert the Node tree to the NFA. The `nfa` module contains the code to match the input string with the NFA.
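As a rough illustration of Glushkov's construction, the sketch below computes the three position sets the automaton is built from. The `Re` enum and the function names `nullable`, `first`, `last`, and `follow` are assumptions for this example; they presumably play roles similar to the `prefix_set`, `suffix_set`, and `factors_set` functions referenced elsewhere in this commit, but they are not the crate's API.

```rust
use std::collections::HashSet;

/// Hypothetical regex AST for illustration; gregex's real `Node` type differs.
/// Each terminal carries a unique position index, starting at 1.
pub enum Re {
    Sym(char, u32),
    Concat(Box<Re>, Box<Re>),
    Or(Box<Re>, Box<Re>),
    Star(Box<Re>),
}

/// True if the expression matches the empty string.
pub fn nullable(re: &Re) -> bool {
    match re {
        Re::Sym(..) => false,
        Re::Concat(l, r) => nullable(l) && nullable(r),
        Re::Or(l, r) => nullable(l) || nullable(r),
        Re::Star(_) => true,
    }
}

/// Positions that can start a match.
pub fn first(re: &Re) -> HashSet<u32> {
    match re {
        Re::Sym(_, i) => HashSet::from([*i]),
        Re::Concat(l, r) => {
            let mut s = first(l);
            if nullable(l) {
                s.extend(first(r));
            }
            s
        }
        Re::Or(l, r) => {
            let mut s = first(l);
            s.extend(first(r));
            s
        }
        Re::Star(inner) => first(inner),
    }
}

/// Positions that can end a match.
pub fn last(re: &Re) -> HashSet<u32> {
    match re {
        Re::Sym(_, i) => HashSet::from([*i]),
        Re::Concat(l, r) => {
            let mut s = last(r);
            if nullable(r) {
                s.extend(last(l));
            }
            s
        }
        Re::Or(l, r) => {
            let mut s = last(l);
            s.extend(last(r));
            s
        }
        Re::Star(inner) => last(inner),
    }
}

/// Pairs (p, q) where position q may directly follow position p.
pub fn follow(re: &Re) -> HashSet<(u32, u32)> {
    // Cartesian product: every position in `from` paired with every position in `to`.
    fn cross(from: &HashSet<u32>, to: &HashSet<u32>) -> HashSet<(u32, u32)> {
        from.iter()
            .flat_map(|&a| to.iter().map(move |&b| (a, b)))
            .collect()
    }
    match re {
        Re::Sym(..) => HashSet::new(),
        Re::Or(l, r) => {
            let mut s = follow(l);
            s.extend(follow(r));
            s
        }
        Re::Concat(l, r) => {
            let mut s = follow(l);
            s.extend(follow(r));
            s.extend(cross(&last(l), &first(r)));
            s
        }
        Re::Star(inner) => {
            let mut s = follow(inner);
            s.extend(cross(&last(inner), &first(inner)));
            s
        }
    }
}

/// Builds the tree for `a*bc` with positions a=1, b=2, c=3.
pub fn a_star_bc() -> Re {
    let a_star = Re::Star(Box::new(Re::Sym('a', 1)));
    let ab = Re::Concat(Box::new(a_star), Box::new(Re::Sym('b', 2)));
    Re::Concat(Box::new(ab), Box::new(Re::Sym('c', 3)))
}
```

For `a*bc` the three positions yield a four-state automaton: the initial state has edges to the `first` positions, each `follow` pair becomes a transition labelled with the target position's symbol, and the `last` positions are accepting.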
@@ -0,0 +1,7 @@
#[doc = include_str!("../README.md")]
#[cfg(not(doctest))]
pub mod nfa;
pub mod translation;

use std::sync::atomic::AtomicU32;
pub static TERMINAL_COUNT: AtomicU32 = AtomicU32::new(0);
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -0,0 +1,13 @@
[package]
name = "gregex-macros"
version = "0.1.0"
edition = "2021"

[dependencies]
gregex-logic = { path = "../gregex-logic" }
syn = { version = "1.0", features = ["full"] }
quote = "1.0"
proc-macro2 = "1.0"

[lib]
proc-macro = true
@@ -0,0 +1,19 @@
# Gregex Macros
Contains the macro interface for all the gregex functions.

Without these, users would have to rely on functions that generate the Node tree. To explain this, we can first look at an example.

Let's take the regex `a*`.

The Node tree in our case would be:
```rust
Node::Operation(
    Operator::Production,
    Box::new(Node::Terminal('a', 0u32)),
    None,
)
```

Although we can wrap this in a function or a `macro_rules!` macro, the generated code is quite bloated. A procedural macro can instead do the hard work at compile time, i.e. converting the regex all the way to the final NFA.

Converting directly to the NFA at compile time is not yet possible, but this crate can convert the regex to the intermediate Node tree form.
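The tree shape described above can be sketched at runtime with plain functions. This is only an illustration of what the macros are meant to produce: the `Node` and `Operator` definitions below are local stand-ins modelled on the snippet in this README, and `fold_terminals` and `star` are hypothetical helper names, not part of gregex.

```rust
/// Local stand-in for gregex-logic's `Operator` (assumed shape).
#[derive(Debug, Clone, PartialEq)]
pub enum Operator {
    Or,
    Concat,
    Production,
}

/// Local stand-in for gregex-logic's `Node` (assumed shape).
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Operation(Operator, Box<Node>, Option<Box<Node>>),
    Terminal(char, u32),
}

/// Left-folds terminals into nested binary operations, numbering each
/// terminal with its index. Returns `None` for an empty input.
pub fn fold_terminals(op: Operator, symbols: &[char]) -> Option<Node> {
    let mut nodes = symbols
        .iter()
        .enumerate()
        .map(|(i, &c)| Node::Terminal(c, i as u32));
    let first = nodes.next()?;
    Some(nodes.fold(first, |left, right| {
        Node::Operation(op.clone(), Box::new(left), Some(Box::new(right)))
    }))
}

/// Wraps a node in the unary star (`Production`) operation.
pub fn star(inner: Node) -> Node {
    Node::Operation(Operator::Production, Box::new(inner), None)
}
```

Here `star(Node::Terminal('a', 0))` yields exactly the tree shown for `a*`, and `fold_terminals(Operator::Concat, &['a', 'b', 'c'])` nests to the left as `(a . b) . c`. A procedural macro can emit the same structure as tokens, so the nesting is decided at compile time rather than by runtime calls.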
@@ -0,0 +1,173 @@
#[doc = include_str!("../README.md")]
#[cfg(not(doctest))]
extern crate proc_macro;

use proc_macro::TokenStream;
use quote::quote;
use syn::{parse_macro_input, Expr, ExprLit, ExprMacro, Lit};

#[proc_macro]
pub fn dot(input: TokenStream) -> TokenStream {
    let inputs = parse_macro_input!(input with syn::punctuated::Punctuated::<Expr, syn::Token![,]>::parse_terminated);

    let nodes = inputs.iter().map(|expr| {
        match expr {
            Expr::Macro(ExprMacro { mac, .. }) => {
                // Handle procedural macro
                quote! { #mac }
            }
            Expr::Lit(ExprLit { lit, .. }) => match lit {
                Lit::Char(c) => {
                    let count = gregex_logic::TERMINAL_COUNT
                        .fetch_add(1, core::sync::atomic::Ordering::SeqCst);
                    quote! {
                        gregex_logic::translation::node::Node::Terminal(#c, #count)
                    }
                }
                _ => panic!("Unsupported literal type"),
            },
            _ => panic!("Unsupported input type"),
        }
    });

    // Generate the code for concatenating nodes
    let mut iter = nodes.into_iter();
    let first = iter.next().expect("The input is empty");
    let operations = iter.fold(first, |left, right| {
        quote! {
            gregex_logic::translation::node::Node::Operation(
                gregex_logic::translation::operator::Operator::Concat,
                Box::new(#left),
                Some(Box::new(#right))
            )
        }
    });

    // Generate the final token stream
    let gen = quote! {
        #operations
    };

    gen.into()
}

#[proc_macro]
pub fn or(input: TokenStream) -> TokenStream {
    let inputs = parse_macro_input!(input with syn::punctuated::Punctuated::<Expr, syn::Token![,]>::parse_terminated);

    let nodes = inputs.iter().map(|expr| {
        match expr {
            Expr::Macro(ExprMacro { mac, .. }) => {
                // Handle procedural macro
                quote! { #mac }
            }
            Expr::Lit(ExprLit { lit, .. }) => match lit {
                Lit::Char(c) => {
                    let count = gregex_logic::TERMINAL_COUNT
                        .fetch_add(1, core::sync::atomic::Ordering::SeqCst);
                    quote! {
                        gregex_logic::translation::node::Node::Terminal(#c, #count)
                    }
                }
                _ => panic!("Unsupported literal type"),
            },
            _ => panic!("Unsupported input type"),
        }
    });

    // Generate the code for combining nodes with the Or (alternation) operator
    let mut iter = nodes.into_iter();
    let first = iter.next().expect("The input is empty");
    let operations = iter.fold(first, |left, right| {
        quote! {
            gregex_logic::translation::node::Node::Operation(
                gregex_logic::translation::operator::Operator::Or,
                Box::new(#left),
                Some(Box::new(#right))
            )
        }
    });

    // Generate the final token stream
    let gen = quote! {
        #operations
    };

    gen.into()
}

#[proc_macro]
pub fn star(input: TokenStream) -> TokenStream {
    let expr = parse_macro_input!(input as Expr);

    let node = match expr {
        Expr::Macro(ExprMacro { mac, .. }) => {
            // Handle procedural macro
            quote! { #mac }
        }
        Expr::Lit(ExprLit { lit, .. }) => match lit {
            Lit::Char(c) => {
                let count =
                    gregex_logic::TERMINAL_COUNT.fetch_add(1, core::sync::atomic::Ordering::SeqCst);
                quote! {
                    gregex_logic::translation::node::Node::Terminal(#c, #count)
                }
            }
            _ => panic!("Unsupported literal type"),
        },
        _ => panic!("Unsupported input type"),
    };

    // Generate the code for the star operation
    let operation = quote! {
        gregex_logic::translation::node::Node::Operation(
            gregex_logic::translation::operator::Operator::Production,
            Box::new(#node),
            None
        )
    };

    // Generate the final token stream
    let gen = quote! {
        #operation
    };

    gen.into()
}

#[proc_macro]
pub fn regex(input: TokenStream) -> TokenStream {
    let expr = parse_macro_input!(input as Expr);

    // Convert the input expression into a Node structure
    let node = match expr {
        Expr::Macro(ExprMacro { mac, .. }) => {
            // Handle procedural macro
            quote! { #mac }
        }
        Expr::Lit(ExprLit { lit, .. }) => match lit {
            Lit::Char(c) => {
                let count =
                    gregex_logic::TERMINAL_COUNT.fetch_add(1, core::sync::atomic::Ordering::SeqCst);
                quote! {
                    gregex_logic::translation::node::Node::Terminal(#c, #count)
                }
            }
            _ => panic!("Unsupported literal type"),
        },
        _ => panic!("Unsupported input type"),
    };

    // Generate the code to convert the Node into a Regex
    let gen = quote! {
        {
            let regex_tree = #node;
            let prefix_set = gregex_logic::translation::node::prefix_set(&regex_tree);
            let suffix_set = gregex_logic::translation::node::suffix_set(&regex_tree);
            let factors_set = gregex_logic::translation::node::factors_set(&regex_tree);
            gregex_logic::nfa::NFA::set_to_nfa(&prefix_set, &suffix_set, &factors_set)
        }
    };

    gen.into()
}