Add support for PostgreSQL regex match #870
Thanks @b41sh -- I plan to review this PR tomorrow morning. I apologize for being backed up on reviews.
alamb
left a comment
I tested this PR out locally and it looks great 🥇 Nice work @b41sh! I found the code clean and beautiful to review.
The only other thing I can think of that might be worth adding is some end-to-end tests in datafusion/tests/sql.rs.
For anyone else interested, here is how I tested:
Get updated sqlparser
diff --git a/Cargo.toml b/Cargo.toml
index d6da8c14c..bb3aa2001 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -29,3 +29,6 @@ members = [
]
exclude = ["python"]
+
+[patch.crates-io]
+sqlparser = { git = "https://github.com/b41sh/sqlparser-rs.git", branch = "regexp_match"}
\ No newline at end of file
diff --git a/ballista/rust/core/Cargo.toml b/ballista/rust/core/Cargo.toml
index b2fa50c88..66f61b6e6 100644
--- a/ballista/rust/core/Cargo.toml
+++ b/ballista/rust/core/Cargo.toml
@@ -37,7 +37,7 @@ hashbrown = "0.11"
log = "0.4"
prost = "0.8"
serde = {version = "1", features = ["derive"]}
-sqlparser = "0.9.0"
+sqlparser = "0.9.1-alpha.0"
tokio = "1.0"
tonic = "0.5"
uuid = { version = "0.8", features = ["v4"] }
diff --git a/datafusion/Cargo.toml b/datafusion/Cargo.toml
index 286be8a7a..9819275ab 100644
--- a/datafusion/Cargo.toml
+++ b/datafusion/Cargo.toml
@@ -51,7 +51,7 @@ ahash = "0.7"
hashbrown = { version = "0.11", features = ["raw"] }
arrow = { version = "5.1", features = ["prettyprint"] }
parquet = { version = "5.1", features = ["arrow"] }
-sqlparser = "0.9.0"
+sqlparser = "0.9.1-alpha.0"
paste = "^1.0"
num_cpus = "1.13.0"
chrono = "0.4"
Run datafusion-cli:
echo "foo" > /tmp/foo.csv
echo "Bar" >> /tmp/foo.csv
echo "Baz" >> /tmp/foo.csv
echo "ZZ" >> /tmp/foo.csv
cargo run -p datafusion-cli
> CREATE EXTERNAL TABLE foo(a VARCHAR) STORED AS CSV LOCATION '/tmp/foo.csv';
0 rows in set. Query took 0.002 seconds.
> select * from foo where a ~ 'z';
+-----+
| a   |
+-----+
| Baz |
+-----+
1 row in set. Query took 0.012 seconds.
> select * from foo where a ~* 'z';
+-----+
| a   |
+-----+
| Baz |
| ZZ  |
+-----+
2 rows in set. Query took 0.026 seconds.
> select * from foo where a !~ 'z';
+-----+
| a   |
+-----+
| foo |
| Bar |
| ZZ  |
+-----+
3 rows in set. Query took 0.012 seconds.
> select * from foo where a !~* 'z';
+-----+
| a   |
+-----+
| foo |
| Bar |
+-----+
2 rows in set. Query took 0.012 seconds.
.expect("regexp_match_array_op failed to downcast array");

let array = match $CASE {
    true => regexp_match(&ll, &rr, Some(&$ARRAYTYPE::from(vec!["i"; ll.len()])))?,
It is unfortunate that arrow-rs doesn't have a regexp_filter style kernel that returns a BooleanArray:
https://sourcegraph.com/github.com/apache/arrow-rs/-/blob/arrow/src/compute/kernels/regexp.rs?L16
I filed apache/arrow-rs#697 and #905 to track that potential work, in case we want to improve the performance of these operators in future versions of DataFusion.
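To illustrate the extra step the comment above is pointing at, here is a minimal plain-Rust sketch of turning per-row match results into a boolean mask. It deliberately does not use the arrow-rs API: `MatchList` and `matches_to_mask` are hypothetical names, and the sketch assumes that a row without a match is represented as `None` (standing in for a null list entry) and ignores SQL NULL inputs.

```rust
/// Hypothetical plain-Rust stand-in for a list-of-captures result:
/// `Some(captures)` for a row that matched, `None` for a row that did not.
pub type MatchList = Vec<Option<Vec<String>>>;

/// Collapse per-row match results into a boolean mask.
/// `negated` is true for the `!~` and `!~*` operators.
pub fn matches_to_mask(matches: &MatchList, negated: bool) -> Vec<bool> {
    matches
        .iter()
        // A row passes when "matched" differs from "negated".
        .map(|m| m.is_some() != negated)
        .collect()
}
```

A kernel that returned the boolean mask directly would avoid building the intermediate capture lists, which is presumably the performance opportunity the tracking issues refer to.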
#934 is now merged, so upon rebase this PR should pass CI and be ready to go.
The Python test failures have been resolved. One more rebase and I bet this one can finally be merged in. Super kudos for keeping with it @b41sh -- thank you.
…he#870) * basic version of string to float/double/decimal * docs * update benches * update benches * rust doc
Which issue does this PR close?
Closes #795
Rationale for this change
Implement the PostgreSQL pattern matching operators (~, ~*, !~, !~*)
What changes are included in this PR?
regexp_match function to do regular expression matching
Are there any user-facing changes?
support pattern matching using POSIX regular expressions
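As a rough summary of the four operators, the sketch below models their case-sensitivity and negation rules in plain Rust. It is not the PR's implementation: `RegexOp`, `literal_op`, and `filter_rows` are hypothetical names, and the helper only handles a literal pattern, using substring containment as a stand-in for a real regular expression engine.

```rust
/// The four PostgreSQL pattern matching operators.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum RegexOp {
    Match,     // ~   case-sensitive match
    IMatch,    // ~*  case-insensitive match
    NotMatch,  // !~  case-sensitive non-match
    NotIMatch, // !~* case-insensitive non-match
}

/// Hypothetical helper: evaluates `op` for a literal pattern only.
/// Substring containment stands in for regex matching (an assumption
/// that holds only for patterns without metacharacters, such as "z").
pub fn literal_op(value: &str, pattern: &str, op: RegexOp) -> bool {
    let (case_insensitive, negated) = match op {
        RegexOp::Match => (false, false),
        RegexOp::IMatch => (true, false),
        RegexOp::NotMatch => (false, true),
        RegexOp::NotIMatch => (true, true),
    };
    let found = if case_insensitive {
        value.to_lowercase().contains(&pattern.to_lowercase())
    } else {
        value.contains(pattern)
    };
    // Negated operators keep exactly the rows the positive form rejects.
    found != negated
}

/// Keep the rows that pass the operator, like a WHERE clause would.
pub fn filter_rows<'a>(rows: &[&'a str], pattern: &str, op: RegexOp) -> Vec<&'a str> {
    rows.iter()
        .copied()
        .filter(|v| literal_op(v, pattern, op))
        .collect()
}
```

With the rows `foo`, `Bar`, `Baz`, `ZZ` and the pattern `z`, this model selects the same rows as the datafusion-cli session in the review comment.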