Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

port regexp_like function and port related tests #9397

Merged
merged 7 commits into from
Mar 4, 2024

Conversation

Lordworms
Copy link
Contributor

Which issue does this PR close?

#9328 part II

Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) labels Feb 29, 2024
@Lordworms Lordworms marked this pull request as draft February 29, 2024 02:50
@github-actions github-actions bot added the core Core DataFusion crate label Feb 29, 2024
@Lordworms Lordworms force-pushed the issue_9328_2 branch 2 times, most recently from 6949611 to 611e23a Compare March 2, 2024 01:45
@Lordworms Lordworms marked this pull request as ready for review March 2, 2024 02:47
// specific language governing permissions and limitations
// under the License.

//! Encoding expressions
Copy link
Contributor

@yyy1000 yyy1000 Mar 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the repo and thought it may be a small typo, and it also appears in the previous regex function.
It's not related to Encoding expressions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks buddy, I will fix it

@@ -61,6 +61,7 @@ chrono = { workspace = true }
datafusion-common = { workspace = true, default-features = true }
datafusion-execution = { workspace = true }
datafusion-expr = { workspace = true }
datafusion-functions = { workspace = true }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it be moved into dev-dependencies? I didn't verify and just be curious.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will verify it , Thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should have a dependence between datafusion-functions and datafusion-physical-exprs as we are trying to make the crates more indenpendent. I commented below on some ways to avoid this dependency

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Lordworms -- this looks very close . I agree with @yyy1000 's suggestion / concern about adding the new dependency. Otherwise this PR is 👌

Thanks again @yyy1000 for the review

@@ -61,6 +61,7 @@ chrono = { workspace = true }
datafusion-common = { workspace = true, default-features = true }
datafusion-execution = { workspace = true }
datafusion-expr = { workspace = true }
datafusion-functions = { workspace = true }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should have a dependence between datafusion-functions and datafusion-physical-exprs as we are trying to make the crates more indenpendent. I commented below on some ways to avoid this dependency

@@ -23,15 +23,12 @@ use std::sync::Arc;
use arrow_array::builder::StringBuilder;
use arrow_array::{ArrayRef, StringArray};
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use datafusion_functions::regex::regexplike::regexp_like;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than making this dependency, I think we could instead move the benchmark (for regexp_like) to datafusion/functions/benches/regex.rs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it would be a great idea. Should I port all other regexp benchmark in this PR or in the following one? I will port regexp_like first

@@ -78,20 +74,6 @@ fn flags(rng: &mut ThreadRng) -> StringArray {
}

fn criterion_benchmark(c: &mut Criterion) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will file another PR to finish port regx_replace and delete this bench later today

delete useless test

fix chores

change regexp_like

change lock

change format

fix typo

change prost

remove unused

adding usage

adding dependency

adding dependency

dep
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Lordworms -- I reviewed this PR and it looks good to me. I want to double check that this doesn't change performance (there are some ClickBench queries that use regexp) so I will run the benchmarks and report back before merging

@alamb
Copy link
Contributor

alamb commented Mar 3, 2024

My ClickBench benchmark run shows no significant change:

Details

--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ issue_9328_2 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.77ms │       0.75ms │     no change │
│ QQuery 1     │    85.19ms │      87.65ms │     no change │
│ QQuery 2     │   185.70ms │     188.47ms │     no change │
│ QQuery 3     │   203.55ms │     202.47ms │     no change │
│ QQuery 4     │  2014.07ms │    1993.45ms │     no change │
│ QQuery 5     │  1835.24ms │    1883.81ms │     no change │
│ QQuery 6     │    84.19ms │      82.70ms │     no change │
│ QQuery 7     │    90.61ms │      90.52ms │     no change │
│ QQuery 8     │  3012.36ms │    2982.53ms │     no change │
│ QQuery 9     │  2226.45ms │    2153.82ms │     no change │
│ QQuery 10    │   797.30ms │     798.08ms │     no change │
│ QQuery 11    │   885.18ms │     882.21ms │     no change │
│ QQuery 12    │  1996.77ms │    2002.07ms │     no change │
│ QQuery 13    │  4340.28ms │    4215.03ms │     no change │
│ QQuery 14    │  2652.14ms │    2645.28ms │     no change │
│ QQuery 15    │  2230.46ms │    2212.42ms │     no change │
│ QQuery 16    │  5533.51ms │    5411.94ms │     no change │
│ QQuery 17    │  5419.85ms │    5256.58ms │     no change │
│ QQuery 18    │ 11082.06ms │   10974.93ms │     no change │
│ QQuery 19    │   160.32ms │     159.33ms │     no change │
│ QQuery 20    │  2525.27ms │    2560.97ms │     no change │
│ QQuery 21    │  3329.14ms │    3289.60ms │     no change │
│ QQuery 22    │  9023.08ms │    8949.14ms │     no change │
│ QQuery 23    │ 21485.68ms │   21301.62ms │     no change │
│ QQuery 24    │  1337.12ms │    1319.03ms │     no change │
│ QQuery 25    │  1139.19ms │    1137.60ms │     no change │
│ QQuery 26    │  1438.28ms │    1424.81ms │     no change │
│ QQuery 27    │  3805.54ms │    3801.18ms │     no change │
│ QQuery 28    │ 30573.75ms │   30438.09ms │     no change │
│ QQuery 29    │  1055.06ms │    1038.96ms │     no change │
│ QQuery 30    │  2395.69ms │    2332.10ms │     no change │
│ QQuery 31    │  3131.58ms │    3007.00ms │     no change │
│ QQuery 32    │ 15260.30ms │   15058.24ms │     no change │
│ QQuery 33    │  8760.71ms │    8810.44ms │     no change │
│ QQuery 34    │ 13080.73ms │   12910.91ms │     no change │
│ QQuery 35    │  3627.75ms │    3601.54ms │     no change │
│ QQuery 36    │   332.65ms │     315.64ms │ +1.05x faster │
│ QQuery 37    │   219.30ms │     221.42ms │     no change │
│ QQuery 38    │   186.01ms │     189.70ms │     no change │
│ QQuery 39    │  1085.73ms │    1061.86ms │     no change │
│ QQuery 40    │    85.20ms │      86.09ms │     no change │
│ QQuery 41    │    78.69ms │      80.66ms │     no change │
│ QQuery 42    │    90.88ms │      91.84ms │     no change │
└──────────────┴────────────┴──────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary           ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (main_base)      │ 168883.32ms │
│ Total Time (issue_9328_2)   │ 167252.48ms │
│ Average Time (main_base)    │   3927.52ms │
│ Average Time (issue_9328_2) │   3889.59ms │
│ Queries Faster              │           1 │
│ Queries Slower              │           0 │
│ Queries with No Change      │          42 │
└─────────────────────────────┴─────────────┘

@alamb alamb merged commit 5188a5d into apache:main Mar 4, 2024
24 checks passed
@alamb
Copy link
Contributor

alamb commented Mar 4, 2024

Thanks again @Lordworms

wiedld pushed a commit to wiedld/arrow-datafusion that referenced this pull request Mar 21, 2024
* port regexp_like function and port related tests

delete useless test

fix chores

change regexp_like

change lock

change format

fix typo

change prost

remove unused

adding usage

adding dependency

adding dependency

dep

* adding tests

* remove useless

* remove unused

* remove unused

* change dependency structure

* format toml
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants