Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prioritize UDF over scalar built-in function in case of function name… #6601

Merged
merged 3 commits into from
Jun 12, 2023

Conversation

epsio-banay
Copy link
Contributor

@epsio-banay epsio-banay commented Jun 8, 2023

Which issue does this PR close?

Closes #6588.

Rationale for this change

This PR lets a user override the built-in implementation of scalar functions (such as abs, floor, etc) using the ScalarUdf feature.
We are currently only using the logical planner of Datafusion, this feature would allow us and others to have more control over the built-in scalar functions.
It should also be inline with Datafusion's intention to be customizable.

What changes are included in this PR?

Scalar UDFs will have precedence over Scalar built-in functions in case of name conflicts.
For example if a user registers a UDF with the name "abs" (as the built-in function called "abs"), The sql "SELECT abs(10)" will use the UDF instead of the built-in function.

Are these changes tested?

Yes

Are there any user-facing changes?

Yes.
DataFusion will prioritize UDF over built-in scalar function in case of function name conflict.

@github-actions github-actions bot added core Core DataFusion crate sql SQL Planner sqllogictest SQL Logic Tests (.slt) labels Jun 8, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution @epsio-banay ! I left some comments -- let me know if they don't make sense

@@ -191,6 +191,8 @@ config_namespace! {
/// MySQL, PostgreSQL, Hive, SQLite, Snowflake, Redshift, MsSQL, ClickHouse, BigQuery, and Ansi.
pub dialect: String, default = "generic".to_string()

/// Should DataFusion prioritize UDF over built-in scalar function in case of function name conflict.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually don't think a config flag is needed -- if a UDF is registered with the same name as a built in, it will not be callable (as you note in the issue / PR description). Therefore I think it is safe to assume that the if a user registers a function with the same name as a built in they want the new function to have precidence

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

Copy link
Contributor Author

@epsio-banay epsio-banay Jun 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.
Thanks for responding so fast, this PR got much more simple.
I'll also edit the PR description - remove the described config flag

));

// Test data
let test_data = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than checking the plan here, what do you think about simply running the query and verifying the output?

For example, you could run the query

SELECT ABS(-100)

And if it returns 1 you know the UDF is called, and if it returns 100 you know the built in was called

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in scalar_udf_override_built_in_scalar_function

@alamb
Copy link
Contributor

alamb commented Jun 9, 2023

Marking as draft as this PR has feedback and is awaiting a response (I am trying to keep the list of PRs needing review clear) -- please mark as ready for review when ready

@alamb alamb marked this pull request as draft June 9, 2023 13:37
@epsio-banay epsio-banay force-pushed the feature/prioritize_udf branch 2 times, most recently from 180a167 to 319df05 Compare June 11, 2023 08:27
@github-actions github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Jun 11, 2023
@epsio-banay epsio-banay marked this pull request as ready for review June 11, 2023 10:53
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me @epsio-banay -- thank you very much for the contribution

@alamb
Copy link
Contributor

alamb commented Jun 11, 2023

I think this branch has a logical merge conflict with some of the changes on master. If I applied this diff locally the tests passed:

diff --git a/datafusion/core/tests/sql/udf.rs b/datafusion/core/tests/sql/udf.rs
index 0ecd5d0fd..eb76c3417 100644
--- a/datafusion/core/tests/sql/udf.rs
+++ b/datafusion/core/tests/sql/udf.rs
@@ -185,7 +185,7 @@ async fn scalar_udf_override_built_in_scalar_function() -> Result<()> {
 
     let batch = RecordBatch::try_new(
         Arc::new(schema.clone()),
-        vec![Arc::new(Int32Array::from(vec![-100]))],
+        vec![Arc::new(Int32Array::from_slice([-100]))],
     )?;
     let ctx = SessionContext::new();
 

@epsio-banay
Copy link
Contributor Author

I think this branch has a logical merge conflict with some of the changes on master. If I applied this diff locally the tests passed:

diff --git a/datafusion/core/tests/sql/udf.rs b/datafusion/core/tests/sql/udf.rs
index 0ecd5d0fd..eb76c3417 100644
--- a/datafusion/core/tests/sql/udf.rs
+++ b/datafusion/core/tests/sql/udf.rs
@@ -185,7 +185,7 @@ async fn scalar_udf_override_built_in_scalar_function() -> Result<()> {
 
     let batch = RecordBatch::try_new(
         Arc::new(schema.clone()),
-        vec![Arc::new(Int32Array::from(vec![-100]))],
+        vec![Arc::new(Int32Array::from_slice([-100]))],
     )?;
     let ctx = SessionContext::new();
 

rebased and fixed:)

@alamb alamb merged commit c96949e into apache:main Jun 12, 2023
@alamb
Copy link
Contributor

alamb commented Jun 12, 2023

Thanks again @epsio-banay

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prioritize UDF over scalar built-in function in case of name collision
3 participants