[datafusion-spark] Implement ceil&floor function for spark#15958
[datafusion-spark] Implement ceil&floor function for spark#15958irenjj wants to merge 5 commits intoapache:mainfrom
Conversation
| fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> { | ||
| Err(generic_internal_err( | ||
| "ceil", | ||
| "`return_type` should not be called, call `return_type_from_args` instead", | ||
| )) | ||
| } |
There was a problem hiding this comment.
Hi @shehabgamin, we encountered an issue here: most of the code was copied from sail. In sail, the return value is constructed using return_type_from_args, which requires access to both args.arg_types and args.scalar_arguments to determine the final value. However, in the latest version of DF, return_type_from_args has been removed. It seems that relying solely on datatype is no longer sufficient to construct the return value within return_type().
There was a problem hiding this comment.
@irenjj Would return_field_from_args work here?
datafusion/datafusion/expr/src/udf.rs
Lines 447 to 487 in 7e89862
|
@irenjj Here are some tests for |
|
@irenjj I'll review this by tomorrow! I'm not a committer though, so we'll still need a review from someone else as well. cc @alamb @andygrove |
|
Hi @irenjj. Could you explain in the PR description how these functions differ from the standard DataFusion implementation (i.e., what is Spark-specific about them)? That will help with reviews. |
|
@parthchandra @huaxingao @comphead @mbutrovich @kazuyukitanimura, FYI, in case you want to review. In Comet, we are currently using DataFusion's ceil and floor functions and have not specialized for Spark. |
| let arg_len = args.args.len(); | ||
| let target_scale = if arg_len == 1 { | ||
| Ok(&None) | ||
| } else if arg_len == 2 { |
There was a problem hiding this comment.
Floor and Ceil expressions in Spark are unary expressions and only take a single argument:
case class Ceil(child: _root_.org.apache.spark.sql.catalyst.expressions.Expression) extends _root_.org.apache.spark.sql.catalyst.expressions.UnaryMathExpression {When calling ceil in Spark SQL with two parameters, it looks like a cast is added to the input expression. For example, ceil(_1, 2) translates to plan: *(1) Project [ceil(cast(_1#6 as decimal(30,15)), 2) AS ceil(_1, 2)#10.
Perhaps you implemented the two-argument version to work around the fact that DataFusion's logical optimizer doesn't implement the same logic as Spark for adding this cast? I wonder if it is worth looking into adding a Spark-compatible optimizer rule to add this cast in the future.
There was a problem hiding this comment.
Actually, I have confused myself now. The plan does still show two arguments, but the physical plan only has a single argument 😕
There was a problem hiding this comment.
Ah, there is a separate RoundCeil in Spark for the two-argument case:
case class RoundCeil(child: Expression, scale: Expression)There was a problem hiding this comment.
Got it. In this PR, we only need to implement the single-argument version of ceil, and add the new function roundceil as a two-argument implementation in a future PR.
There was a problem hiding this comment.
I think it is fine to have both cases covered in a single expression in DataFusion. Sorry for the confusion.
| query error | ||
| SELECT ceil(3.1411, -3); | ||
| ---- | ||
| DataFusion error: Error during planning: Failed to coerce arguments to satisfy a call to 'ceil' function: coercion from [Float64, Int64] to the signature Uniform(1, [Float64, Float32]) failed |
There was a problem hiding this comment.
I assume that these errors need to be fixed still
There was a problem hiding this comment.
@irenjj Sail doesn't use the same SQL parser as DataFusion, as it follows Spark SQL syntax.
You’ll need to cast values to ensure the tests interpret the correct DataType.
For example, instead of this:
SELECT ceil(3.1411, -3);
Use:
SELECT ceil(3.1411::decimal(5,4), -3::INT);
Copy-pasting the exact queries from the files below won't work directly. However, if you look closely, you’ll see that each query includes the necessary cast for the first argument (the second argument is always an INT):
- https://github.com/lakehq/sail/blob/main/python/pysail/tests/spark/function/test_ceil.txt
- https://github.com/lakehq/sail/blob/main/python/pysail/tests/spark/function/test_floor.txt
However, you should expect the same result values from those two files.
shehabgamin
left a comment
There was a problem hiding this comment.
Nice progress! Just gotta clean up the test files a bit :)
| query error | ||
| SELECT ceil(3.1411, -3); | ||
| ---- | ||
| DataFusion error: Error during planning: Failed to coerce arguments to satisfy a call to 'ceil' function: coercion from [Float64, Int64] to the signature Uniform(1, [Float64, Float32]) failed |
There was a problem hiding this comment.
@irenjj Sail doesn't use the same SQL parser as DataFusion, as it follows Spark SQL syntax.
You’ll need to cast values to ensure the tests interpret the correct DataType.
For example, instead of this:
SELECT ceil(3.1411, -3);
Use:
SELECT ceil(3.1411::decimal(5,4), -3::INT);
Copy-pasting the exact queries from the files below won't work directly. However, if you look closely, you’ll see that each query includes the necessary cast for the first argument (the second argument is always an INT):
- https://github.com/lakehq/sail/blob/main/python/pysail/tests/spark/function/test_ceil.txt
- https://github.com/lakehq/sail/blob/main/python/pysail/tests/spark/function/test_floor.txt
However, you should expect the same result values from those two files.
|
@irenjj - I wonder if you would be willing to help pick this PR back up now that we have merged a PR with a bunch of tests from @shehabgamin here: We are hoping to generate some example PRs to use when starting to fill out the rest of the functions and I was thinking this one might be a good one |
Sure! I need some time to figure out why new added ceil/floor only works with one argument(it seems datafusion has an original implementation for ceil/floor), i'll do this work next week(little busy this week) -- Sorry for dragging out this PR for so long @shehabgamin @andygrove |
Thanks @irenjj -- no worries about the timing. I know you are busy and are also on the leading edge of these spark functions (so the process of implementation is a bit rough). Thank you for helping us figure it out |
|
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Which issue does this PR close?
ceilfunction #15916Rationale for this change
Add Spark-style ceil&floor function, the differences between DataFusion and Spark's implementations of mathematical functions
ceilandfloor:Function Signatures:
ceil(x, scale)accepting two parametersceil(x)Type Support:
What changes are included in this PR?
Add new ceil&floor function.
Are these changes tested?
Yes.
Are there any user-facing changes?
No.