[SPARK-28741][SQL] Optional mode: throw exceptions when casting to integers causes overflow #25461
Conversation
Test build #109151 has finished for PR 25461 at commit
retest this please.
retest this please
Test build #109156 has finished for PR 25461 at commit
Test build #109178 has finished for PR 25461 at commit
retest this please.
Also cc @cloud-fan @mgaido91
Test build #109196 has started for PR 25461 at commit
Test build #109193 has finished for PR 25461 at commit
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/types/numerics.scala
private[this] def castDecimalToIntegerCode(
    ctx: CodegenContext,
    intType: String): CastFunction = {
`intType`? Do you mean `inType`?

It's actually `integerType`, but the full name makes some code longer than 100 characters. I can change it if you think it is misleading.
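As a rough illustration of what this thread is about, a decimal-to-int cast with an overflow check might look like the sketch below. The name `castDecimalToIntExact` and the rounding mode are illustrative assumptions, not the actual Spark implementation:

```scala
// Hypothetical sketch: round a decimal, then range-check before narrowing to Int.
// `castDecimalToIntExact` is an illustrative name, not a real Spark method.
def castDecimalToIntExact(d: BigDecimal): Int = {
  val rounded = d.setScale(0, BigDecimal.RoundingMode.HALF_UP)
  if (rounded < BigDecimal(Int.MinValue) || rounded > BigDecimal(Int.MaxValue)) {
    throw new ArithmeticException(s"Casting $d to int causes overflow")
  }
  rounded.toInt
}
```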
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
case x: NumericType if failOnIntegerOverflow =>
  b =>
    val intValue = try {
      x.exactNumeric.asInstanceOf[Numeric[Any]].toInt(b)
Why do you cast it into int first?
The trait `Numeric` doesn't have the method `toInt`. Before this code change, the value was also cast to int.
Ah, I see.
Can't we check the valid value range in a single place instead of the current two checks on lines 520 and 525?
Well, we can do it by matching case by case, but then the code gets a bit long. Casting to short/byte should be a minor use case. Also, the previous code cast to `Int` before casting to `Short`.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
checkEvaluation(cast(Literal(value, TimestampType), LongType),
  Math.floorDiv(value, MICROS_PER_SECOND))
}
checkEvaluation(cast(9223372036854775807.9f, LongType), 9223372036854775807L)
How about doing boundary tests like this?
checkEvaluation(cast(java.lang.Math.nextDown(9223372036854775807.9f), LongType), 9223372036854775807L) --> non-overflow case
checkEvaluation(cast(java.lang.Math.nextUp(9223372036854775807.9f), LongType), 9223372036854775807L) --> overflow case
Why would we do that?
scala> java.lang.Math.nextDown(9223372036854775807.9D) < 9223372036854775807.9D
res23: Boolean = true
Ah, it's OK to do it like this instead:
checkEvaluation(cast(9223372036854775807.9f, LongType), 9223372036854775807L) --> non-overflow case
checkEvaluation(cast(java.lang.Math.nextUp(9223372036854775807.9f), LongType), 9223372036854775807L) --> overflow case
What I'm a little worried about is that `9223372036854775807.9f` is implicitly truncated (to `9223372036854776000.0f`?) by the compiler because it cannot be represented in the IEEE 754 float format, as you said before. So, IIUC, the test is actually the same as `cast(9223372036854776000.0f, LongType)`?
What I understand is as follows (sorted by value, descending); is this correct?
IEEE 754 consecutive float values:
overflow case:     9223373136366404000.0f <-- Math.nextUp(9223372036854775807.9f)
non-overflow case: 9223372036854776000.0f <-- 9223372036854775807.9f
non-overflow case: 9223371487098961900.0f <-- Math.nextDown(9223372036854775807.9f)
@maropu Yes, I think you are right
Thanks for the check!
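The float analysis in this thread can be verified directly on the JVM. The snippet below is just a verification sketch: it shows that the literal collapses to 2^63, because `Float` has only 24 bits of significand:

```scala
// Float cannot represent 9223372036854775807.9 exactly; the literal rounds to 2^63.
val f = 9223372036854775807.9f
// 2^63 as a float, for comparison
val twoTo63 = math.pow(2, 63).toFloat
println(f == twoTo63)                  // the literal and 2^63 are the same float
println(java.lang.Math.nextUp(f) > f)  // the next representable float is strictly larger
```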
Test build #109297 has finished for PR 25461 at commit
retest this please.
Test build #109300 has finished for PR 25461 at commit
test("Cast to byte with option FAIL_ON_INTEGER_OVERFLOW enabled") {
  withSQLConf(SQLConf.FAIL_ON_INTEGER_OVERFLOW.key -> "true") {
    testIntMaxAndMin(ByteType)
Why do we need to test casting `int.max + 1` to byte? I think it's good enough to test casting `byte.max + 1` to byte.
I think it is always good to have more test cases here, as long as it doesn't increase the testing time by more than a few seconds.
For example, if casting double to byte were implemented as:
val x = doubleValue.toShort
if (x.toByte == x) {
  x.toByte
} else {
  throw new ...
}
we could find that it is wrong with this test case, because `(Int.MaxValue + 1.0).toShort.toByte == (Int.MaxValue + 1.0).toShort` is true.
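The claim above is easy to verify on the JVM (a quick standalone check, not Spark code): double-to-int conversion saturates at `Int.MaxValue`, and the subsequent truncation to `Short` yields -1, which round-trips through `Byte` unchanged:

```scala
// (Int.MaxValue + 1.0).toShort goes double -> int (saturating to Int.MaxValue)
// -> short (truncating to -1), so the buggy byte-range check would pass.
val v = Int.MaxValue + 1.0   // 2147483648.0
val s = v.toShort            // -1 after saturation and truncation
println(s.toByte == s)       // true: the flawed check cannot catch this input
```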
test("Cast to short with option FAIL_ON_INTEGER_OVERFLOW enabled") {
  withSQLConf(SQLConf.FAIL_ON_INTEGER_OVERFLOW.key -> "true") {
    testIntMaxAndMin(ShortType)
ditto
}
checkEvaluation(cast(9223372036854775807.9f, LongType), 9223372036854775807L)
checkEvaluation(cast(-9223372036854775808.9f, LongType), -9223372036854775808L)
checkEvaluation(cast(9223372036854775807.9D, LongType), 9223372036854775807L)
How about `checkEvaluation(cast(0.9D + Long.MaxValue, LongType), Long.MaxValue)`?
LGTM except a few minor comments
@@ -474,8 +477,12 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String
    buildCast[Boolean](_, b => if (b) 1 else 0)
  case DateType =>
    buildCast[Int](_, d => null)
  case TimestampType if failOnIntegerOverflow =>
do we really need this? AFAIK a timestamp cannot overflow, can it?
It is possible in theory.
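A sketch of why this is possible in theory: Spark stores timestamps as microseconds in a `Long`, so the seconds value obtained when casting to an integral type can exceed the `Int` range. The concrete timestamp below is just an illustration:

```scala
// A timestamp far enough in the future overflows Int when expressed in seconds.
val micros = 3250368000000000L                 // microseconds since epoch (~year 2073)
val seconds = Math.floorDiv(micros, 1000000L)  // 3250368000 seconds
println(seconds > Int.MaxValue)                // true: an Int cast would overflow
```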
@@ -1182,6 +1233,78 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String
    (c, evPrim, evNull) => code"$evPrim = $c != 0;"
  }

  private[this] def castTimestampToIntegerCode(
- private[this] def castTimestampToIntegerCode(
+ private[this] def castTimestampToIntegralCode(
`Integral` is an adjective; "Integral code" seems weird to me. We can call it `castTimestampToIntegralTypeCode` if you insist. I didn't use that because the name is a bit long.
Well, `Integer` is a specific data type, so I think this name is misleading... your suggested one is fine to me.
(c, evPrim, evNull) => code"$evPrim = $c.to${integralType.capitalize}($failOnIntegerOverflow);"
}

private[this] def castIntegerToIntegerExactCode(integralType: String): CastFunction = {
- private[this] def castIntegerToIntegerExactCode(integralType: String): CastFunction = {
+ private[this] def castIntegerToIntegralExactCode(integralType: String): CastFunction = {
(Oh, I didn't know about the ```suggestion functionality, cool.)
(min.toString + typeIndicator, max.toString + typeIndicator)
}

private[this] def castFractionToIntegerExactCode(
- private[this] def castFractionToIntegerExactCode(
+ private[this] def castFractionToIntegralExactCode(
 * @throws ArithmeticException if checkOverflow is true and
 *         the decimal is too big to fit in Long type.
 */
def toLong(checkOverflow: Boolean): Long = {
There is a lot of duplicated code doing this, and an additional function call that could be avoided. Can't we just add a boolean with a default value to the existing functions?
Then all the other calls to `toLong` would at least need parentheses, as `toLong()`, since Scala allows only arity-0 methods (https://docs.scala-lang.org/style/method-invocation.html#arity-0) to omit parentheses.
My two concerns here:
- The existing external code calling `Decimal.toLong` will fail
- The usage will be different from the trait `Numeric`
BTW, renaming it to `toLongExact` won't be accurate either. For example, `toLongExact(1.1)` should fail, while we are actually doing rounding in `toLong`.
For 2: anyway, the usage is already different, and here we're not in a `Numeric`-like class. On 1, I am not sure it is a problem. `Decimal` is `Unstable` and this patch will go into Spark 3.0, a major release (the best place for a breaking change!). And the benefit of avoiding extra function calls and a lot of duplicated code is worth the change, IMHO.
cc @cloud-fan @maropu for their opinion on this too, but I feel quite strongly about this.
Actually, I am not super comfortable with the code changes in Decimal.scala here. I did this to address the comments in #25461 (comment).
How about just adding a new method `roundToLong`? The name is the same as
https://github.com/google/guava/blob/master/guava/src/com/google/common/math/DoubleMath.java#L156, so we can leave `toLong` as it is.
We can't break public classes just to make it easier to write code in Spark.
I think we can remove the additional function call via codegen. We can remove `def toLong(checkOverflow: Boolean): Long`, and in the codegen:
if (nullOnOverFlow) code"decimal.toLong" else code"decimal.roundToLong"
+1 for @cloud-fan 's suggestion. At least we remove the extra method call.
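The suggestion above can be sketched as a small helper. This is an illustrative standalone function with hypothetical names; the real Spark codegen would use its `code` interpolator rather than plain strings:

```scala
// Pick the Decimal method name at code-generation time, so the generated Java
// calls toLong or roundToLong directly, with no extra wrapper method.
def decimalToLongCode(decimalTerm: String, nullOnOverflow: Boolean): String = {
  val method = if (nullOnOverflow) "toLong" else "roundToLong"
  s"$decimalTerm.$method()"
}
```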
import scala.math.Ordering

import org.apache.spark.sql.types.Decimal.DecimalIsConflicted
nit: I am wondering about moving it here...
Do you mean moving `DecimalIsConflicted` into `numerics.scala`? I think it is fine to keep it in `Decimal.scala` for now.
Yes, I meant that. Not a big deal, just to have everything colocated. We can also do it in another PR.
Test build #109596 has finished for PR 25461 at commit
Test build #109601 has finished for PR 25461 at commit
buildConf("spark.sql.arithmeticOperations.failOnOverFlow")
  .doc("If it is set to true, all arithmetic operations on non-decimal fields throw an " +
val FAIL_ON_INTEGER_OVERFLOW =
  buildConf("spark.sql.failOnIntegerOverFlow")
`failOnIntegerOverFlow` -> `failOnIntegralTypeOverFlow`? To me, `Integer` is a bit ambiguous.
@@ -258,6 +258,7 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String

  private lazy val dateFormatter = DateFormatter()
  private lazy val timestampFormatter = TimestampFormatter.getFractionFormatter(zoneId)
  private val failOnIntegerOverflow = SQLConf.get.failOnIntegralTypeOverflow
To be consistent, shall we also rename it to `failOnIntegralTypeOverflow`?
Looks nice, and I have no comments now except for the remaining open ones.
Test build #109617 has finished for PR 25461 at commit
Test build #109628 has finished for PR 25461 at commit
Test build #109635 has finished for PR 25461 at commit
Test build #109636 has finished for PR 25461 at commit
Thanks, merging to master!
@cloud-fan @maropu @mgaido91 Thanks for the review!
What changes were proposed in this pull request?
To follow ANSI SQL, we should support a configurable mode that throws exceptions when casting to integers causes overflow.
The behavior is similar to https://issues.apache.org/jira/browse/SPARK-26218, which throws exceptions on arithmetic operation overflow.
To unify them, the configuration is renamed from `spark.sql.arithmeticOperations.failOnOverFlow` to `spark.sql.failOnIntegerOverFlow`.
How was this patch tested?
Unit test
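A hedged usage sketch of the mode this PR adds, assuming a running `SparkSession` named `spark`. The config key is the one named in the description above and was further discussed during review, so it may differ in later releases:

```scala
// Enable the failure mode, then trigger an overflowing cast.
spark.conf.set("spark.sql.failOnIntegerOverFlow", "true")
// With the flag on, this cast is expected to throw ArithmeticException
// instead of silently wrapping or truncating the value.
spark.sql("SELECT CAST(12345678901234567890D AS INT)").collect()
```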