[SPARK-28741][SQL] Optional mode: throw exceptions when casting to integers causes overflow #25461
Conversation
Test build #109151 has finished for PR 25461 at commit
retest this please.
retest this please
Test build #109156 has finished for PR 25461 at commit
Test build #109178 has finished for PR 25461 at commit
retest this please.
Also cc @cloud-fan @mgaido91
Test build #109196 has started for PR 25461 at commit
Test build #109193 has finished for PR 25461 at commit
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/types/numerics.scala
private[this] def castDecimalToIntegerCode(
    ctx: CodegenContext,
    intType: String): CastFunction = {
`intType`? Do you mean `inType`?

It's actually `integerType`, but the full name makes some code longer than 100 characters. I can change it if you think it is misleading.
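As a rough illustration of what this thread is about, a decimal-to-int cast with an overflow check might look like the sketch below. The name `castDecimalToIntExact` and the rounding mode are illustrative assumptions, not the actual Spark implementation:

```scala
// Hypothetical sketch: round a decimal, then range-check before narrowing to Int.
// `castDecimalToIntExact` is an illustrative name, not a real Spark method.
def castDecimalToIntExact(d: BigDecimal): Int = {
  val rounded = d.setScale(0, BigDecimal.RoundingMode.HALF_UP)
  if (rounded < BigDecimal(Int.MinValue) || rounded > BigDecimal(Int.MaxValue)) {
    throw new ArithmeticException(s"Casting $d to int causes overflow")
  }
  rounded.toInt
}
```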
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
case x: NumericType if failOnIntegerOverflow =>
  b =>
    val intValue = try {
      x.exactNumeric.asInstanceOf[Numeric[Any]].toInt(b)
Why do you cast it into int first?
The trait `Numeric` doesn't have the method `toInt`. Before this code change, the value was also cast to int.
Ah, I see.
Can't we check the valid value range in a single place instead of the current two checks on lines 520 and 525?
Well, we can do it by matching case by case, but then the code gets a bit long. Casting to short/byte should be a minor use case. Also, the previous code cast to `Int` before casting to `Short`.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
checkEvaluation(cast(Literal(value, TimestampType), LongType),
  Math.floorDiv(value, MICROS_PER_SECOND))
}
checkEvaluation(cast(9223372036854775807.9f, LongType), 9223372036854775807L)
How about doing boundary tests like this?
checkEvaluation(cast(java.lang.Math.nextDown(9223372036854775807.9f), LongType), 9223372036854775807L) --> non-overflow case
checkEvaluation(cast(java.lang.Math.nextUp(9223372036854775807.9f), LongType), 9223372036854775807L) --> overflow case
Why would we do that?
scala> java.lang.Math.nextDown(9223372036854775807.9D) < 9223372036854775807.9D
res23: Boolean = true
Ah, it's OK to do it like this instead:
checkEvaluation(cast(9223372036854775807.9f, LongType), 9223372036854775807L) --> non-overflow case
checkEvaluation(cast(java.lang.Math.nextUp(9223372036854775807.9f), LongType), 9223372036854775807L) --> overflow case
What I'm a little worried about is that `9223372036854775807.9f` is implicitly truncated (to `9223372036854776000.0f`?) by the compiler because it cannot be represented in the IEEE 754 float format, as you said before. So, IIUC, the test is actually the same as `cast(9223372036854776000.0f, LongType)`?
What I understand is as follows (sorted by value, descending); is this correct?
IEEE 754 consecutive float values:
overflow case:     9223373136366404000.0f <-- Math.nextUp(9223372036854775807.9f)
non-overflow case: 9223372036854776000.0f <-- 9223372036854775807.9f
non-overflow case: 9223371487098961900.0f <-- Math.nextDown(9223372036854775807.9f)
@maropu Yes, I think you are right
Thanks for the check!
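The float analysis in this thread can be verified directly on the JVM. The snippet below is just a verification sketch: it shows that the literal collapses to 2^63, because `Float` has only 24 bits of significand:

```scala
// Float cannot represent 9223372036854775807.9 exactly; the literal rounds to 2^63.
val f = 9223372036854775807.9f
// 2^63 as a float, for comparison
val twoTo63 = math.pow(2, 63).toFloat
println(f == twoTo63)                  // the literal and 2^63 are the same float
println(java.lang.Math.nextUp(f) > f)  // the next representable float is strictly larger
```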
Test build #109297 has finished for PR 25461 at commit
retest this please.
Test build #109300 has finished for PR 25461 at commit
test("Cast to byte with option FAIL_ON_INTEGER_OVERFLOW enabled") {
  withSQLConf(SQLConf.FAIL_ON_INTEGER_OVERFLOW.key -> "true") {
    testIntMaxAndMin(ByteType)
Why do we need to test casting `int.max + 1` to byte? I think it's good enough to test casting `byte.max + 1` to byte.
I think it is always good to have more test cases here, as long as it doesn't increase the testing time by more than a few seconds.
For example, if casting double to byte were implemented as:
val x = doubleValue.toShort
if (x.toByte == x) {
  x.toByte
} else {
  throw new ...
}
we could find that it is wrong with this test case, because `(Int.MaxValue + 1.0).toShort.toByte == (Int.MaxValue + 1.0).toShort` is true.
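The claim above is easy to verify on the JVM (a quick standalone check, not Spark code): double-to-int conversion saturates at `Int.MaxValue`, and the subsequent truncation to `Short` yields -1, which round-trips through `Byte` unchanged:

```scala
// (Int.MaxValue + 1.0).toShort goes double -> int (saturating to Int.MaxValue)
// -> short (truncating to -1), so the buggy byte-range check would pass.
val v = Int.MaxValue + 1.0   // 2147483648.0
val s = v.toShort            // -1 after saturation and truncation
println(s.toByte == s)       // true: the flawed check cannot catch this input
```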
test("Cast to short with option FAIL_ON_INTEGER_OVERFLOW enabled") {
  withSQLConf(SQLConf.FAIL_ON_INTEGER_OVERFLOW.key -> "true") {
    testIntMaxAndMin(ShortType)
ditto
}
checkEvaluation(cast(9223372036854775807.9f, LongType), 9223372036854775807L)
checkEvaluation(cast(-9223372036854775808.9f, LongType), -9223372036854775808L)
checkEvaluation(cast(9223372036854775807.9D, LongType), 9223372036854775807L)
How about `checkEvaluation(cast(0.9D + Long.MaxValue, LongType), Long.MaxValue)`?
LGTM except a few minor comments
@@ -474,8 +477,12 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String
    buildCast[Boolean](_, b => if (b) 1 else 0)
  case DateType =>
    buildCast[Int](_, d => null)
  case TimestampType if failOnIntegerOverflow =>
do we really need this? AFAIK a timestamp cannot overflow, can it?
It is possible in theory.
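A sketch of why this is possible in theory: Spark stores timestamps as microseconds in a `Long`, so the seconds value obtained when casting to an integral type can exceed the `Int` range. The concrete timestamp below is just an illustration:

```scala
// A timestamp far enough in the future overflows Int when expressed in seconds.
val micros = 3250368000000000L                 // microseconds since epoch (~year 2073)
val seconds = Math.floorDiv(micros, 1000000L)  // 3250368000 seconds
println(seconds > Int.MaxValue)                // true: an Int cast would overflow
```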
@@ -1182,6 +1233,78 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String
    (c, evPrim, evNull) => code"$evPrim = $c != 0;"
  }

  private[this] def castTimestampToIntegerCode(
- private[this] def castTimestampToIntegerCode(
+ private[this] def castTimestampToIntegralCode(
`Integral` is an adjective; "Integral code" seems weird to me. We can call it `castTimestampToIntegralTypeCode` if you insist. I didn't use that because the name is a bit long.
Well, `Integer` is a specific data type, so I think this name is misleading... your suggested one is fine to me.
(c, evPrim, evNull) => code"$evPrim = $c.to${integralType.capitalize}($failOnIntegerOverflow);"
}

private[this] def castIntegerToIntegerExactCode(integralType: String): CastFunction = {
- private[this] def castIntegerToIntegerExactCode(integralType: String): CastFunction = {
+ private[this] def castIntegerToIntegralExactCode(integralType: String): CastFunction = {
(Oh, I didn't know about the ```suggestion functionality, cool.)
(min.toString + typeIndicator, max.toString + typeIndicator)
}

private[this] def castFractionToIntegerExactCode(
- private[this] def castFractionToIntegerExactCode(
+ private[this] def castFractionToIntegralExactCode(
 * @throws ArithmeticException if checkOverflow is true and
 *         the decimal is too big to fit in Long type.
 */
def toLong(checkOverflow: Boolean): Long = {
There is a lot of duplicated code doing this, and an additional function call that could be avoided. Can't we just add a boolean with a default value to the existing functions?
Then all the other calls to `toLong` would at least need parentheses, as `toLong()`, since Scala allows only arity-0 methods (https://docs.scala-lang.org/style/method-invocation.html#arity-0) to omit parentheses.
My two concerns here:
- The existing external code calling `Decimal.toLong` will fail
- The usage will be different from the trait `Numeric`
BTW, renaming it to `toLongExact` won't be accurate either. For example, `toLongExact(1.1)` should fail, while we are actually doing rounding in `toLong`.
For 2: anyway, the usage is already different, and here we're not in a `Numeric`-like class. On 1, I am not sure it is a problem. `Decimal` is `Unstable` and this patch will go into Spark 3.0, a major release (the best place for a breaking change!). And the benefit of avoiding extra function calls and a lot of duplicated code is worth the change, IMHO.
cc @cloud-fan @maropu for their opinion on this too, but I feel quite strongly about this.
Actually, I am not super comfortable with the code changes in Decimal.scala here. I did this to address the comments in #25461 (comment).
How about just adding a new method `roundToLong`? The name is the same as
https://github.com/google/guava/blob/master/guava/src/com/google/common/math/DoubleMath.java#L156, so we can leave `toLong` as it is.
We can't break public classes just to make it easier to write code in Spark.
I think we can remove the additional function call via codegen. We can remove `def toLong(checkOverflow: Boolean): Long`, and in the codegen:
if (nullOnOverFlow) code"decimal.toLong" else code"decimal.roundToLong"
+1 for @cloud-fan 's suggestion. At least we remove the extra method call.
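The suggestion above can be sketched as a small helper. This is an illustrative standalone function with hypothetical names; the real Spark codegen would use its `code` interpolator rather than plain strings:

```scala
// Pick the Decimal method name at code-generation time, so the generated Java
// calls toLong or roundToLong directly, with no extra wrapper method.
def decimalToLongCode(decimalTerm: String, nullOnOverflow: Boolean): String = {
  val method = if (nullOnOverflow) "toLong" else "roundToLong"
  s"$decimalTerm.$method()"
}
```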
import scala.math.Ordering

import org.apache.spark.sql.types.Decimal.DecimalIsConflicted
nit: I am wondering about moving it here...
Do you mean moving `DecimalIsConflicted` into `numerics.scala`? I think it is fine to keep it in `Decimal.scala` for now.
Yes, I meant that. Not a big deal, just to have everything colocated. We can also do it in another PR.
Test build #109596 has finished for PR 25461 at commit
Test build #109601 has finished for PR 25461 at commit
buildConf("spark.sql.arithmeticOperations.failOnOverFlow")
  .doc("If it is set to true, all arithmetic operations on non-decimal fields throw an " +
val FAIL_ON_INTEGER_OVERFLOW =
  buildConf("spark.sql.failOnIntegerOverFlow")
`failOnIntegerOverFlow` -> `failOnIntegralTypeOverFlow`? To me, `Integer` is a bit ambiguous.
@@ -258,6 +258,7 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String

  private lazy val dateFormatter = DateFormatter()
  private lazy val timestampFormatter = TimestampFormatter.getFractionFormatter(zoneId)
  private val failOnIntegerOverflow = SQLConf.get.failOnIntegralTypeOverflow
To be consistent, shall we also rename it to `failOnIntegralTypeOverflow`?
Looks nice, and I have no comments now except for the remaining open ones.
Test build #109617 has finished for PR 25461 at commit
Test build #109628 has finished for PR 25461 at commit
Test build #109635 has finished for PR 25461 at commit
Test build #109636 has finished for PR 25461 at commit
Thanks, merging to master!
@cloud-fan @maropu @mgaido91 Thanks for the review!
What changes were proposed in this pull request?
To follow ANSI SQL, we should support a configurable mode that throws exceptions when casting to integers causes overflow.
The behavior is similar to https://issues.apache.org/jira/browse/SPARK-26218, which throws exceptions on arithmetic operation overflow.
To unify them, the configuration is renamed from `spark.sql.arithmeticOperations.failOnOverFlow` to `spark.sql.failOnIntegerOverFlow`.
How was this patch tested?
Unit test
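A hedged usage sketch of the mode this PR adds, assuming a running `SparkSession` named `spark`. The config key is the one named in the description above and was further discussed during review, so it may differ in later releases:

```scala
// Enable the failure mode, then trigger an overflowing cast.
spark.conf.set("spark.sql.failOnIntegerOverFlow", "true")
// With the flag on, this cast is expected to throw ArithmeticException
// instead of silently wrapping or truncating the value.
spark.sql("SELECT CAST(12345678901234567890D AS INT)").collect()
```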