Conversation

@mickjermsurawong-stripe (Contributor) commented Jul 2, 2019

What changes were proposed in this pull request?

  • Currently sum in aggregates for decimal type can overflow and return null.

    • The Sum expression codegens arithmetic on sql.Decimal, and the output, which preserves scale and precision, goes into UnsafeRowWriter; there, an overflowing value is converted to null when written out.
    • It also does not go through this branch in DecimalAggregates, because that rule requires the precision of the sum (not of the elements being summed) to satisfy prec + 10 <= MAX_LONG_DIGITS:
      case ae @ AggregateExpression(af, _, _, _) => af match {
        case Sum(e @ DecimalType.Expression(prec, scale)) if prec + 10 <= MAX_LONG_DIGITS =>
          MakeDecimal(ae.copy(aggregateFunction = Sum(UnscaledValue(e))), prec + 10, scale)
  • This PR adds the check at the final result of the sum operator itself (a minimal repro sketch follows below).

    /**
     * An expression which returns the final value for this aggregate function. Its data type should
     * match this expression's [[dataType]].
     */
    val evaluateExpression: Expression
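
A minimal repro sketch of the overflow-to-null behavior described above, assuming a spark-shell session (`spark` and its implicits in scope; names and values are illustrative):

    import org.apache.spark.sql.functions.sum

    val big = "9" * 38  // 38 nines: the largest value DECIMAL(38, 0) can hold
    val df = spark.range(2).selectExpr(s"cast('$big' as decimal(38, 0)) as d")

    // The sum needs 39 digits, which exceeds the max precision of 38;
    // before this fix it silently came back as null.
    df.agg(sum(df("d"))).show()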

https://issues.apache.org/jira/browse/SPARK-28224

How was this patch tested?

  • Added an integration test in the DataFrame suite.

cc @mgaido91 @JoshRosen

@mgaido91 (Contributor) left a comment

The change makes sense to me. Could you please also check that SQL Server behaves in the same way, i.e. that it throws an exception in this specific case?

Contributor commented on the diff:

    Seq(true, false).foreach {

Contributor:

This should use resultType, I think.

Contributor Author:

dataType actually calls resultType, but the latter reads better for our purpose here. Will update accordingly.

Contributor:

Yes, sorry, I meant something like this:

    case _: DecimalType =>
      CheckOverflow(sum, resultType, SQLConf.get.decimalOperationsNullOnOverflow)

In this way we do not create a new DecimalType and I think it is more readable... Thanks.

Contributor Author:

Updated to use a match, because resultType is still a generic DataType and not a DecimalType:

  override lazy val evaluateExpression: Expression = resultType match {
    case d: DecimalType => CheckOverflow(sum, d, SQLConf.get.decimalOperationsNullOnOverflow)
    case _ => sum
  }
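
For readers following along: conceptually, CheckOverflow re-validates the final value against the declared precision and scale, and either nulls it out or throws, depending on the flag. A standalone sketch of that contract (the helper name, rounding mode, and Option-for-NULL encoding are illustrative, not Spark internals):

    import java.math.{BigDecimal => JBigDecimal, RoundingMode}

    // Illustrative stand-in for CheckOverflow's contract; None models SQL NULL.
    def checkOverflow(
        value: JBigDecimal,
        precision: Int,
        scale: Int,
        nullOnOverflow: Boolean): Option[JBigDecimal] = {
      val adjusted = value.setScale(scale, RoundingMode.HALF_UP)
      if (adjusted.precision <= precision) {
        Some(adjusted)
      } else if (nullOnOverflow) {
        None // the silent pre-fix behavior: overflow surfaces as null
      } else {
        throw new ArithmeticException(
          s"$value cannot be represented as Decimal($precision, $scale)")
      }
    }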

Contributor:

why do we need this?

Contributor Author:

This follows the convention for how the other datasets in this file are created.

Contributor:

Well, if others think it is OK to leave it, I am fine with that; honestly, I'd remove it since it is useless.

Contributor Author:

I'm fine either way and would defer to others, although I do see value in consistency here.

Member:

Yeah, I personally think we don't need these changes in this PR either.

@mgaido91 (Contributor) commented Jul 2, 2019

PS: Please add [SQL] to the PR title.

Honestly, I don't consider this a follow-up to #25010. This was actually a bug leading to weird behaviors, such as show printing an output while collect returns null. And it is independent from the PR you linked.

@mickjermsurawong-stripe changed the title from "[SPARK-28224] Aggregate sum large decimals" to "[SPARK-28224][SQL] Aggregate sum large decimals" on Jul 2, 2019
@dongjoon-hyun (Member):
ok to test

@mickjermsurawong-stripe (Contributor Author):
@mgaido91, thank you for the review! I made the updates.
To your original question on SQL Server behavior, there seem to be different outcomes among SQL engines. From the fiddles linked below:
MySQL 5.6 arbitrarily loses precision, rounding even the integral part. [0]
MS SQL Server 2017 raises a "Data truncation" error. [1]

Aiming for least-surprise behavior, do you think sum here should be treated the same as other arithmetic operations on fixed-precision decimals?

A larger question here is perhaps whether there are other operations on fixed precision that could result in overflow. And supposing we are okay with this PR's fix, does that mean we should patch all such operators? (A non-issue here is the pow operator, since it works on Double.)

[0] http://sqlfiddle.com/#!9/676c31/1
[1] http://sqlfiddle.com/#!18/0baa5/1
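
A sketch of how the two behaviors could be compared on the Spark side, reusing the `df` from the repro sketch above (the conf key is an assumption inferred from this PR's SQLConf.get.decimalOperationsNullOnOverflow reference; treat it as illustrative):

    // Assumed conf key mirroring SQLConf.decimalOperationsNullOnOverflow.
    spark.conf.set("spark.sql.decimalOperations.nullOnOverflow", "true")
    df.agg(sum(df("d"))).show()   // null result: the lenient behavior

    spark.conf.set("spark.sql.decimalOperations.nullOnOverflow", "false")
    df.agg(sum(df("d"))).show()   // throws on overflow, closer to SQL Server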

@SparkQA commented Jul 5, 2019

Test build #107252 has finished for PR 25033 at commit 01174be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91 (Contributor) commented Jul 5, 2019

Thanks for your work. Regarding SQL Server, I also checked its behaviour. The main difference from our current behaviour is that it always returns the max precision, instead of our formula 10 + precision. So I submitted a PR for that.

LGTM
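
For concreteness, the two result-type rules being compared, as a toy sketch (assuming Spark's 38-digit precision cap; function names are illustrative):

    // Spark: sum(DECIMAL(p, s)) -> DECIMAL(min(p + 10, 38), s)
    def sparkSumType(p: Int, s: Int): (Int, Int) = (math.min(p + 10, 38), s)

    // SQL Server: sum(DECIMAL(p, s)) -> DECIMAL(38, s)
    def sqlServerSumType(p: Int, s: Int): (Int, Int) = (38, s)

    println(sparkSumType(12, 2))     // (22,2)
    println(sqlServerSumType(12, 2)) // (38,2)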

@SparkQA commented Jul 5, 2019

Test build #107258 has finished for PR 25033 at commit b668d6f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mickjermsurawong-stripe (Contributor Author):
jenkins retest this please

@JoshRosen changed the title from "[SPARK-28224][SQL] Aggregate sum large decimals" to "[SPARK-28224][SQL] Check overflow in decimal Sum aggregate" on Jul 9, 2019
@JoshRosen (Contributor):
jenkins retest this please

@SparkQA commented Jul 10, 2019

Test build #107419 has finished for PR 25033 at commit 4094335.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mickjermsurawong-stripe (Contributor Author):
Hi @dongjoon-hyun, @cloud-fan. I'd just like to follow up on this PR, please.

Member commented on the diff:

      }
    }

    test("Aggregate sum integers") {

Do we need the two tests below for this change?

Contributor Author:

@maropu the change introduces branching for the BigDecimal case, so I want to verify that sum still works for the integer/double cases. A sketch of the kind of check I mean is below.
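
(A sketch of such regression checks, assuming the suite's usual checkAnswer helper, implicits, and functions imports; names and values are illustrative:)

    test("Aggregate sum integers") {
      // sum over non-decimal types must keep working and keep its widened type
      checkAnswer(Seq(1, 2, 3).toDF("i").agg(sum($"i")), Row(6L))
    }

    test("Aggregate sum doubles") {
      checkAnswer(Seq(1.5, 2.5).toDF("d").agg(sum($"d")), Row(4.0))
    }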

@maropu (Member) commented Aug 17, 2019:

But the int/double case is an existing path in master, so we already have tests for it somewhere, right? I personally think the test above is enough for this PR.

@SparkQA commented Aug 17, 2019

Test build #109280 has finished for PR 25033 at commit 67a956e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) left a comment

LGTM except for one minor comment.

@SparkQA commented Aug 19, 2019

Test build #109312 has finished for PR 25033 at commit 4497576.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor commented on the diff:

      df
    }

    protected lazy val largeDecimals: DataFrame = {

SQLTestData should only contain data that is widely used by many tests. Let's move largeDecimals into its test case (sketched below).
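
(A sketch of that move, with illustrative values and names, not the PR's actual diff:)

    test("SPARK-28224: Aggregate sum big decimal overflows") {
      // data inlined in the test itself instead of SQLTestData
      val largeDecimals = Seq(
        BigDecimal("1" * 20 + ".123"),
        BigDecimal("9" * 20 + ".123")
      ).toDF("d")
      // with nullOnOverflow enabled, the overflowed sum is expected to be null
      checkAnswer(largeDecimals.select(sum($"d")), Row(null))
    }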

@SparkQA commented Aug 19, 2019

Test build #109353 has finished for PR 25033 at commit 49c97d8.

  • This patch fails to build.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 19, 2019

Test build #109356 has finished for PR 25033 at commit db2d9a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu closed this in b79cf0d on Aug 20, 2019
@maropu (Member) commented Aug 20, 2019

Thanks! Merged to master.
