Skip to content

[SPARK-9673][SQL] Sample standard deviation aggregation function #8058

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 18 commits into from

Conversation

brkyvz
Copy link
Contributor

@brkyvz brkyvz commented Aug 9, 2015

This PR adds the sample standard deviation as a udf, and a grouped aggregate function for SQL. It now works for TungstenAggregation with codegen, but not for SortBasedAggregation. I have some printlns in the code for debugging purposes and need help from @yhuai @marmbrus and @rxin for this to work for both methods...

cc @mengxr

@SparkQA
Copy link

SparkQA commented Aug 9, 2015

Test build #40261 has finished for PR 8058 at commit 27ae625.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StandardDeviation(child: Expression) extends AlgebraicAggregate

// The list of summary statistics to compute, in the form of expressions.
val statistics = List[(String, Expression => Expression)](
"count" -> Count,
"mean" -> Average,
"stddev" -> stddevExpr,
"stddev" -> aggregate.Utils.standardDeviation,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to call it stddev_samp because other databases have both stddev_samp and stddev_pop (population standard deviation).

@yhuai
Copy link
Contributor

yhuai commented Aug 9, 2015

I will take a look at the failed test case when SortBasedAggregation is used.

@brkyvz
Copy link
Contributor Author

brkyvz commented Aug 9, 2015

My diagnosis is that count is updated for the first updateExpression. Then for calculating average, the updatedCount expression is used there with count + 1, therefore the average and moment calculations get messed up.

@yhuai
Copy link
Contributor

yhuai commented Aug 9, 2015

I think the main problem is the way that MutableProjection is implemented. We update the mutable row while we evaluating those expressions. Once we update a value (let's say currentAvg), there is no way to get the previous value back.

/* currentCount = */ updatedCount,
/* currentAvg = */ If(IsNull(child), currentAvg, updatedAvg),
/* currentMk = */ If(IsNull(child),
currentMk, Add(currentMk, deltaX * Subtract(currentValue, updatedAvg)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At here, deltaX means currentValue - previousAvg, right? If so, because we have already updated currentAvg, deltaX means currentValue - updatedAvg.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, maybe we can add a deltaX field in the buffer to let you store the value of currentValue - previousAvg to workaround the problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will take a look at mutable projection and try to move field update part after evaluating expressions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That unfortunately doesn't totally solve the problem for SortBasedAggregation, and it corrupts the result for TungstenAggregation :( Because TungstenAggregation was happy with the way things were.

@brkyvz brkyvz changed the title [WIP][SPARK-9673][SQL] Sample standard deviation aggregation function [SPARK-9673][SQL] Sample standard deviation aggregation function Aug 10, 2015
@brkyvz
Copy link
Contributor Author

brkyvz commented Aug 10, 2015

@rxin @yhuai this is ready for review! Thanks for all the help!

@SparkQA
Copy link

SparkQA commented Aug 10, 2015

Test build #40309 has finished for PR 8058 at commit 1175ace.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 10, 2015

Test build #40310 has finished for PR 8058 at commit 941bb9e.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StandardDeviation(child: Expression) extends AlgebraicAggregate

@@ -114,6 +116,8 @@ object QueryTest {
Row.fromSeq(s.toSeq.map {
case d: java.math.BigDecimal => BigDecimal(d)
case b: Array[Byte] => b.toSeq
case d: Double if !d.isNaN && !d.isInfinity =>
BigDecimal(d).setScale(10, BigDecimal.RoundingMode.HALF_UP)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of changing how we compare double values, how about we change our tests by casting results to the decimal type?

@SparkQA
Copy link

SparkQA commented Aug 10, 2015

Test build #40329 has finished for PR 8058 at commit 34b22e8.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StandardDeviation(child: Expression, sample: Boolean) extends AlgebraicAggregate

@yhuai
Copy link
Contributor

yhuai commented Aug 11, 2015

@brkyvz brkyvz#4

First resolve stddev functions to Hive's GenericUDAF and then replace them to our native functions.
@SparkQA
Copy link

SparkQA commented Aug 11, 2015

Test build #40391 has finished for PR 8058 at commit dd653a4.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StandardDeviation(child: Expression, sample: Boolean) extends AlgebraicAggregate

@brkyvz
Copy link
Contributor Author

brkyvz commented Aug 11, 2015

retest this please

@SparkQA
Copy link

SparkQA commented Aug 11, 2015

Test build #40409 has finished for PR 8058 at commit 8994605.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StandardDeviation(child: Expression, sample: Boolean) extends AlgebraicAggregate

@SparkQA
Copy link

SparkQA commented Aug 11, 2015

Test build #40416 has finished for PR 8058 at commit 4a83f75.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StandardDeviation(child: Expression, sample: Boolean) extends AlgebraicAggregate

@brkyvz
Copy link
Contributor Author

brkyvz commented Aug 11, 2015

retest this please

1 similar comment
@brkyvz
Copy link
Contributor Author

brkyvz commented Aug 11, 2015

retest this please

@SparkQA
Copy link

SparkQA commented Aug 11, 2015

Test build #1450 has finished for PR 8058 at commit 3e8c462.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StandardDeviation(child: Expression, sample: Boolean) extends AlgebraicAggregate

@SparkQA
Copy link

SparkQA commented Aug 11, 2015

Test build #1451 has finished for PR 8058 at commit 48fa619.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StandardDeviation(child: Expression, sample: Boolean) extends AlgebraicAggregate

@mengxr
Copy link
Contributor

mengxr commented Aug 11, 2015

Do we want to support population variance? I don't think it is necessary to make two methods. R supports only sample variance, which is sufficient. It would be simpler if we implement sample variance first and then wrap stddev as its square root.

@brkyvz
Copy link
Contributor Author

brkyvz commented Aug 11, 2015

Since most database systems do, I think we have to support it as well, since it's pretty simple to go from one or the other on our side

@SparkQA
Copy link

SparkQA commented Aug 11, 2015

Test build #40486 has finished for PR 8058 at commit 29cc149.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 11, 2015

Test build #40490 has finished for PR 8058 at commit a170f43.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StandardDeviation(child: Expression, sample: Boolean) extends AlgebraicAggregate

@davies
Copy link
Contributor

davies commented Sep 29, 2015

@brkyvz Since #6297 is merged, do you mind to close this PR? thanks!

@brkyvz brkyvz closed this Sep 29, 2015
@brkyvz brkyvz deleted the sdev-udaf branch February 3, 2019 20:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants