[SPARK-9673][SQL] Sample standard deviation aggregation function #8058

brkyvz · 2015-08-09T06:14:00Z

This PR adds the sample standard deviation as a UDF and as a grouped aggregate function for SQL. It currently works for TungstenAggregation with codegen, but not for SortBasedAggregation. I have left some printlns in the code for debugging and need help from @yhuai, @marmbrus, and @rxin to get it working for both aggregation methods.

cc @mengxr

SparkQA · 2015-08-09T06:25:01Z

Test build #40261 has finished for PR 8058 at commit 27ae625.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class StandardDeviation(child: Expression) extends AlgebraicAggregate

yhuai · 2015-08-09T19:04:29Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala

    // The list of summary statistics to compute, in the form of expressions.
    val statistics = List[(String, Expression => Expression)](
      "count" -> Count,
      "mean" -> Average,
-      "stddev" -> stddevExpr,
+      "stddev" -> aggregate.Utils.standardDeviation,


I think it is better to call it stddev_samp because other databases have both stddev_samp and stddev_pop (population standard deviation).

yhuai · 2015-08-09T19:05:04Z

I will take a look at the failed test case when SortBasedAggregation is used.

brkyvz · 2015-08-09T19:09:35Z

My diagnosis is that count is updated by the first update expression. The average calculation then uses the updatedCount expression (count + 1) on top of the already-updated count, so the average and moment calculations come out wrong.

yhuai · 2015-08-09T20:49:27Z

I think the main problem is the way MutableProjection is implemented. We update the mutable row while we are evaluating those expressions. Once we update a value (say, currentAvg), there is no way to get the previous value back.
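The hazard described above can be reproduced with a toy model. This is a minimal sketch in plain Scala, not Spark's MutableProjection; the `ProjectionSketch` object, the `Expr` type, and the two-field buffer layout are all hypothetical names chosen for illustration.

```scala
object ProjectionSketch {
  // Each "expression" computes one new buffer field from the buffer and the input value.
  type Expr = (Array[Double], Double) => Double

  // Hypothetical buffer layout: index 0 = count, index 1 = avg.
  val updateExprs: Seq[Expr] = Seq(
    (buf, _) => buf(0) + 1,                           // new count
    (buf, x) => buf(1) + (x - buf(1)) / (buf(0) + 1)  // new avg; expects the OLD count
  )

  // Writes each result as soon as it is computed, so later
  // expressions read fields that were already overwritten.
  def updateInPlace(buf: Array[Double], x: Double): Unit =
    updateExprs.zipWithIndex.foreach { case (e, i) => buf(i) = e(buf, x) }

  // Evaluates every expression against the old buffer first, then writes.
  def evaluateThenWrite(buf: Array[Double], x: Double): Unit = {
    val results = updateExprs.map(e => e(buf, x))
    results.zipWithIndex.foreach { case (v, i) => buf(i) = v }
  }
}
```

Feeding 1.0, 2.0, 3.0 into a zeroed buffer, `evaluateThenWrite` ends with an average of 2.0, while `updateInPlace` ends with 1.5 because the average expression sees the count after it has already been incremented.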

yhuai · 2015-08-09T20:51:14Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala

+      /* currentCount = */ updatedCount,
+      /* currentAvg = */ If(IsNull(child), currentAvg, updatedAvg),
+      /* currentMk = */ If(IsNull(child),
+        currentMk, Add(currentMk, deltaX * Subtract(currentValue, updatedAvg)))


Here, deltaX means currentValue - previousAvg, right? If so, because we have already updated currentAvg, deltaX actually evaluates to currentValue - updatedAvg.

For now, maybe we can add a deltaX field to the buffer so that you can store the value of currentValue - previousAvg and work around the problem.

I will take a look at mutable projection and try to move the field updates after the expression evaluation.

That unfortunately doesn't totally solve the problem for SortBasedAggregation, and it corrupts the result for TungstenAggregation :( Because TungstenAggregation was happy with the way things were.
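For reference, the update in the diff has the shape of Welford's online algorithm, where the delta must be taken against the previous average. The following is a rough sketch in plain Scala rather than Catalyst expressions; `WelfordState` is a hypothetical name, its fields mirror currentCount, currentAvg, and currentMk from the diff, and the null handling from the diff is not modeled.

```scala
// Running (count, avg, mk) state for Welford's online algorithm.
final case class WelfordState(count: Long = 0L, avg: Double = 0.0, mk: Double = 0.0) {
  // Adds one value. deltaX is computed from the PREVIOUS average,
  // before avg is replaced by the updated one.
  def update(x: Double): WelfordState = {
    val newCount = count + 1
    val deltaX = x - avg                    // currentValue - previousAvg
    val newAvg = avg + deltaX / newCount
    val newMk = mk + deltaX * (x - newAvg)  // (x - previousAvg) * (x - updatedAvg)
    WelfordState(newCount, newAvg, newMk)
  }

  // Sample standard deviation; NaN when fewer than two values were seen.
  def sampleStddev: Double = math.sqrt(mk / (count - 1))
}
```

Because each step returns a new immutable state, the previous average stays available while the new fields are computed, which is the property the in-place buffer update loses.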

brkyvz · 2015-08-10T19:11:50Z

@rxin @yhuai this is ready for review! Thanks for all the help!

SparkQA · 2015-08-10T19:31:32Z

Test build #40309 has finished for PR 8058 at commit 1175ace.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-08-10T19:44:57Z

Test build #40310 has finished for PR 8058 at commit 941bb9e.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class StandardDeviation(child: Expression) extends AlgebraicAggregate

yhuai · 2015-08-10T19:51:13Z

sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala

@@ -114,6 +116,8 @@ object QueryTest {
        Row.fromSeq(s.toSeq.map {
          case d: java.math.BigDecimal => BigDecimal(d)
          case b: Array[Byte] => b.toSeq
+          case d: Double if !d.isNaN && !d.isInfinity => 
+            BigDecimal(d).setScale(10, BigDecimal.RoundingMode.HALF_UP)


Instead of changing how we compare double values, how about we change our tests by casting results to the decimal type?
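To make the trade-off concrete, the normalization in the diff can be tried outside of QueryTest. This is a small sketch in plain Scala; the `DoubleCompare` object and `normalize` method are hypothetical names, and only the rounding step is copied from the diff.

```scala
object DoubleCompare {
  // Rounds a finite double to 10 decimal places so that tiny floating-point
  // differences compare equal; NaN, infinities, and other types pass through.
  def normalize(v: Any): Any = v match {
    case d: Double if !d.isNaN && !d.isInfinity =>
      BigDecimal(d).setScale(10, BigDecimal.RoundingMode.HALF_UP)
    case other => other
  }
}
```

With this, `normalize(0.1 + 0.2)` and `normalize(0.3)` compare equal even though the raw doubles do not. Casting the query result to a decimal type inside each test, as suggested above, would get a similar effect without changing how every double is compared.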

SparkQA · 2015-08-10T21:35:52Z

Test build #40329 has finished for PR 8058 at commit 34b22e8.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class StandardDeviation(child: Expression, sample: Boolean) extends AlgebraicAggregate


yhuai · 2015-08-11T03:30:31Z

@brkyvz brkyvz#4

First resolve stddev functions to Hive's GenericUDAF and then replace them to our native functions.

SparkQA · 2015-08-11T05:05:35Z

Test build #40391 has finished for PR 8058 at commit dd653a4.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class StandardDeviation(child: Expression, sample: Boolean) extends AlgebraicAggregate

brkyvz · 2015-08-11T05:56:14Z

retest this please

SparkQA · 2015-08-11T06:02:45Z

Test build #40409 has finished for PR 8058 at commit 8994605.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class StandardDeviation(child: Expression, sample: Boolean) extends AlgebraicAggregate

SparkQA · 2015-08-11T06:37:04Z

Test build #40416 has finished for PR 8058 at commit 4a83f75.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class StandardDeviation(child: Expression, sample: Boolean) extends AlgebraicAggregate

brkyvz · 2015-08-11T07:01:01Z

retest this please

brkyvz · 2015-08-11T07:10:09Z

retest this please

SparkQA · 2015-08-11T14:41:36Z

Test build #1450 has finished for PR 8058 at commit 3e8c462.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class StandardDeviation(child: Expression, sample: Boolean) extends AlgebraicAggregate

SparkQA · 2015-08-11T17:02:35Z

Test build #1451 has finished for PR 8058 at commit 48fa619.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class StandardDeviation(child: Expression, sample: Boolean) extends AlgebraicAggregate

mengxr · 2015-08-11T17:32:46Z

Do we want to support population variance? I don't think it is necessary to make two methods. R supports only sample variance, which is sufficient. It would be simpler if we implement sample variance first and then wrap stddev as its square root.

brkyvz · 2015-08-11T17:36:10Z

Since most database systems support it, I think we have to support it as well, especially since it's pretty simple to go from one to the other on our side.
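The two points in this exchange (implement sample variance first, and derive the population variant from it) can be sketched as follows. This is an illustrative two-pass version in plain Scala, not the aggregate function from this PR; the `VarianceFirst` object and its method names are assumptions modeled on the stddev_samp / stddev_pop naming mentioned earlier in the thread.

```scala
object VarianceFirst {
  // Sample variance: sum of squared deviations divided by n - 1.
  // Needs at least two values; otherwise the result is NaN.
  def varSamp(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.size
    xs.map(x => (x - mean) * (x - mean)).sum / (xs.size - 1)
  }

  // Population variance derived from the sample variance: scale by (n - 1) / n.
  def varPop(xs: Seq[Double]): Double =
    varSamp(xs) * (xs.size - 1) / xs.size

  // Both standard deviations are thin wrappers around the variances.
  def stddevSamp(xs: Seq[Double]): Double = math.sqrt(varSamp(xs))
  def stddevPop(xs: Seq[Double]): Double = math.sqrt(varPop(xs))
}
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the sum of squared deviations is 32, so the sample variance is 32/7 and the population variance is 4.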

SparkQA · 2015-08-11T18:16:01Z

Test build #40486 has finished for PR 8058 at commit 29cc149.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-08-11T20:21:52Z

Test build #40490 has finished for PR 8058 at commit a170f43.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class StandardDeviation(child: Expression, sample: Boolean) extends AlgebraicAggregate

davies · 2015-09-29T20:33:09Z

@brkyvz Since #6297 is merged, would you mind closing this PR? Thanks!

unbiased standard deviation aggregation function

27ae625

yhuai reviewed Aug 9, 2015

brkyvz added 4 commits August 10, 2015 10:11

save changes

0a6d2c0

Merge branch 'master' of github.com:apache/spark into sdev-udaf

315a271

fixed test

3c11734

remove unnecessary import

1175ace

brkyvz changed the title ~~[WIP][SPARK-9673][SQL] Sample standard deviation aggregation function~~ [SPARK-9673][SQL] Sample standard deviation aggregation function Aug 10, 2015

remove nan and infinity for checkAnswer

941bb9e

yhuai reviewed Aug 10, 2015

addressed comments

34b22e8

yhuai added 2 commits August 10, 2015 20:14

First resolve stddev to Hive's UDAF and replace it to our native impl…

9e6ac9d

…ementation based on AggregateFunction2 if possible.

Also update the simpleString of aggregate operators.

9f24d5e

Merge pull request #4 from yhuai/sdev-udaf

dd653a4

First resolve stddev functions to Hive's GenericUDAF and then replace them to our native functions.

brkyvz added 2 commits August 10, 2015 22:11

delete spaces

91dc106

change to defs

8994605

tried to fix scalastyle

4a83f75

added space

3e8c462

locally scalastyle passes

48fa619

yhuai and others added 2 commits August 11, 2015 11:03

Make describe work in SQLContext.

f221d2a

Merge pull request #5 from yhuai/sdev-udaf

29cc149

Make describe work in SQLContext.

fix long line

a170f43

brkyvz closed this Sep 29, 2015

brkyvz deleted the sdev-udaf branch February 3, 2019 20:55
