-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-9673][SQL] Sample standard deviation aggregation function #8058
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #40261 has finished for PR 8058 at commit
|
// The list of summary statistics to compute, in the form of expressions. | ||
val statistics = List[(String, Expression => Expression)]( | ||
"count" -> Count, | ||
"mean" -> Average, | ||
"stddev" -> stddevExpr, | ||
"stddev" -> aggregate.Utils.standardDeviation, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better to call it stddev_samp
because other databases have both stddev_samp
and stddev_pop
(population standard deviation).
I will take a look at the failed test case when SortBasedAggregation is used. |
My diagnosis is that count is updated for the first updateExpression. Then for calculating average, the updatedCount expression is used there with count + 1, therefore the average and moment calculations get messed up. |
I think the main problem is the way that |
/* currentCount = */ updatedCount, | ||
/* currentAvg = */ If(IsNull(child), currentAvg, updatedAvg), | ||
/* currentMk = */ If(IsNull(child), | ||
currentMk, Add(currentMk, deltaX * Subtract(currentValue, updatedAvg))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At here, deltaX
means currentValue - previousAvg
, right? If so, because we have already updated currentAvg
, deltaX
means currentValue - updatedAvg
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, maybe we can add a deltaX
field in the buffer to let you store the value of currentValue - previousAvg
to workaround the problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will take a look at mutable projection and try to move field update part after evaluating expressions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That unfortunately doesn't totally solve the problem for SortBasedAggregation, and it corrupts the result for TungstenAggregation :( Because TungstenAggregation was happy with the way things were.
Test build #40309 has finished for PR 8058 at commit
|
Test build #40310 has finished for PR 8058 at commit
|
@@ -114,6 +116,8 @@ object QueryTest { | |||
Row.fromSeq(s.toSeq.map { | |||
case d: java.math.BigDecimal => BigDecimal(d) | |||
case b: Array[Byte] => b.toSeq | |||
case d: Double if !d.isNaN && !d.isInfinity => | |||
BigDecimal(d).setScale(10, BigDecimal.RoundingMode.HALF_UP) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of changing how we compare double values, how about we change our tests by casting results to the decimal type?
Test build #40329 has finished for PR 8058 at commit
|
…ementation based on AggregateFunction2 if possible.
First resolve stddev functions to Hive's GenericUDAF and then replace them to our native functions.
Test build #40391 has finished for PR 8058 at commit
|
retest this please |
Test build #40409 has finished for PR 8058 at commit
|
Test build #40416 has finished for PR 8058 at commit
|
retest this please |
1 similar comment
retest this please |
Test build #1450 has finished for PR 8058 at commit
|
Test build #1451 has finished for PR 8058 at commit
|
Do we want to support population variance? I don't think it is necessary to make two methods. R supports only sample variance, which is sufficient. It would be simpler if we implement sample variance first and then wrap |
Since most database systems do, I think we have to support it as well, since it's pretty simple to go from one or the other on our side |
Make describe work in SQLContext.
Test build #40486 has finished for PR 8058 at commit
|
Test build #40490 has finished for PR 8058 at commit
|
This PR adds the sample standard deviation as a udf, and a grouped aggregate function for SQL. It now works for TungstenAggregation with codegen, but not for SortBasedAggregation. I have some
println
s in the code for debugging purposes and need help from @yhuai @marmbrus and @rxin for this to work for both methods...cc @mengxr