[SPARKR-92] Phase 1: implement mean(rdd), stdev(rdd), variance(rdd) #241

hqzizania · 2015-04-03T12:32:41Z

No description provided.

…(rdd)

concretevitamin · 2015-04-03T23:51:29Z

I think the current approach is a little bit hard to maintain. Should we use some lightweight OOP for StatCounter? That way mergeStats() would be easier to understand and easily extensible (we may also need sampleStdev() and sampleVariance() etc.).

We could move the design discussion to JIRA.

hqzizania · 2015-04-04T12:32:29Z

The functions "sampleStdev" and "sampleVariance" could also be implemented in only a few lines of code by reusing mergeStats() from this patch. Moreover, the "histogram" function may require a lot of additional code that is not part of StatCounter, as in pyspark. Thus, I don't think an OOP design for StatCounter is really necessary at this stage.

hafen · 2015-04-05T04:35:31Z

If you want a more generic mergeStats() under the hood, you could follow the approach outlined in this paper: http://janinebennett.org/index_files/ParallelStatisticsAlgorithms.pdf, which provides a numerically stable algorithm for calculating moments of any order. You could expose sampleMean(), sampleVariance(), even sampleKurtosis(), etc. to the user and have them all call the same common mergeStats().
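For reference, the second-order case of such a pairwise merge could be sketched in R roughly as follows. This is only an illustration under assumptions: the names `newStats` and `mergeMoments` and the `count`/`mean`/`m2` fields are hypothetical and are not the structure used in this patch, and the higher-order moments covered by the paper are left out.

```r
# Hypothetical helper: summarise one partition as count, mean, and
# M2 (the sum of squared deviations from the mean).
newStats <- function(x) {
  m <- if (length(x) > 0) mean(x) else 0
  list(count = length(x), mean = m, m2 = sum((x - m)^2))
}

# Hypothetical helper: combine two partial summaries without
# revisiting the data, using the pairwise update for mean and M2.
mergeMoments <- function(a, b) {
  if (a$count == 0) return(b)
  if (b$count == 0) return(a)
  n <- a$count + b$count
  delta <- b$mean - a$mean
  list(count = n,
       mean = a$mean + delta * b$count / n,
       m2 = a$m2 + b$m2 + delta^2 * a$count * b$count / n)
}

# Merge per-partition summaries, then derive the statistics.
parts <- list(c(1, 2, 3), c(4, 5), c(6, 7, 8, 9))
merged <- Reduce(mergeMoments, lapply(parts, newStats))
merged$m2 / merged$count        # population variance
merged$m2 / (merged$count - 1)  # sample variance
```

With this shape, the population and sample variants differ only in the final divisor, so both can share one merge step.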

shivaram · 2015-04-06T17:04:54Z

pkg/R/utils.R

@@ -419,3 +419,21 @@ cleanClosure <- function(func, checkedFuncs = new.env()) {
  }
  func
 }
+
+# Merge another StatCounter into this one, adding up the internal statistics.


We could use some comments here as to what structure x and y have (i.e. what is x[1] and x[2]). Also a light-weight method to get OOP is to use named lists like list(sum=x1, sumsq=x2) etc.

+1 on named list
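A minimal sketch of the named-list idea, assuming the counter tracks a count, a sum, and a sum of squares. The field names and the functions `statCounter` and `mergeCounters` are illustrative only and are not taken from the patch:

```r
# Hypothetical constructor: a named list instead of positional x[1], x[2].
statCounter <- function(x = numeric(0)) {
  list(count = length(x), sum = sum(x), sumsq = sum(x^2))
}

# Merging adds the fields by name, so the intent is readable.
mergeCounters <- function(a, b) {
  list(count = a$count + b$count,
       sum = a$sum + b$sum,
       sumsq = a$sumsq + b$sumsq)
}

merged <- mergeCounters(statCounter(c(1, 2, 3)), statCounter(c(4, 5)))
merged$sum / merged$count  # mean of the combined data
```

Because fields are looked up by name, new statistics can be added later without renumbering existing positions.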

shivaram · 2015-04-10T02:09:19Z

@hqzizania Would it be possible to re-open this PR against the main Spark repo now? I did plan to do an automatic move for already open PRs, but I feel we have some more things to discuss for this one (like @davies's suggestion of using DataFrames). I've already created a JIRA on the Spark side at
https://issues.apache.org/jira/browse/SPARK-6841

Let me know if this sounds good.

hqzizania · 2015-04-10T04:39:43Z

@shivaram I've opened it against the main Spark repo. However, the fixes, improvements and replies to suggestions may be delayed because I don't have time at the moment.

shivaram · 2015-04-10T04:43:08Z

@hqzizania - Thanks for the heads up. We can first discuss the design choices and then come to the implementation. I'm going to close the PR here (we can backport it later, though I guess it's not required as this is a new feature rather than a bug fix).

Moving here from amplab-extras/SparkR-pkg#241

sum() has been implemented. (amplab-extras/SparkR-pkg#242)

Now Phase 1: mean, sd, var have been implemented, but some things still need to be improved with the suggestions in https://issues.apache.org/jira/browse/SPARK-6841

Author: qhuang <qian.huang@intel.com>

Closes #5446 from hqzizania/R and squashes the following commits:

f283572 [qhuang] add test unit for describe()
2e74d5a [qhuang] add describe() DataFrame API

(cherry picked from commit a466944)
Signed-off-by: Reynold Xin <rxin@databricks.com>


[SPARKR-92] Phase 1: implement mean(rdd)

122d71d

hqzizania closed this Apr 3, 2015

[SPARKR-92] Phase 1: implement variance(rdd)

e7986ef

hqzizania reopened this Apr 3, 2015

[SPARKR-92] Phase 1: implement stdev(rdd) and minor fixes in variance…

3a9a602

…(rdd)

hqzizania changed the title ~~[SPARKR-92] Phase 1: implement mean(rdd)~~ [SPARKR-92] Phase 1: implement mean(rdd), stdev(rdd), variance(rdd) Apr 3, 2015

Add test case for stdev(rdd)

db68fea

shivaram reviewed Apr 6, 2015

hqzizania mentioned this pull request Apr 10, 2015

[SPARK-6841] [SPARKR] add support for mean, median, stdev etc. apache/spark#5446

Closed

shivaram closed this Apr 10, 2015
