[SPARKR-92] Phase 1: implement mean(rdd), stdev(rdd), variance(rdd) #241
Conversation
I think the current approach is a little hard to maintain. Should we use some lightweight OOP for StatCounter? We could move the design discussion to JIRA.
The functions "sampleStdev" and "sampleVariance" could also be implemented in only a few lines of code by reusing mergeStats() from this patch. Moreover, the "histogram" function may require a lot of additional code that is not in StatCounter, as in pyspark. Thus, I suppose OOP for StatCounter may not be necessary at this stage.
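To illustrate the point about deriving the sample statistics from a merged summary, here is a minimal sketch. It is not the patch's mergeStats(): the helper names statsOf and combineStats and the fields n, mu, and m2 are assumptions made for this example, and each input vector is assumed to be non-empty.

```r
# Hypothetical sketch, not the patch's mergeStats(): a summary is a named list
# with the count (n), the mean (mu), and the sum of squared deviations (m2).
statsOf <- function(values) {
  mu <- mean(values)
  list(n = length(values), mu = mu, m2 = sum((values - mu)^2))
}

# Combine two summaries with the pairwise update for the mean and m2.
combineStats <- function(x, y) {
  n <- x$n + y$n
  delta <- y$mu - x$mu
  list(n = n,
       mu = x$mu + delta * y$n / n,
       m2 = x$m2 + y$m2 + delta^2 * x$n * y$n / n)
}

# Population variance divides by n; sample variance divides by n - 1.
variance <- function(s) s$m2 / s$n
sampleVariance <- function(s) s$m2 / (s$n - 1)
sampleStdev <- function(s) sqrt(sampleVariance(s))

merged <- combineStats(statsOf(c(1, 2, 3)), statsOf(c(4, 5)))
sampleVariance(merged)  # should agree with var(c(1, 2, 3, 4, 5))
```

Once the merge step exists, the sample variants differ from the population ones only in the divisor, which is why they need just a line or two each.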
If you want a more generic |
@@ -419,3 +419,21 @@ cleanClosure <- function(func, checkedFuncs = new.env()) {
  }
  func
}

# Merge another StatCounter into this one, adding up the internal statistics.
We could use some comments here as to what structure x and y have (i.e. what is x[1] and x[2]). Also a light-weight method to get OOP is to use named lists like list(sum=x1, sumsq=x2) etc.
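The named-list suggestion could look roughly like the following. This is a minimal sketch under stated assumptions, not the code from this patch: the helper names newStats and mergeNamedStats and the field n are hypothetical, while sum and sumsq follow the field names in the comment above.

```r
# Hypothetical helpers, not the patch's code: each summary is a named list
# holding the count (n), the sum, and the sum of squares of the values seen.
newStats <- function(values) {
  list(n = length(values), sum = sum(values), sumsq = sum(values^2))
}

# Merge two summaries by adding the corresponding named fields.
mergeNamedStats <- function(x, y) {
  list(n = x$n + y$n, sum = x$sum + y$sum, sumsq = x$sumsq + y$sumsq)
}

a <- newStats(c(1, 2, 3))
b <- newStats(c(4, 5))
merged <- mergeNamedStats(a, b)
merged$sum / merged$n  # mean over all five values
```

Accessing fields by name (x$sum rather than x[1]) documents the structure at each use site, which is the maintainability gain the reviewers are asking for.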
+1 on named list
@hqzizania Would it be possible to re-open this PR against the main Spark repo now? I did plan to do an automatic move for already open PRs, but I feel we have some more things to discuss for this one (like @davies's suggestion of using DataFrames). I've already created a JIRA on the Spark side. Let me know if this sounds good.
@shivaram I've opened it against the main Spark repo. But the fixes, improvements and replies to suggestions could be delayed because I have no time at this moment.
@hqzizania - Thanks for the heads up. We can first discuss the design choices and then come to the implementation. I'm going to close the PR here (we can backport it later, though I guess it's not required as this is a new feature rather than a bug fix).
Moving here from amplab-extras/SparkR-pkg#241
sum() has been implemented. (amplab-extras/SparkR-pkg#242)
Now Phase 1: mean, sd, var have been implemented, but some things still need to be improved with the suggestions in https://issues.apache.org/jira/browse/SPARK-6841

Author: qhuang <qian.huang@intel.com>

Closes #5446 from hqzizania/R and squashes the following commits:

f283572 [qhuang] add test unit for describe()
2e74d5a [qhuang] add describe() DataFrame API

(cherry picked from commit a466944)
Signed-off-by: Reynold Xin <rxin@databricks.com>