[SPARKR-92] Phase 1: implement mean(rdd), stdev(rdd), variance(rdd) #241

Closed
wants to merge 4 commits into from

Conversation

hqzizania
Contributor

No description provided.

@hqzizania hqzizania closed this Apr 3, 2015
@hqzizania hqzizania reopened this Apr 3, 2015
@hqzizania hqzizania changed the title [SPARKR-92] Phase 1: implement mean(rdd) [SPARKR-92] Phase 1: implement mean(rdd), stdev(rdd), variance(rdd) Apr 3, 2015
@concretevitamin
Member

I think the current approach is a little bit hard to maintain. Should we use some lightweight OOP for StatCounter? That way mergeStats() would be easier to understand and easily extensible (we may also need sampleStdev() and sampleVariance() etc.).

We could move the design discussion to JIRA.

@hqzizania
Contributor Author

The functions "sampleStdev" and "sampleVariance" could also be implemented in only a few lines of code by reusing mergeStats() from this patch. Moreover, the "histogram" function may require a lot of additional code that is not part of StatCounter, as in pyspark. So I think OOP for StatCounter may not be necessary at this stage.

@hafen
Contributor

hafen commented Apr 5, 2015

If you want a more generic mergeStats() under the hood, you could follow the approach outlined in this paper: http://janinebennett.org/index_files/ParallelStatisticsAlgorithms.pdf, which provides a numerically stable algorithm for calculating moments of any order. You could expose sampleMean(), sampleVariance(), even sampleKurtosis(), etc. to the user and have them all call the same common mergeStats().
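
To make the idea concrete, here is a minimal sketch of a pairwise merge restricted to the second moment. It uses the well-known pairwise update for count, mean, and sum of squared deviations; it is not taken from this patch. The names `statsOf`, `mergeMoments`, `popVariance`, and `sampleVariance`, and the fields `n`, `mean`, and `m2`, are hypothetical and chosen only for illustration.

```r
# Hypothetical sketch: summarise one partition as a count, a mean, and
# M2 (the sum of squared deviations from that mean).
statsOf <- function(x) {
  m <- mean(x)
  list(n = length(x), mean = m, m2 = sum((x - m)^2))
}

# Merge two partial summaries without ever forming sum(x^2),
# which is what makes the update numerically stable.
mergeMoments <- function(a, b) {
  if (a$n == 0) return(b)
  if (b$n == 0) return(a)
  n <- a$n + b$n
  delta <- b$mean - a$mean
  list(n = n,
       mean = a$mean + delta * b$n / n,
       m2 = a$m2 + b$m2 + delta^2 * a$n * b$n / n)
}

# Population and sample variance are read off the same merged summary.
popVariance <- function(s) s$m2 / s$n
sampleVariance <- function(s) s$m2 / (s$n - 1)

# Three "partitions" standing in for the pieces of an RDD.
parts <- list(c(1, 2, 3), c(4, 5), c(6, 7, 8, 9))
merged <- Reduce(mergeMoments, lapply(parts, statsOf))

# Agrees with base R's var() computed on the combined data.
all.equal(sampleVariance(merged), var(unlist(parts)))
```

Under these assumptions, higher-order statistics such as skewness or kurtosis would add further fields to the summary and further terms to the merge, while the user-facing functions stay one-liners over the merged result.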

@@ -419,3 +419,21 @@ cleanClosure <- function(func, checkedFuncs = new.env()) {
}
func
}

# Merge another StatCounter into this one, adding up the internal statistics.
Contributor

We could use some comments here describing the structure of x and y (i.e. what x[1] and x[2] are). Also, a lightweight way to get OOP is to use named lists such as list(sum=x1, sumsq=x2).

Member

+1 on named list
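
For illustration, a sketch of what the named-list suggestion could look like, with each field accessed by name rather than by position. The field names `count`, `sum`, and `sumsq` and the helpers `newStats` and `addStats` are assumptions made for this example; they are not the layout used in this patch.

```r
# Hypothetical sketch: a partial summary as a named list, so readers
# see s$sum and s$sumsq instead of x[1] and x[2].
newStats <- function(x) {
  list(count = length(x), sum = sum(x), sumsq = sum(x^2))
}

# Merging two summaries is a field-by-field addition.
addStats <- function(a, b) {
  list(count = a$count + b$count,
       sum = a$sum + b$sum,
       sumsq = a$sumsq + b$sumsq)
}

s <- addStats(newStats(c(1, 2, 3)), newStats(c(4, 5)))

statMean <- s$sum / s$count                   # mean of the combined data
statVariance <- s$sumsq / s$count - statMean^2  # population variance
```

The sum-of-squares form keeps the merge trivial, but it can lose precision when the mean is large relative to the spread of the data.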

@shivaram
Contributor

@hqzizania Would it be possible to re-open this PR against the main Spark repo now? I had planned to do an automatic move for already open PRs, but I feel we have some more things to discuss for this one (like @davies's suggestion of using DataFrames). I've already created a JIRA on the Spark side at
https://issues.apache.org/jira/browse/SPARK-6841

Let me know if this sounds good.

@hqzizania
Contributor Author

@shivaram I've opened it against the main Spark repo. However, the fixes, improvements, and replies to suggestions may be delayed because I don't have time at the moment.

@shivaram
Contributor

@hqzizania - Thanks for the heads up. We can first discuss the design choices and then come to the implementation. I'm going to close the PR here (we can backport it later, though I guess it's not required as this is a new feature rather than a bug fix).

@shivaram shivaram closed this Apr 10, 2015
asfgit pushed a commit to apache/spark that referenced this pull request May 6, 2015
Moving here from amplab-extras/SparkR-pkg#241
sum() has been implemented. (amplab-extras/SparkR-pkg#242)

Now Phase 1: mean, sd, var have been implemented, but some things still need to be improved with the suggestions in https://issues.apache.org/jira/browse/SPARK-6841

Author: qhuang <qian.huang@intel.com>

Closes #5446 from hqzizania/R and squashes the following commits:

f283572 [qhuang] add test unit for describe()
2e74d5a [qhuang] add describe() DataFrame API

(cherry picked from commit a466944)
Signed-off-by: Reynold Xin <rxin@databricks.com>
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015

5 participants