SPARK-6480 [CORE] histogram() bucket function is wrong in some simple edge cases #5148

srowen · 2015-03-23T23:22:41Z

Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix existing one accordingly


SparkQA · 2015-03-23T23:28:06Z

Test build #29039 has started for PR 5148 at commit 23ec01e.

This patch merges cleanly.

SparkQA · 2015-03-24T00:46:42Z

Test build #29039 has finished for PR 5148 at commit 23ec01e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-03-24T00:46:46Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29039/
Test PASSed.

sryza · 2015-03-25T14:38:39Z

core/src/main/scala/org/apache/spark/rdd/DoubleRDDFunctions.scala

        None
      } else {
-        Some(bucketNumber.toInt.min(count - 1))
+        val bucketNumber = (((e - min) / (max - min)) * count).toInt


max - min should stay constant, so I think we could make this decently faster by precomputing (count / (max - min)) and multiplying by it. Maybe the compiler makes this kind of optimization, but I certainly wouldn't count on it. Would that give us the same problem as before?

srowen · 2015-03-25

My gut was that it would be more accurate to compute the ratio of two potentially Huge numbers first, then multiply by something Small, rather than compute the ratio of Small-to-Huge then multiply by a Huge number. If you try min = 0, max = 1e20, count = 1000000000 (that's 10^9), e = 1e11, you get 1 from this expression (correct) whereas the alternative says 0.
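For readers who want to reproduce the numbers quoted above, here is a minimal sketch of the two orderings under discussion. The object and method names (`BucketOrderingSketch`, `ratioFirst`, `scaleFirst`) are hypothetical and exist only for this illustration; they are not the names used in DoubleRDDFunctions.scala, and the `None` handling for out-of-range values is left out.

```scala
object BucketOrderingSketch {
  // Ratio first, then scale by the bucket count: the form in the "+" line of the diff.
  def ratioFirst(e: Double, min: Double, max: Double, count: Int): Int =
    (((e - min) / (max - min)) * count).toInt

  // Precompute count / (max - min) and multiply by it: the suggested optimization.
  // In real code the factor would be computed once, outside the per-element function.
  def scaleFirst(e: Double, min: Double, max: Double, count: Int): Int = {
    val bucketsPerUnit = count / (max - min)
    ((e - min) * bucketsPerUnit).toInt
  }

  def main(args: Array[String]): Unit = {
    // Values taken from the comment above.
    val min = 0.0
    val max = 1e20
    val count = 1000000000 // 10^9 buckets
    val e = 1e11

    println(ratioFirst(e, min, max, count)) // prints 1, the correct bucket
    println(scaleFirst(e, min, max, count)) // prints 0, off by one bucket
  }
}
```

The difference comes from rounding: 1e11 / 1e20 rounds to a double slightly above 1e-9, so multiplying by 10^9 lands on 1.0, while 10^9 / 1e20 rounds to a double slightly below 1e-11, so multiplying by 1e11 lands just under 1.0 and truncates to 0.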

srowen · 2015-03-26T12:58:49Z

@FRosner this path is tested by the existing unit tests, but I'll add another test per my last comment, and some comments here.

FRosner · 2015-03-26T13:01:29Z

Thanks. Unit tests and scala doc help to document the interface / parameters and expected behaviour of a function so I like them a lot 😄


SparkQA · 2015-03-26T13:23:29Z

Test build #29231 has started for PR 5148 at commit 974a0a0.

This patch merges cleanly.

SparkQA · 2015-03-26T14:47:46Z

Test build #29231 has finished for PR 5148 at commit 974a0a0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-03-26T14:47:50Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29231/
Test PASSed.

… edge cases

Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix existing one accordingly

Author: Sean Owen <sowen@cloudera.com>

Closes #5148 from srowen/SPARK-6480 and squashes the following commits:

974a0a0 [Sean Owen] Additional test of huge ranges, and a few more comments (and comment fixes)
23ec01e [Sean Owen] Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix existing one accordingly

(cherry picked from commit fe15ea9)
Signed-off-by: Sean Owen <sowen@cloudera.com>

sryza · 2015-03-27T01:28:30Z

( LGTM as well )


23ec01e Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix existing one accordingly

sryza reviewed Mar 25, 2015

974a0a0 Additional test of huge ranges, and a few more comments (and comment fixes)

asfgit closed this in fe15ea9 Mar 26, 2015

srowen deleted the SPARK-6480 branch March 26, 2015 15:02
