Skip to content

SPARK-6480 [CORE] histogram() bucket function is wrong in some simple edge cases #5148

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

srowen
Copy link
Member

@srowen srowen commented Mar 23, 2015

Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix existing one accordingly

… correctly. Add a test, and fix existing one accordingly
@SparkQA
Copy link

SparkQA commented Mar 23, 2015

Test build #29039 has started for PR 5148 at commit 23ec01e.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 24, 2015

Test build #29039 has finished for PR 5148 at commit 23ec01e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29039/
Test PASSed.

None
} else {
Some(bucketNumber.toInt.min(count - 1))
val bucketNumber = (((e - min) / (max - min)) * count).toInt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

max - min should stay constant, so I think we could make this decently faster by precomputing (count / (max - min)) and multiplying by it. Maybe the compiler makes this kind of optimization, but I certainly wouldn't count on it. Would that give us the same problem as before?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My gut was that it would be more accurate to compute the ratio of two potentially Huge numbers first, then multiply by something Small, rather than compute the ratio of Small-to-Huge then multiply by a Huge number. If you try min = 0, max = 1e20, count = 1000000000 (thats 10^9), e = 1e11, you get 1 from this expression (correct) whereas the alternative says 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

@srowen
Copy link
Member Author

srowen commented Mar 26, 2015

@FRosner this path is tested by the existing unit tests, but I'll add another test per my last comment, and some comments here.

@FRosner
Copy link
Contributor

FRosner commented Mar 26, 2015

Thanks. Unit tests and scala doc help to document the interface / parameters and expected behaviour of a function so I like them a lot 😄

@SparkQA
Copy link

SparkQA commented Mar 26, 2015

Test build #29231 has started for PR 5148 at commit 974a0a0.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 26, 2015

Test build #29231 has finished for PR 5148 at commit 974a0a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29231/
Test PASSed.

asfgit pushed a commit that referenced this pull request Mar 26, 2015
… edge cases

Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix existing one accordingly

Author: Sean Owen <sowen@cloudera.com>

Closes #5148 from srowen/SPARK-6480 and squashes the following commits:

974a0a0 [Sean Owen] Additional test of huge ranges, and a few more comments (and comment fixes)
23ec01e [Sean Owen] Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix existing one accordingly

(cherry picked from commit fe15ea9)
Signed-off-by: Sean Owen <sowen@cloudera.com>
asfgit pushed a commit that referenced this pull request Mar 26, 2015
… edge cases

Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix existing one accordingly

Author: Sean Owen <sowen@cloudera.com>

Closes #5148 from srowen/SPARK-6480 and squashes the following commits:

974a0a0 [Sean Owen] Additional test of huge ranges, and a few more comments (and comment fixes)
23ec01e [Sean Owen] Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix existing one accordingly

(cherry picked from commit fe15ea9)
Signed-off-by: Sean Owen <sowen@cloudera.com>
@asfgit asfgit closed this in fe15ea9 Mar 26, 2015
@srowen srowen deleted the SPARK-6480 branch March 26, 2015 15:02
@sryza
Copy link
Contributor

sryza commented Mar 27, 2015

( LGTM as well )

markhamstra pushed a commit to markhamstra/spark that referenced this pull request Apr 15, 2015
… edge cases

Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix existing one accordingly

Author: Sean Owen <sowen@cloudera.com>

Closes apache#5148 from srowen/SPARK-6480 and squashes the following commits:

974a0a0 [Sean Owen] Additional test of huge ranges, and a few more comments (and comment fixes)
23ec01e [Sean Owen] Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix existing one accordingly

(cherry picked from commit fe15ea9)
Signed-off-by: Sean Owen <sowen@cloudera.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants