-
Notifications
You must be signed in to change notification settings - Fork 28.6k
SPARK-6480 [CORE] histogram() bucket function is wrong in some simple edge cases #5148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… correctly. Add a test, and fix existing one accordingly
Test build #29039 has started for PR 5148 at commit
|
Test build #29039 has finished for PR 5148 at commit
|
Test PASSed. |
None | ||
} else { | ||
Some(bucketNumber.toInt.min(count - 1)) | ||
val bucketNumber = (((e - min) / (max - min)) * count).toInt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
max - min should stay constant, so I think we could make this decently faster by precomputing (count / (max - min)) and multiplying by it. Maybe the compiler makes this kind of optimization, but I certainly wouldn't count on it. Would that give us the same problem as before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My gut was that it would be more accurate to compute the ratio of two potentially Huge numbers first, then multiply by something Small, rather than compute the ratio of Small-to-Huge then multiply by a Huge number. If you try min = 0, max = 1e20, count = 1000000000 (thats 10^9), e = 1e11, you get 1 from this expression (correct) whereas the alternative says 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed
@FRosner this path is tested by the existing unit tests, but I'll add another test per my last comment, and some comments here. |
Thanks. Unit tests and scala doc help to document the interface / parameters and expected behaviour of a function so I like them a lot 😄 |
Test build #29231 has started for PR 5148 at commit
|
Test build #29231 has finished for PR 5148 at commit
|
Test PASSed. |
… edge cases Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix existing one accordingly Author: Sean Owen <sowen@cloudera.com> Closes #5148 from srowen/SPARK-6480 and squashes the following commits: 974a0a0 [Sean Owen] Additional test of huge ranges, and a few more comments (and comment fixes) 23ec01e [Sean Owen] Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix existing one accordingly (cherry picked from commit fe15ea9) Signed-off-by: Sean Owen <sowen@cloudera.com>
… edge cases Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix existing one accordingly Author: Sean Owen <sowen@cloudera.com> Closes #5148 from srowen/SPARK-6480 and squashes the following commits: 974a0a0 [Sean Owen] Additional test of huge ranges, and a few more comments (and comment fixes) 23ec01e [Sean Owen] Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix existing one accordingly (cherry picked from commit fe15ea9) Signed-off-by: Sean Owen <sowen@cloudera.com>
( LGTM as well ) |
… edge cases Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix existing one accordingly Author: Sean Owen <sowen@cloudera.com> Closes apache#5148 from srowen/SPARK-6480 and squashes the following commits: 974a0a0 [Sean Owen] Additional test of huge ranges, and a few more comments (and comment fixes) 23ec01e [Sean Owen] Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix existing one accordingly (cherry picked from commit fe15ea9) Signed-off-by: Sean Owen <sowen@cloudera.com>
Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix existing one accordingly