-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Fix: spark bit_count function #18322
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| let value_array = value_array[0].as_ref(); | ||
| match value_array.data_type() { | ||
| DataType::Int8 => { | ||
| DataType::Int8 | DataType::Boolean => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark supports only signed int types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment seems misplaced as the code adds support for boolean 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think this code will now panic if you pass in a Boolean array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to note that Spark only supports signed integer and boolean types as input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added an additional test for BooleanArray.
## Which issue does this PR close? Closes apache#18225 ## Rationale for this change After adding the bit_count function in Comet, we got different results from Spark. (apache/datafusion-comet#2553) ## Are these changes tested? Tested with existing unit tests --------- Co-authored-by: Kazantsev Maksim <mn.kazantsev@gmail.com> Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
## Which issue does this PR close? Closes apache#18225 ## Rationale for this change After adding the bit_count function in Comet, we got different results from Spark. (apache/datafusion-comet#2553) ## Are these changes tested? Tested with existing unit tests --------- Co-authored-by: Kazantsev Maksim <mn.kazantsev@gmail.com> Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
|
Sorry for reviving this one. We had a need for Counting the bits of a 32-bit value and getting back 64 is not what you would expect at all. Is that really how Spark works? It's a bit underspecced (or at least ambiguous) in the Spark documentation. Just for comparison, here's DuckDB's answer to the same question |
|
Figured it out in the meantime. Spark's implementation (like, for instance also SQL Server's) always operates on |
## Which issue does this PR close? - Followup to #18225 and PR #18322 ## Rationale for this change Spark's `bit_count` function always operators on 64-bit values, while the original `bit_count` implementation in `datafusion_spark` operated on the native size of the input value. In order to fix this a custom bit counting implementation was ported over from the Java Spark implementation. This isn't really necessary though. Widening signed integers to `i64` and then using `i64::count_ones` will get you the exact same result and is less obscure. ## What changes are included in this PR? Remove custom `bitcount` logic and use `i64::count_ones` instead. ## Are these changes tested? Covered by existing tests that were added for #18225 ## Are there any user-facing changes? No
…8841) ## Which issue does this PR close? - Followup to apache#18225 and PR apache#18322 ## Rationale for this change Spark's `bit_count` function always operators on 64-bit values, while the original `bit_count` implementation in `datafusion_spark` operated on the native size of the input value. In order to fix this a custom bit counting implementation was ported over from the Java Spark implementation. This isn't really necessary though. Widening signed integers to `i64` and then using `i64::count_ones` will get you the exact same result and is less obscure. ## What changes are included in this PR? Remove custom `bitcount` logic and use `i64::count_ones` instead. ## Are these changes tested? Covered by existing tests that were added for apache#18225 ## Are there any user-facing changes? No
Which issue does this PR close?
Closes #18225
Rationale for this change
After adding the bit_count function in Comet, we got different results from Spark. (apache/datafusion-comet#2553)
Are these changes tested?
Tested with existing unit tests