
Conversation

@kazantsev-maksim
Contributor

Which issue does this PR close?

Closes #18225

Rationale for this change

After adding the bit_count function in Comet, we got different results from Spark. (apache/datafusion-comet#2553)

Are these changes tested?

Tested with existing unit tests

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) spark labels Oct 28, 2025
```diff
 let value_array = value_array[0].as_ref();
 match value_array.data_type() {
-    DataType::Int8 => {
+    DataType::Int8 | DataType::Boolean => {
```
Contributor Author

Spark supports only signed int types

Contributor

This comment seems misplaced as the code adds support for boolean 🤔

Contributor

I also think this code will now panic if you pass in a Boolean array

Contributor Author

I wanted to note that Spark only supports signed integer and boolean types as input.

Contributor Author

Added an additional test for BooleanArray.

@Jefffrey Jefffrey added this pull request to the merge queue Nov 2, 2025
Merged via the queue into apache:main with commit d445b61 Nov 2, 2025
28 checks passed
tobixdev pushed a commit to tobixdev/datafusion that referenced this pull request Nov 2, 2025
## Which issue does this PR close?

Closes apache#18225

## Rationale for this change

After adding the bit_count function in Comet, we got different results
from Spark. (apache/datafusion-comet#2553)

## Are these changes tested?

Tested with existing unit tests

---------

Co-authored-by: Kazantsev Maksim <mn.kazantsev@gmail.com>
Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
codetyri0n pushed a commit to codetyri0n/datafusion that referenced this pull request Nov 11, 2025
@pepijnve
Contributor

Sorry for reviving this one. We had a need for bit_count in our codebase as well, so I was looking at the Spark implementation. This test seems really bizarre:

```rust
#[test]
fn test_bit_count_int32() {
    // Test bit_count on Int32Array
    let result =
        spark_bit_count(&[Arc::new(Int32Array::from(vec![0i32, 1, 255, 1023, -1]))])
            .unwrap();

    let arr = result.as_primitive::<Int32Type>();
...
    assert_eq!(arr.value(4), 64); // -1 in two's complement = all 32 bits set
}
```

Counting the bits of a 32-bit value and getting back 64 is not what you would expect at all. Is that really how Spark works? The behavior is under-specified (or at least ambiguous) in the Spark documentation.

Just for comparison, here's DuckDB's answer to the same question

```
D select bit_count(cast(-1 as int));
┌────────────────────────────────┐
│ bit_count(CAST(-1 AS INTEGER)) │
│              int8              │
├────────────────────────────────┤
│               32               │
└────────────────────────────────┘
```

@pepijnve
Contributor

pepijnve commented Nov 20, 2025

Figured it out in the meantime. Spark's implementation (like SQL Server's, for instance) always operates on BIGINT (i.e. 64-bit). I logged a PR to simplify the implementation.

github-merge-queue bot pushed a commit that referenced this pull request Nov 21, 2025
## Which issue does this PR close?

- Followup to #18225 and PR #18322

## Rationale for this change

Spark's `bit_count` function always operates on 64-bit values, while
the original `bit_count` implementation in `datafusion_spark` operated
on the native size of the input value.
In order to fix this, a custom bit-counting implementation was ported
over from the Java Spark implementation. This isn't really necessary,
though: widening signed integers to `i64` and then using
`i64::count_ones` gets you the exact same result and is less obscure.

## What changes are included in this PR?

Remove custom `bitcount` logic and use `i64::count_ones` instead.

## Are these changes tested?

Covered by existing tests that were added for #18225

## Are there any user-facing changes?

No
logan-keede pushed a commit to logan-keede/datafusion that referenced this pull request Nov 23, 2025
Successfully merging this pull request may close these issues.

bit_count in spark crate not fully compatible with spark