Skip to content

Fix distinct count for DictionaryArray to correctly account for nulls in values array #16258

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 24 commits into from
Jun 9, 2025

Conversation

kosiew
Copy link
Contributor

@kosiew kosiew commented Jun 5, 2025

Which issue does this PR close?

- Closes #16228

Rationale for this change

Array::is_null does not correctly identify nulls for DictionaryArray when the indices point to nulls in the values array. This causes incorrect results in aggregation queries such as count(distinct ...), which should skip nulls but currently may include them due to improper null handling. The change ensures nulls in dictionary values are correctly detected and excluded.

Arrow's hands are tied on this matter and so we are fixing the issue in this repo.

What changes are included in this PR?

  • Corrects the logic in DistinctCountAccumulator to properly skip null dictionary values.
  • Adds integration tests validating COUNT(DISTINCT) for dictionary arrays:
    • With all null values.
    • With a mix of null and non-null values.
  • Adds new unit tests in count.rs to exercise dictionary scenarios more thoroughly.
  • Updates SQL logic test (aggregate.slt) to include dictionary-null COUNT(DISTINCT) behavior.
  • Cleans up imports and makes related refactors to support the added logic and tests.

Are these changes tested?

Yes ✅

  • Multiple new unit tests added in functions-aggregate/src/count.rs.
  • New integration tests in aggregates.rs.
  • SQLogicTest file updated to cover dictionary null cases.

Are there any user-facing changes?

Yes. Users will now observe correct results when using count or count(distinct) on dictionary-encoded columns with null values. No API changes are introduced, but behavior is now aligned with expected SQL aggregation semantics.

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Jun 5, 2025
@blaginin
Copy link
Contributor

blaginin commented Jun 5, 2025

Thank you @kosiew! Do you mind also changing https://github.com/apache/datafusion/pull/16232/files#diff-08d7a1f4d6a968c393a2a0f2a2f54118f38d6a29009ce31b261f3ca27a2d3396R733 and making sure the fuzzy tests still pass? 🙏🏻

Comment on lines +5035 to +5044
create table dict_null_test as
select arrow_cast(NULL, 'Dictionary(Int32, Utf8)') as d
from (values (1), (2), (3), (4), (5));

query I
select count(distinct d) from dict_null_test;
----
0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this passes on main currently

DataFusion CLI v48.0.0
> create table dict_null_test as
    select arrow_cast(NULL, 'Dictionary(Int32, Utf8)') as d
    from (values (1), (2), (3), (4), (5));
0 row(s) fetched. 
Elapsed 0.024 seconds.

> select count(distinct d) from dict_null_test;
+----------------------------------+
| count(DISTINCT dict_null_test.d) |
+----------------------------------+
| 0                                |
+----------------------------------+
1 row(s) fetched. 
Elapsed 0.016 seconds.

Copy link
Contributor

@jonathanc-n jonathanc-n Jun 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can generate a slt test that would match this behaviour; I believe sql can't express the case where indices + values are separated the way they are as outlined in the issue. I think we should just have a regular non-slt test for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was struggling with how to create the slt test.
Thanks @jonathanc-n for letting me know that it's not possible.
I added regular tests instead.

@@ -711,8 +711,8 @@ impl Accumulator for DistinctCountAccumulator {
}

(0..arr.len()).try_for_each(|index| {
if !arr.is_null(index) {
let scalar = ScalarValue::try_from_array(arr, index)?;
let scalar = ScalarValue::try_from_array(arr, index)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

Copy link
Contributor

@jonathanc-n jonathanc-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me @kosiew, I tested myself and they fail -> pass accordingly! I think keeping the slt test is fine just to prevent any regression for this null case, however I'm not sure whether or not we should add the reproducible that was put in the issue. WDYT @blaginin @kosiew?

@blaginin blaginin self-requested a review June 9, 2025 07:37
@kosiew
Copy link
Contributor Author

kosiew commented Jun 9, 2025

While working on trying to reproduce the case in the #16228, I found that #15871 fixes the example case the #16228 in this comment.

@kosiew kosiew closed this Jun 9, 2025
@kosiew kosiew reopened this Jun 9, 2025
@github-actions github-actions bot added the core Core DataFusion crate label Jun 9, 2025
@alamb
Copy link
Contributor

alamb commented Jun 9, 2025

While working on trying to reproduce the case in the #16228, I found that #15871 fixes the example case the #16228 in this comment.

I am not quite sure what to do now. Do you still think we should merge the PR?

@kosiew
Copy link
Contributor Author

kosiew commented Jun 9, 2025

@alamb ,
#15871 does not fix #16339
I amended the PR details, - this PR now closes #16339
So, this PR is still needed

@alamb
Copy link
Contributor

alamb commented Jun 9, 2025

Thank you @kosiew

Copy link
Contributor

@blaginin blaginin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! Great job digging further earlier today 🥇

A cool follow-up could be to check where else we do .is_null() on dicts and might have the same issue. It's on my todo list, but feel free to steal it if you enjoyed fixing this bug 😁

@blaginin blaginin merged commit bd85bed into apache:main Jun 9, 2025
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

COUNT and COUNT DISTINCT produce incorrect results for dictionary arrays with null values Incorrect count null in dict values
5 participants