Fix distinct count for DictionaryArray to correctly account for nulls in values array #16258

kosiew · 2025-06-05T10:31:32Z

Which issue does this PR close?

~~- Closes #16228~~

- Closes #16339 (COUNT and COUNT DISTINCT produce incorrect results for dictionary arrays with null values)

Rationale for this change

Array::is_null checks only the array's own null buffer, so for a DictionaryArray it reports a row as null only when the key (index) itself is null. It does not detect a valid key that points to a null entry in the values array. As a result, aggregations such as count(distinct ...), which should skip nulls, could count such rows. This change ensures that nulls in the dictionary values are detected and excluded.

Arrow cannot change this behavior on its side, so the issue is fixed in this repository.
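To make the rationale concrete, here is a minimal sketch in plain Rust that does not use the arrow crate. `DictColumn`, `is_null_physical`, `is_null_logical`, and `example_column` are hypothetical names introduced only for illustration. The model assumes a key-level null check that never consults the values array, which is the behavior described above.

```rust
/// Simplified stand-in for a dictionary-encoded string column.
/// This is NOT Arrow's `DictionaryArray`; it only mirrors the keys/values split.
struct DictColumn {
    /// One entry per row: `None` is a null key, `Some(i)` points into `values`.
    keys: Vec<Option<usize>>,
    /// The dictionary itself; individual entries may be null.
    values: Vec<Option<String>>,
}

impl DictColumn {
    /// Key-level check: true only when the key slot itself is null.
    fn is_null_physical(&self, row: usize) -> bool {
        self.keys[row].is_none()
    }

    /// Logical check: also true when a valid key points at a null dictionary entry.
    fn is_null_logical(&self, row: usize) -> bool {
        match self.keys[row] {
            None => true,
            Some(i) => self.values[i].is_none(),
        }
    }
}

/// Hypothetical example column with rows: "a", NULL (via values), NULL (via keys).
fn example_column() -> DictColumn {
    DictColumn {
        keys: vec![Some(0), Some(1), None],
        values: vec![Some("a".to_string()), None],
    }
}
```

In this model, row 1 is the problematic case: the key-level check says it is not null, while the logical check says it is.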

What changes are included in this PR?

- Corrects the logic in DistinctCountAccumulator to properly skip null dictionary values.
- Adds integration tests validating COUNT(DISTINCT) for dictionary arrays:
  - With all null values.
  - With a mix of null and non-null values.
- Adds new unit tests in count.rs to exercise dictionary scenarios more thoroughly.
- Updates SQL logic test (aggregate.slt) to include dictionary-null COUNT(DISTINCT) behavior.
- Cleans up imports and makes related refactors to support the added logic and tests.
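The accumulator change listed above can be sketched as follows. This is a toy model under stated assumptions, not DataFusion's implementation: `DistinctCount` is a hypothetical type that takes dictionary keys and values as plain slices, whereas the real `DistinctCountAccumulator` operates on Arrow arrays and `ScalarValue`s. The point it illustrates is the order of operations: resolve each row to its logical value first, then decide whether it is null.

```rust
use std::collections::HashSet;

/// Toy distinct-count accumulator (hypothetical; for illustration only).
#[derive(Default)]
struct DistinctCount {
    seen: HashSet<String>,
}

impl DistinctCount {
    /// Feed one batch, given as dictionary keys plus dictionary values.
    fn update_batch(&mut self, keys: &[Option<usize>], values: &[Option<String>]) {
        for &key in keys {
            // Resolve the row through the dictionary before testing for null,
            // so that a valid key pointing at a null entry is skipped too.
            let resolved = key.and_then(|i| values[i].as_ref());
            if let Some(v) = resolved {
                self.seen.insert(v.clone());
            }
        }
    }

    /// Number of distinct non-null values seen so far.
    fn evaluate(&self) -> usize {
        self.seen.len()
    }
}
```

With this ordering, null keys and null dictionary entries take the same path and neither contributes to the count.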

Are these changes tested?

Yes ✅

- Multiple new unit tests added in functions-aggregate/src/count.rs.
- New integration tests in aggregates.rs.
- SQLogicTest file updated to cover dictionary null cases.

Are there any user-facing changes?

Yes. Users will now observe correct results when using count or count(distinct) on dictionary-encoded columns with null values. No API changes are introduced, but behavior is now aligned with expected SQL aggregation semantics.

blaginin · 2025-06-05T15:22:23Z

Thank you @kosiew! Do you mind also changing https://github.com/apache/datafusion/pull/16232/files#diff-08d7a1f4d6a968c393a2a0f2a2f54118f38d6a29009ce31b261f3ca27a2d3396R733 and making sure the fuzzy tests still pass? 🙏🏻

blaginin · 2025-06-05T15:26:21Z

datafusion/sqllogictest/test_files/aggregate.slt

+create table dict_null_test as
+    select arrow_cast(NULL, 'Dictionary(Int32, Utf8)') as d
+    from (values (1), (2), (3), (4), (5));
+
+query I
+select count(distinct d) from dict_null_test;
+----
+0


I think this passes on main currently

DataFusion CLI v48.0.0
> create table dict_null_test as select arrow_cast(NULL, 'Dictionary(Int32, Utf8)') as d from (values (1), (2), (3), (4), (5));
0 row(s) fetched.
Elapsed 0.024 seconds.

> select count(distinct d) from dict_null_test;
+----------------------------------+
| count(DISTINCT dict_null_test.d) |
+----------------------------------+
| 0                                |
+----------------------------------+
1 row(s) fetched.
Elapsed 0.016 seconds.

jonathanc-n · 2025-06-05

I don't think we can write an slt test that reproduces this behaviour; I believe SQL can't express the case where the indices and values are separated the way the issue describes. I think we should just have a regular non-slt test for this.

kosiew · 2025-06-06

I was struggling with how to create the slt test. Thanks @jonathanc-n for letting me know that it's not possible. I added regular tests instead.
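One way to picture why the slt case already passes on main while the case from the issue does not: the same all-null column can be laid out in two ways. The sketch below is plain Rust with hypothetical helper names (`distinct_with_key_check`, `distinct_with_logical_check`). It assumes that the SQL-built column records its nulls in the keys, which would be consistent with the observation above, and that the failing case has valid keys pointing at a null dictionary entry.

```rust
use std::collections::HashSet;

/// Models a key-level null check: rows with null keys are skipped, but every
/// other row is resolved and inserted, even if the resolved value is null.
fn distinct_with_key_check(keys: &[Option<usize>], values: &[Option<&str>]) -> usize {
    keys.iter()
        .flatten()
        .map(|&i| values[i])
        .collect::<HashSet<Option<&str>>>()
        .len()
}

/// Models a logical null check: rows whose resolved value is null are skipped too.
fn distinct_with_logical_check(keys: &[Option<usize>], values: &[Option<&str>]) -> usize {
    keys.iter()
        .flatten()
        .filter_map(|&i| values[i])
        .collect::<HashSet<&str>>()
        .len()
}
```

For five rows with null keys and an empty dictionary, both functions return 0. For five valid keys that all point at a single null dictionary entry, the key-level version returns 1 and the logical version returns 0. A test can only exercise the second layout if it builds the keys and values directly, which is what a regular Rust test allows.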

alamb · 2025-06-05T19:32:33Z

datafusion/functions-aggregate/src/count.rs

@@ -711,8 +711,8 @@ impl Accumulator for DistinctCountAccumulator {
        }

        (0..arr.len()).try_for_each(|index| {
-            if !arr.is_null(index) {
-                let scalar = ScalarValue::try_from_array(arr, index)?;
+            let scalar = ScalarValue::try_from_array(arr, index)?;


jonathanc-n

This looks good to me @kosiew. I tested it myself and the tests go from failing to passing as expected! I think keeping the slt test is fine, just to prevent any regression for this null case. However, I'm not sure whether we should add the reproducer from the issue. WDYT @blaginin @kosiew?

kosiew · 2025-06-09T08:02:32Z

While trying to reproduce the case in #16228, I found that #15871 already fixes the example case from #16228 described in this comment.

alamb · 2025-06-09T12:50:53Z

> While trying to reproduce the case in #16228, I found that #15871 already fixes the example case from #16228 described in this comment.

I am not quite sure what to do now. Do you still think we should merge the PR?

kosiew · 2025-06-09T13:10:51Z

@alamb, #15871 does not fix #16339. I amended the PR description so that this PR now closes #16339, so this PR is still needed.

alamb · 2025-06-09T13:27:53Z

Thank you @kosiew

blaginin

Thanks for working on this! Great job digging further earlier today 🥇

A cool follow-up could be to check where else we do .is_null() on dicts and might have the same issue. It's on my todo list, but feel free to steal it if you enjoyed fixing this bug 😁

kosiew added 2 commits June 5, 2025 18:26

Add tests

3c1f883

trigger ci

15a8531

github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and functions (Changes to functions implementation) labels Jun 5, 2025

kosiew force-pushed the count-16228 branch from 9b72bb5 to 45eb116 Compare June 5, 2025 10:31

blaginin reviewed Jun 5, 2025

alamb reviewed Jun 5, 2025

kosiew mentioned this pull request Jun 6, 2025

Update Fuzz tests to include Dict with null values #16266

Open

ding-young and others added 8 commits June 6, 2025 12:12

Update tpch, clickbench, sort_tpch to mark failed queries (apache#16182)

3f1b3ce

* Move struct QueryResult to util/run.rs
* Modify benches to continue query execution even on failure
* Mark benchmark query success on output json

Adjust slttest to pass without RUST_BACKTRACE enabled (apache#16251)

fb96483

add more tests where the dict keys are not null but dict values are null

39da58d

Revert "add more tests where the dict keys are not null but dict values are null"

cf6a884

This reverts commit c745dae.

Add tests for count and count_distinct with dictionary arrays containing null values

d9f1e2c

Merge branch 'main' into count-16228

d084d30

resolve merge conflict, reorder imports

522b1aa

Add helper function to create dictionary array with non-null keys and null values

5109daf

kosiew force-pushed the count-16228 branch from ad95d47 to 5109daf Compare June 6, 2025 04:52

jonathanc-n approved these changes Jun 6, 2025

kosiew added 11 commits June 6, 2025 15:55

Add test for count_distinct accumulator with dictionary array containing all null values

58b474a

add tests to aggregate.rs

4453409

remove redundant comments in get_formatted_results function

1582972

remove redundant safety checks in count distinct dictionary test

6bcc1aa

refactor: introduce helper function for output assertion

35b8aed

test: add count distinct dictionary handling for null values

9b33ea7

Merge main

5ce96ff

refactor: streamline imports and improve code organization in count.rs

cd0c489

fix: add missing import for batches_to_string in aggregates.rs

3be1760

test: reorganize tests

5923c38

refactor: simplify COUNT(DISTINCT) tests and remove unused helper functions

46a70ba

refactor: trim redundant tests

84112c0

blaginin self-requested a review June 9, 2025 07:37

kosiew closed this Jun 9, 2025

kosiew reopened this Jun 9, 2025

github-actions bot added the core (Core DataFusion crate) label Jun 9, 2025

Merge branch 'main' into count-16228

92e1681

alamb approved these changes Jun 9, 2025

blaginin approved these changes Jun 9, 2025

Merge branch 'main' into count-16228

010d784

blaginin merged commit bd85bed into apache:main Jun 9, 2025
28 checks passed
