-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Fix distinct count for DictionaryArray to correctly account for nulls in values array #16258
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thank you @kosiew! Do you mind also changing https://github.com/apache/datafusion/pull/16232/files#diff-08d7a1f4d6a968c393a2a0f2a2f54118f38d6a29009ce31b261f3ca27a2d3396R733 and making sure the fuzzy tests still pass? 🙏🏻 |
create table dict_null_test as | ||
select arrow_cast(NULL, 'Dictionary(Int32, Utf8)') as d | ||
from (values (1), (2), (3), (4), (5)); | ||
|
||
query I | ||
select count(distinct d) from dict_null_test; | ||
---- | ||
0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this passes on main currently
DataFusion CLI v48.0.0
> create table dict_null_test as
select arrow_cast(NULL, 'Dictionary(Int32, Utf8)') as d
from (values (1), (2), (3), (4), (5));
0 row(s) fetched.
Elapsed 0.024 seconds.
> select count(distinct d) from dict_null_test;
+----------------------------------+
| count(DISTINCT dict_null_test.d) |
+----------------------------------+
| 0 |
+----------------------------------+
1 row(s) fetched.
Elapsed 0.016 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can generate a slt test that would match this behaviour; I believe sql can't express the case where indices + values are separated the way they are as outlined in the issue. I think we should just have a regular non-slt test for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was struggling with how to create the slt test.
Thanks @jonathanc-n for letting me know that it's not possible.
I added regular tests instead.
@@ -711,8 +711,8 @@ impl Accumulator for DistinctCountAccumulator { | |||
} | |||
|
|||
(0..arr.len()).try_for_each(|index| { | |||
if !arr.is_null(index) { | |||
let scalar = ScalarValue::try_from_array(arr, index)?; | |||
let scalar = ScalarValue::try_from_array(arr, index)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice
* Move struct QueryResult to util/run.rs * Modify benches to continue query execution even on failure * Mark benchmark query success on output json
…es are null" This reverts commit c745dae.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
…ing all null values
While working on trying to reproduce the case in the #16228, I found that #15871 fixes the example case the #16228 in this comment. |
I am not quite sure what to do now. Do you still think we should merge the PR? |
Thank you @kosiew |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this! Great job digging further earlier today 🥇
A cool follow-up could be to check where else we do .is_null()
on dicts and might have the same issue. It's on my todo list, but feel free to steal it if you enjoyed fixing this bug 😁
Which issue does this PR close?
- Closes #16228Rationale for this change
Array::is_null
does not correctly identify nulls forDictionaryArray
when the indices point to nulls in the values array. This causes incorrect results in aggregation queries such ascount(distinct ...)
, which should skip nulls but currently may include them due to improper null handling. The change ensures nulls in dictionary values are correctly detected and excluded.Arrow's hands are tied on this matter and so we are fixing the issue in this repo.
What changes are included in this PR?
DistinctCountAccumulator
to properly skip null dictionary values.count.rs
to exercise dictionary scenarios more thoroughly.aggregate.slt
) to include dictionary-null COUNT(DISTINCT) behavior.Are these changes tested?
Yes ✅
functions-aggregate/src/count.rs
.aggregates.rs
.Are there any user-facing changes?
Yes. Users will now observe correct results when using
count
orcount(distinct)
on dictionary-encoded columns with null values. No API changes are introduced, but behavior is now aligned with expected SQL aggregation semantics.