Skip to content

Conversation

blaginin
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the functions Changes to functions implementation label Apr 27, 2025
@github-actions github-actions bot added the core Core DataFusion crate label May 6, 2025
@github-actions github-actions bot removed the core Core DataFusion crate label May 8, 2025
@blaginin
Copy link
Contributor Author

blaginin commented May 8, 2025

group                                              main                                   pr
-----                                              ----                                   --
count low cardinality dict 20% nulls, no filter    50.55    12.2±0.39ms        ? ?/sec    1.00  240.9±327.44µs        ? ?/sec

🚀

@blaginin blaginin self-assigned this May 8, 2025
}

#[test]
fn test_nested_dictionary() -> Result<()> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worried about edge cases like dict of dicts, dict of lists, etc., but couldn't come up with anything that breaks the function. Happy to be challenged 🙏

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend adding something to the aggregate fuzzer:

// regarding copyright ownership. The ASF licenses this file

If you add coverage for adding Dictionary arrays and verify COUNT(DISTINCT ..) that would generate some good results

Copy link
Contributor Author

@blaginin blaginin Jun 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was a great suggestion - it helped catch a related bug: #16228

Testing itself is in #16232

@blaginin blaginin requested a review from Dandandan May 8, 2025 22:50
@blaginin blaginin marked this pull request as ready for review May 13, 2025 10:18
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @blaginin -- I think this looks good overall. Is there any way you can add soem slt tests as well

Perhaps following the model in https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/dictionary.slt ?

.map(|dict| {
downcast_dictionary_array! {
dict => {
let buff: BooleanArray = dict.occupancy().into();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL: occupancy

}
}
}
fn get_primitive_type_accumulator(data_type: &DataType) -> Box<dyn Accumulator> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: technically this does much more than primitive types

For example Utf8 is not a primitive types

Maybe it could be more precisely be called get_count_accumulator


fn merge_batch(&mut self, states: &[ArrayRef]) -> datafusion_common::Result<()> {
self.inner.merge_batch(states)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we really want to juice performance, we would also implement a GroupsAccumulator for Dictionary as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, for sure - I think we can do it on top of this one?

@blaginin
Copy link
Contributor Author

thank you for the review! i really like your fuzzy testing idea - will push soon (and respond to the new comments)

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jun 2, 2025
@blaginin
Copy link
Contributor Author

blaginin commented Jun 2, 2025

Is there any way you can add soem slt tests as well

Sure! Found the exiting test and extended it

## Multiple distinct aggregates and dictionaries
statement ok
create table dict_test as values (1, arrow_cast('foo', 'Dictionary(Int32, Utf8)')), (2, arrow_cast('bar', 'Dictionary(Int32, Utf8)')), (1, arrow_cast('bar', 'Dictionary(Int32, Utf8)'));
query IT
select * from dict_test;
----
1 foo
2 bar
1 bar

@blaginin blaginin merged commit 5e307b3 into apache:main Jun 5, 2025
27 checks passed
kosiew pushed a commit to kosiew/datafusion that referenced this pull request Jun 9, 2025
* Handle dicts for distinct count

* Fix sqllogictests

* Add bench

* Fix no fix the bench

* Do not panic if error type is bad

* Add full bench query

* Set the bench

* Add dict of dict test

* Fix tests

* Rename method

* Increase the grouping test

* Increase the grouping test a bit more :)

* Fix flakiness

---------

Co-authored-by: Dmitrii Blaginin <blaginin@bmac.local>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve performance of COUNT (distinct x) for dictionary columns
2 participants