-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Handle dicts for distinct count #15871
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🚀 |
} | ||
|
||
#[test] | ||
fn test_nested_dictionary() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worried about edge cases like dict of dicts, dict of lists, etc., but couldn't come up with anything that breaks the function. Happy to be challenged 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend adding something to the aggregate fuzzer:
// regarding copyright ownership. The ASF licenses this file |
If you add coverage for adding Dictionary arrays and verify COUNT(DISTINCT ..)
that would generate some good results
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @blaginin -- I think this looks good overall. Is there any way you can add soem slt
tests as well
Perhaps following the model in https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/dictionary.slt ?
.map(|dict| { | ||
downcast_dictionary_array! { | ||
dict => { | ||
let buff: BooleanArray = dict.occupancy().into(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL: occupancy
} | ||
} | ||
} | ||
fn get_primitive_type_accumulator(data_type: &DataType) -> Box<dyn Accumulator> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: technically this does much more than primitive types
For example Utf8
is not a primitive types
Maybe it could be more precisely be called get_count_accumulator
|
||
fn merge_batch(&mut self, states: &[ArrayRef]) -> datafusion_common::Result<()> { | ||
self.inner.merge_batch(states) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we really want to juice performance, we would also implement a GroupsAccumulator
for Dictionary as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, for sure - I think we can do it on top of this one?
thank you for the review! i really like your fuzzy testing idea - will push soon (and respond to the new comments) |
Sure! Found the exiting test and extended it datafusion/datafusion/sqllogictest/test_files/aggregate.slt Lines 5027 to 5036 in 8ed4259
|
* Handle dicts for distinct count * Fix sqllogictests * Add bench * Fix no fix the bench * Do not panic if error type is bad * Add full bench query * Set the bench * Add dict of dict test * Fix tests * Rename method * Increase the grouping test * Increase the grouping test a bit more :) * Fix flakiness --------- Co-authored-by: Dmitrii Blaginin <blaginin@bmac.local>
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?