-
Notifications
You must be signed in to change notification settings - Fork 1.5k
feat(datafusion-functions-aggregate): add support for lists and other nested types in min
and max
#13991
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
min
and max
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @rluvaton -- I reviewed this code and it looks quite nice. Thank you 👌 🏆
The only thing I think is missing is "end to end" query coverage. In addition to the empty input test, I think we need tests specifically for the sliding window min/max
Can you please add such tests as .slt
tests -- it would be queries something like
SELECT min(list_col) OVER (ORDER BY other_col 1 ROW PRECEDING 1 ROW FOLLOWING) FROM t
where list_col had nulls in it
Bonus points for tests of:
struct
Min/Max
Let mek nw
acc_args.return_type, | ||
)?)) | ||
} else { | ||
Ok(Box::new(MaxAccumulator::try_new(acc_args.return_type)?)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pattern used in other accumulators is slightly different I think -- It would have a function on MaxAccumulator
that explicitly declared the types it supported and then fallback to the GenericMaxAccumulator
if not.
For example, something like
if MaxAccumulator::supports_type(&acc_args.return_type) {
Ok(Box::new(MaxAccumulator::try_new(acc_args.return_type)?))
} else {
Ok(Box::new(GenericMaxAccumulator::try_new(
acc_args.return_type,
)?))
}
This is a minor detail and not required, I just wanted to point it out
Ok(Box::new(GenericMaxAccumulator::try_new( | ||
acc_args.return_type, | ||
)?)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you wanted to make this less verbose you could probably do something like
Ok(Box::new(GenericMaxAccumulator::try_new( | |
acc_args.return_type, | |
)?)) | |
GenericMaxAccumulator::try_new( | |
acc_args.return_type, | |
).map(Box::new) |
} | ||
|
||
fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> { | ||
// TODO - should assert getting only one column? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be nice to either add the assert or remove the TODO as part of this PR
@@ -108,11 +108,15 @@ SELECT * FROM data WHERE column2 is not distinct from null; | |||
# Aggregates | |||
########### | |||
|
|||
query error Internal error: Min/Max accumulator not implemented for type List | |||
query ? | |||
SELECT min(column1) FROM data; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also add tests for empty input on min/max as well to test boundary conditions? Something like
SELECT MIN(column1) FROM data WHERE false
Which should return a null value?
|
||
// Create a null row to use for filtering out nulls from the input | ||
let null_row = { | ||
let null_array = ScalarValue::try_from(datatype)?.to_array_of_size(1)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let null_array = ScalarValue::try_from(datatype)?.to_array_of_size(1)?; | |
let null_array = new_null_array(datatype, 1); |
} | ||
|
||
#[test] | ||
fn basic_i32_list() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
statement count 0
create table t1 (a int[]) as values ([1,2,3]), ([1,2,1]);
query ?
select min(a) from t1;
----
[1, 2, 1]
statement count 0
drop table t1;
You can create test like this, it is easier to maintain than this rust test
Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look Specifically I think this PR just needs some more tests and it will be good to go |
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Hey @alamb @jayzhan211, I see that this PR is closed, is it possible to reopen this and complete this feature? |
Thanks @ologlogn ! From my perspective this test simply needed some more slt tests
What I would suggest is creating a new PR using the code in this branch as a starting point and then add more tests, get it ready for review. fyi @rluvaton |
I did not meant to close it, sorry, it is in the back of my mind, I'll finish this now, @ologlogn don't take it 😀 |
Thanks @rluvaton! It will be really helpful :) |
@rluvaton do you have an estimate of when this might be shipped? We’re currently blocked by this, so we’d be glad to handle it ourselves if that would help ease your workload. |
I'm really sorry, had crazy week with the baby, will work on it today |
Sorry, I could not find time today, you can take this but please add me as co-author for my changes: |
Work resumed in #15857 |
Which issue does this PR close?
Closes #13987.
Rationale for this change
You are now able to run min and max on lists
What changes are included in this PR?
Added
GenericMinMaxAccumulator
andGenericSlidingMinMaxAccumulator
which calculate the min and max based on thearrow-row
crateAre these changes tested?
For the min and max yes, for the moving min max - no (fuzzy testing will help here a lot to cover the large range of tests)
Are there any user-facing changes?
Yes, users are now allowed to sort lists and every type that is supported by the
arrow-row