Skip to content

feat(datafusion-functions-aggregate): add support for lists and other nested types in min and max #13991

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from

Conversation

rluvaton
Copy link
Contributor

@rluvaton rluvaton commented Jan 2, 2025

Which issue does this PR close?

Closes #13987.

Rationale for this change

You are now able to run min and max on lists

What changes are included in this PR?

Added GenericMinMaxAccumulator and GenericSlidingMinMaxAccumulator which calculate the min and max based on the arrow-row crate

Are these changes tested?

For the min and max yes, for the moving min max - no (fuzzy testing will help here a lot to cover the large range of tests)

Are there any user-facing changes?

Yes, users are now allowed to sort lists and every type that is supported by the arrow-row

@github-actions github-actions bot added the functions Changes to functions implementation label Jan 2, 2025
@rluvaton rluvaton changed the title add support for lists in min feat(datafusion-functions-aggregate): add support for lists and other nested types in min and max Jan 2, 2025
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jan 2, 2025
@rluvaton rluvaton marked this pull request as ready for review January 2, 2025 20:35
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @rluvaton -- I reviewed this code and it looks quite nice. Thank you 👌 🏆

The only thing I think is missing is "end to end" query coverage. In addition to the empty input test, I think we need tests specifically for the sliding window min/max

Can you please add such tests as .slt tests -- it would be queries something like

SELECT min(list_col) OVER (ORDER BY other_col 1 ROW PRECEDING 1 ROW FOLLOWING) FROM t

where list_col had nulls in it

Bonus points for tests of:

  1. struct Min/Max

Let mek nw

acc_args.return_type,
)?))
} else {
Ok(Box::new(MaxAccumulator::try_new(acc_args.return_type)?))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pattern used in other accumulators is slightly different I think -- It would have a function on MaxAccumulator that explicitly declared the types it supported and then fallback to the GenericMaxAccumulator if not.

For example, something like

if MaxAccumulator::supports_type(&acc_args.return_type) {
   Ok(Box::new(MaxAccumulator::try_new(acc_args.return_type)?))
} else {
   Ok(Box::new(GenericMaxAccumulator::try_new(
                acc_args.return_type,
            )?))
}

This is a minor detail and not required, I just wanted to point it out

Comment on lines +239 to +241
Ok(Box::new(GenericMaxAccumulator::try_new(
acc_args.return_type,
)?))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you wanted to make this less verbose you could probably do something like

Suggested change
Ok(Box::new(GenericMaxAccumulator::try_new(
acc_args.return_type,
)?))
GenericMaxAccumulator::try_new(
acc_args.return_type,
).map(Box::new)

}

fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
// TODO - should assert getting only one column?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be nice to either add the assert or remove the TODO as part of this PR

@@ -108,11 +108,15 @@ SELECT * FROM data WHERE column2 is not distinct from null;
# Aggregates
###########

query error Internal error: Min/Max accumulator not implemented for type List
query ?
SELECT min(column1) FROM data;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also add tests for empty input on min/max as well to test boundary conditions? Something like

SELECT MIN(column1) FROM data WHERE false

Which should return a null value?


// Create a null row to use for filtering out nulls from the input
let null_row = {
let null_array = ScalarValue::try_from(datatype)?.to_array_of_size(1)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let null_array = ScalarValue::try_from(datatype)?.to_array_of_size(1)?;
let null_array = new_null_array(datatype, 1);

}

#[test]
fn basic_i32_list() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


statement count 0
create table t1 (a int[]) as values ([1,2,3]), ([1,2,1]);

query ?
select min(a) from t1;
----
[1, 2, 1]

statement count 0
drop table t1;

You can create test like this, it is easier to maintain than this rust test

@alamb alamb marked this pull request as draft January 16, 2025 22:22
@alamb
Copy link
Contributor

alamb commented Jan 16, 2025

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

Specifically I think this PR just needs some more tests and it will be good to go

Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Mar 18, 2025
@github-actions github-actions bot closed this Mar 25, 2025
@ologlogn
Copy link

ologlogn commented Apr 9, 2025

Hey @alamb @jayzhan211, I see that this PR is closed, is it possible to reopen this and complete this feature?
Are there any concerns apart from the tests? We can also take up this code and finish missing requirements.
Issue: #15477
cc @gabotechs @LiaCastaneda

@alamb
Copy link
Contributor

alamb commented Apr 10, 2025

Hey @alamb @jayzhan211, I see that this PR is closed, is it possible to reopen this and complete this feature? Are there any concerns apart from the tests? We can also take up this code and finish missing requirements. Issue: #15477 cc @gabotechs @LiaCastaneda

Thanks @ologlogn ! From my perspective this test simply needed some more slt tests

Specifically I think this PR just needs some more tests and it will be good to go

What I would suggest is creating a new PR using the code in this branch as a starting point and then add more tests, get it ready for review.

fyi @rluvaton

@rluvaton
Copy link
Contributor Author

rluvaton commented Apr 10, 2025

I did not meant to close it, sorry, it is in the back of my mind, I'll finish this now, @ologlogn don't take it 😀

@ologlogn
Copy link

Thanks @rluvaton! It will be really helpful :)

@gabotechs
Copy link
Contributor

@rluvaton do you have an estimate of when this might be shipped? We’re currently blocked by this, so we’d be glad to handle it ourselves if that would help ease your workload.

@rluvaton
Copy link
Contributor Author

I'm really sorry, had crazy week with the baby, will work on it today

@rluvaton
Copy link
Contributor Author

Sorry, I could not find time today, you can take this but please add me as co-author for my changes:

Creating a commit with multiple authors

@gabotechs
Copy link
Contributor

Work resumed in #15857

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt) Stale PR has not had any activity for some time
Projects
None yet
Development

Successfully merging this pull request may close these issues.

min and max should support lists as well
5 participants