Skip to content

feat(datafusion-functions-aggregate): add support for lists and other nested types in min and max #15857

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 15 commits into from

Conversation

gabotechs
Copy link
Contributor

@gabotechs gabotechs commented Apr 25, 2025

Which issue does this PR close?

Rationale for this change

Resume the work started by @rluvaton in #13991, adding some tests and adding a small fix in the middle.

What changes are included in this PR?

Add support for performing MIN and MAX aggregations over lists, building on top of #13991 maintaining the original commit history, and only adding commits on top of it.

This PR also includes a commit for fixing ScalarValue::List comparisons that could be shipped independently:

Are these changes tested?

Yes, by new sqllogictests

Are there any user-facing changes?

Yes, users can now perform MIN and MAX operations over lists

@github-actions github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) common Related to common crate functions Changes to functions implementation labels Apr 25, 2025
@gabotechs gabotechs changed the title Support lists in min max feat(datafusion-functions-aggregate): add support for lists and other nested types in min and max Apr 25, 2025
@gabotechs gabotechs changed the title feat(datafusion-functions-aggregate): add support for lists and other nested types in min and max feat(datafusion-functions-aggregate): add support for lists and other nested types in min and max Apr 25, 2025
@gabotechs gabotechs marked this pull request as ready for review April 25, 2025 10:16
@github-actions github-actions bot removed the common Related to common crate label Apr 25, 2025
Comment on lines +729 to +730
} else if let DataType::List(field) = col_type {
extract_window_frame_target_type(field.data_type())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not 100% sure what am I doing here. This seems necessary for window functions over min/max aggregations on lists to work, and does not seem to be breaking other things, but maybe more experienced DF folks can throw some light here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be related to converting window frames when they have RANGE rather than ROWS -- so maybe we need some more tests. I'll comment below

https://github.com/apache/datafusion/blob/86db4bf3c5af9704377afa1ecc596ce7d9167f7f/datafusion/optimizer/src/analyzer/type_coercion.rs#L749-L748

@alamb alamb mentioned this pull request Apr 29, 2025
26 tasks
@alamb
Copy link
Contributor

alamb commented May 5, 2025

Is this related to

@alamb alamb mentioned this pull request May 5, 2025
23 tasks
@gabotechs
Copy link
Contributor Author

Is this related to #15667 ?

Not intentionally, this is just the original min/max over lists PR from #13991, with some extra tests added on top. There might be some overlap though.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @gabotechs -- I think just a few more tests are needed and this one will eb good to go.

FYI @rluvaton

Comment on lines +729 to +730
} else if let DataType::List(field) = col_type {
extract_window_frame_target_type(field.data_type())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be related to converting window frames when they have RANGE rather than ROWS -- so maybe we need some more tests. I'll comment below

https://github.com/apache/datafusion/blob/86db4bf3c5af9704377afa1ecc596ce7d9167f7f/datafusion/optimizer/src/analyzer/type_coercion.rs#L749-L748

Comment on lines +99 to +100
/// Accumulator for min/max of lists
/// this accumulator is using arrow-row as the internal representation and a way for comparing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Accumulator for min/max of lists
/// this accumulator is using arrow-row as the internal representation and a way for comparing
/// Accumulator for min/max of lists
///
/// This accumulator uses arrow-row as the internal representation and a way
/// for comparing rows.

fn value(&self) -> Option<&OwnedRow>;
}

#[derive(Debug, Default)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#[derive(Debug, Default)]
/// Similarly to [`GenericMinMaxAccumulator`], implements sliding min/max for Lists
/// using the Row converter
#[derive(Debug, Default)]

@alamb
Copy link
Contributor

alamb commented May 6, 2025

Is this related to #15667 ?

Not intentionally, this is just the original min/max over lists PR from #13991, with some extra tests added on top. There might be some overlap though.

I think the overlap is that the approach in this PR (to use Row format) could have also been used in #15667.

As a follow on / if someone cares about performance, we could probably implement a much more efficient (though complex0 list comparison routine that compares element by element rather than copying the entire list element into a Row format

@gabotechs
Copy link
Contributor Author

After reviewing #15667, I think it makes more sense to reuse the work done there rather than what this PR proposes. Thanks @alamb for pointing that out!

I'll open another PR soon

@gabotechs
Copy link
Contributor Author

gabotechs commented May 12, 2025

Here's the new PR:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
functions Changes to functions implementation optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

min and max should support lists as well
3 participants