Support array_distinct function. #8268
2010YOUY01
left a comment
Thanks for the new function support, I did some tests and have a few suggestions:
❯ select array_distinct([]);
Optimizer rule 'simplify_expressions' failed
caused by
Internal error: could not cast value to arrow_array::array::list_array::GenericListArray<i32>.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
I think this empty array case should be handled inside the implementation (and also covered in sqllogictest).
I merged #8269 so we can probably pick up the change for this PR.
fe5bd05 to
1e615b2
Compare
PTAL, @alamb @jayzhan211 @2010YOUY01, thanks.
1e615b2 to
ec3d443
Compare
Weijun-H
left a comment
Thanks 👍
5f42d9c to
86ccb87
Compare
let converter = RowConverter::new(vec![SortField::new(dt.clone())])?;
// distinct for each list in ListArray
for arr in array.iter().flatten() {
    let values = converter.convert_columns(&[arr])?;
Why not deduplicate the array in a columnar way? `arr` has only one column, and using the row format needs extra encoding and decoding.
> Why not deduplicate the array in a columnar way? `arr` has only one column, and using the row format needs extra encoding and decoding.

It would be great to deduplicate the array without the row converter, but I don't think we can do that without downcasting to the exact array type and then deduplicating. Is there a recommended way?
I don't have another idea either, other than downcasting arr. I was just wondering whether it is worth downcasting to the exact array type.
> I don't have another idea either, other than downcasting arr. I was just wondering whether it is worth downcasting to the exact array type.

Downcasting to the exact array type can result in faster code in many cases, as the Rust compiler can generate specialized implementations for each type. However, there are a lot of DataTypes, including nested ones like Dict, List, Struct, etc., so making specialized implementations often requires a lot of work.
The row converter handles all the types internally.
What we have typically done in the past with DataFusion is to use non-type-specific code like RowConverter for the general case, and then, if we find that a particular use case needs faster performance, we make special implementations. For example, we do so for grouping by single primitive columns (GROUP BY int32).
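To make the trade-off above concrete, here is a minimal std-only sketch. It does not use arrow's RowConverter. The `Value` enum, `encode`, `distinct_general`, and `distinct_i32` are hypothetical names invented for illustration. The general path encodes every value to bytes and deduplicates on the bytes, so one function covers every type; the specialized path knows the element type and skips the encoding step. Both keep the first occurrence of each value, which is an assumption of this sketch and not necessarily what the PR does.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a dynamically typed column value.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int32(i32),
    Utf8(String),
}

/// Encode a value as a tagged byte "row" (illustrative only, not arrow's row format).
fn encode(v: &Value) -> Vec<u8> {
    match v {
        Value::Int32(i) => {
            let mut out = vec![0u8]; // type tag for integers
            out.extend_from_slice(&i.to_be_bytes());
            out
        }
        Value::Utf8(s) => {
            let mut out = vec![1u8]; // type tag for strings
            out.extend_from_slice(s.as_bytes());
            out
        }
    }
}

/// General path: works for any `Value` variant, but pays for one encode per value.
fn distinct_general(values: &[Value]) -> Vec<Value> {
    let mut seen: HashSet<Vec<u8>> = HashSet::new();
    let mut out = Vec::new();
    for v in values {
        // `insert` returns true only the first time these bytes are seen.
        if seen.insert(encode(v)) {
            out.push(v.clone());
        }
    }
    out
}

/// Specialized path: the element type is known, so no encoding is needed.
fn distinct_i32(values: &[i32]) -> Vec<i32> {
    let mut seen = HashSet::new();
    values.iter().copied().filter(|v| seen.insert(*v)).collect()
}
```

The specialized function is shorter and avoids allocation per value, but one such function would be needed per element type, which is the maintenance cost described above.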
ccfed93 to
9e22960
Compare
jayzhan211
left a comment
LGTM
9e22960 to
62f11f5
Compare
implement slt & proto fix null & empty list
Co-authored-by: Alex Huang <huangweijun1001@gmail.com>
62f11f5 to
c2f5451
Compare
Since there have been no more comments for a few days, I think maybe this PR can go ahead?
Thanks @my-vegetable-has-exploded -- I'll take a look hopefully today or maybe tomorrow
let array = as_list_array(&args[0])?;
general_array_distinct(array, field)
nit: move `let array = as_list_array(&args[0])?;` into `general_array_distinct`
Sorry, I don't get your point. LargeList differs from List, so I think it may be better to handle it before the generic function?
I mean change the function signature:
general_array_distinct<OffsetSize: OffsetSizeTrait>(
    array: &ArrayRef,
    field: &FieldRef,
)
Then cast the array inside the function.
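As a rough illustration of why a function generic over the offset type helps, here is a std-only sketch. It does not use arrow's `OffsetSizeTrait` or `ArrayRef`. The `Offset` trait and `distinct_lists` are hypothetical stand-ins, and the sketch assumes a list column is represented as a flat `values` slice plus an `offsets` slice with at least one entry. The same function body then serves 32-bit offsets (List) and 64-bit offsets (LargeList).

```rust
/// Hypothetical stand-in for an offset-size trait:
/// i32 offsets model List, i64 offsets model LargeList.
trait Offset: Copy {
    fn to_usize(self) -> usize;
    fn from_usize(n: usize) -> Self;
}

impl Offset for i32 {
    fn to_usize(self) -> usize {
        self as usize
    }
    fn from_usize(n: usize) -> Self {
        n as i32
    }
}

impl Offset for i64 {
    fn to_usize(self) -> usize {
        self as usize
    }
    fn from_usize(n: usize) -> Self {
        n as i64
    }
}

/// Deduplicate every list of a flattened list column.
/// List `i` covers `values[offsets[i]..offsets[i + 1]]`.
/// Returns the new offsets and the new flat values.
fn distinct_lists<O: Offset>(offsets: &[O], values: &[i32]) -> (Vec<O>, Vec<i32>) {
    let mut new_offsets = Vec::with_capacity(offsets.len());
    let mut new_values = Vec::new();
    new_offsets.push(O::from_usize(0));
    for w in offsets.windows(2) {
        // Copy out one list, then sort and drop adjacent duplicates.
        let mut list = values[w[0].to_usize()..w[1].to_usize()].to_vec();
        list.sort();
        list.dedup();
        new_values.extend(list);
        new_offsets.push(O::from_usize(new_values.len()));
    }
    (new_offsets, new_values)
}
```

A caller picks the offset width at the call site, for example `distinct_lists::<i32>(...)` or `distinct_lists::<i64>(...)`, so the dispatch on List versus LargeList stays in one place.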
I think you are referring to something like general_array_has_dispatch
I think we can merge this PR as is and then add support for LargeList (using the OffsetSize trait) as a follow on PR
alamb
left a comment
Thank you for this contribution @my-vegetable-has-exploded and thank you @Weijun-H and @jayzhan211 for the help getting this PR ready.
I think it looks very nice and is a good example of collaboration 🦾
I took the liberty of merging up from main to make sure there are no logical conflicts. I intend to merge the PR when the tests pass.
Thanks all.
* implement distinct func implement slt & proto fix null & empty list
* add comment for slt
  Co-authored-by: Alex Huang <huangweijun1001@gmail.com>
* fix largelist
* add largelist for slt
* Use collect for rows & init capcity for offsets.
* fixup: remove useless match
* fix fmt
* fix fmt
---------
Co-authored-by: Alex Huang <huangweijun1001@gmail.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Which issue does this PR close?
Closes #7289
Rationale for this change
Use list.iter().sorted().dedup() to remove duplicates from each list in the ListArray.
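The sort-then-dedup idea can be sketched with the standard library alone. This is not the PR's code: it models a list column as nested `Option<Vec<...>>` values and not as an arrow ListArray, and `array_distinct` here is a hypothetical free function. It uses `sort` and `dedup` on a `Vec` in place of iterator adaptors. A null list stays null and an empty list stays empty, which are the two edge cases raised in the review. Null elements sort first only because Rust orders `None` before `Some(_)`; that ordering is a property of this sketch, not a claim about DataFusion's output.

```rust
/// Deduplicate each inner list of a nullable list column.
/// A `None` list stays `None`; an empty list stays empty.
fn array_distinct(lists: &[Option<Vec<Option<i32>>>]) -> Vec<Option<Vec<Option<i32>>>> {
    lists
        .iter()
        .map(|list| {
            list.as_ref().map(|items| {
                let mut items = items.clone();
                // Sorting brings equal elements next to each other
                // (`None` orders before `Some(_)` for `Option<i32>`).
                items.sort();
                // `dedup` removes consecutive duplicates, so after sorting
                // every distinct value is kept exactly once.
                items.dedup();
                items
            })
        })
        .collect()
}
```

Sorting costs O(n log n) per list and does not preserve the original element order; a hash-based pass would preserve order at the price of extra memory.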
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?