-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Support array_distinct function. #8268
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
430caa2
2ffc9f8
c777184
009d6de
7f0152e
c2f5451
ac33215
18a6e84
9d8bc00
c5960dd
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -31,8 +31,8 @@ use arrow_buffer::NullBuffer; | |
|
||
use arrow_schema::{FieldRef, SortOptions}; | ||
use datafusion_common::cast::{ | ||
as_generic_list_array, as_generic_string_array, as_int64_array, as_list_array, | ||
as_null_array, as_string_array, | ||
as_generic_list_array, as_generic_string_array, as_int64_array, as_large_list_array, | ||
as_list_array, as_null_array, as_string_array, | ||
}; | ||
use datafusion_common::utils::{array_into_list_array, list_ndims}; | ||
use datafusion_common::{ | ||
|
@@ -2111,6 +2111,66 @@ pub fn array_intersect(args: &[ArrayRef]) -> Result<ArrayRef> { | |
} | ||
} | ||
|
||
pub fn general_array_distinct<OffsetSize: OffsetSizeTrait>( | ||
array: &GenericListArray<OffsetSize>, | ||
field: &FieldRef, | ||
) -> Result<ArrayRef> { | ||
let dt = array.value_type(); | ||
let mut offsets = Vec::with_capacity(array.len()); | ||
offsets.push(OffsetSize::usize_as(0)); | ||
let mut new_arrays = Vec::with_capacity(array.len()); | ||
let converter = RowConverter::new(vec![SortField::new(dt.clone())])?; | ||
// distinct for each list in ListArray | ||
for arr in array.iter().flatten() { | ||
let values = converter.convert_columns(&[arr])?; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why not distinct array in columnar way, There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
It is great to distinct array without row converter, but I don't think we can do that without downcast to exact arr then do the distinction. Is there any recommended way? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I also don't have another idea other than downcast arr, I was just wondering if it is worth to downcast to exact arr. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Downcasting to the exact array type can result in faster code in many cases, as the rust compiler can make specialized implemenations for each type. However, there are a lot of DataTypes, including nested ones like Dict, List, Struct, etc so making specialized implementations often requires a lot of work The row converter handles all the types internally. What we have typically done in the past with DataFusion is to use non type specific code like |
||
// sort elements in list and remove duplicates | ||
let rows = values.iter().sorted().dedup().collect::<Vec<_>>(); | ||
let last_offset: OffsetSize = offsets.last().copied().unwrap(); | ||
offsets.push(last_offset + OffsetSize::usize_as(rows.len())); | ||
let arrays = converter.convert_rows(rows)?; | ||
let array = match arrays.get(0) { | ||
Some(array) => array.clone(), | ||
None => { | ||
return internal_err!("array_distinct: failed to get array from rows") | ||
} | ||
}; | ||
new_arrays.push(array); | ||
} | ||
let offsets = OffsetBuffer::new(offsets.into()); | ||
let new_arrays_ref = new_arrays.iter().map(|v| v.as_ref()).collect::<Vec<_>>(); | ||
let values = compute::concat(&new_arrays_ref)?; | ||
Ok(Arc::new(GenericListArray::<OffsetSize>::try_new( | ||
field.clone(), | ||
offsets, | ||
values, | ||
None, | ||
)?)) | ||
} | ||
|
||
/// array_distinct SQL function | ||
/// example: from list [1, 3, 2, 3, 1, 2, 4] to [1, 2, 3, 4] | ||
pub fn array_distinct(args: &[ArrayRef]) -> Result<ArrayRef> { | ||
assert_eq!(args.len(), 1); | ||
|
||
// handle null | ||
if args[0].data_type() == &DataType::Null { | ||
return Ok(args[0].clone()); | ||
} | ||
|
||
// handle for list & largelist | ||
match args[0].data_type() { | ||
DataType::List(field) => { | ||
let array = as_list_array(&args[0])?; | ||
general_array_distinct(array, field) | ||
Comment on lines
+2163
to
+2164
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: put There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry, I don't get your point. Iargelist differs with list, so I think it maybe better to handle it before generic function? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I mean change the function signature: general_array_distinct<OffsetSize: OffsetSizeTrait>(
array: &ArrayRef,
field: &FieldRef,
) Then cast array in the fucntion There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think you are referring to something like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we can merge this PR as is and then add support for |
||
} | ||
DataType::LargeList(field) => { | ||
let array = as_large_list_array(&args[0])?; | ||
general_array_distinct(array, field) | ||
} | ||
_ => internal_err!("array_distinct only support list array"), | ||
} | ||
} | ||
|
||
#[cfg(test)] | ||
mod tests { | ||
use super::*; | ||
|
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Uh oh!
There was an error while loading. Please reload this page.