-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Support FixedSizedBinaryArray Parquet Data Page Statistics #11200
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| }, | ||
| Some(DataType::FixedSizeBinary(size)) => { | ||
| Ok(Arc::new( | ||
| FixedSizeBinaryArray::from( | ||
| [<$stat_type_prefix FixedLenByteArrayDataPageStatsIterator>]::new($iterator).map(|x| { | ||
| x.into_iter().filter_map(|x| x).map(|x| { | ||
| if x.len().try_into() == Ok(*size) { | ||
| Some(x.data().to_vec()) | ||
| } else { | ||
| None | ||
| } | ||
| }) | ||
| }).flatten().collect::<Vec<_>>() | ||
| ) | ||
| )) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @efredine I'm reaching out because I'm encountering an issue, though you can help me over here.
I have tried something like this to solve this issues, but it is failing to fight with borrow checker. I have refered the docs and tried couple of other approachs but nothing worked.
error[E0515]: cannot return value referencing function parameter `x`
--> datafusion/core/src/datasource/physical_plan/parquet/statistics.rs:927:41
|
927 | Some(x.data())
| ^^^^^-^^^^^^^^
| | |
| | `x` is borrowed here
| returns a value referencing data owned by the current function
...
1009 | get_data_page_statistics!(Max, data_type, iterator)
| --------------------------------------------------- in this macro invocation
|
= note: this error originates in the macro `get_data_page_statistics` (in Nightly builds, run with -Z macro-backtrace for more info)
I was wondering if you might have any insights or suggestions that could help me navigate this borrow checker error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to work, but feels like it should be possible to simplify it:
Some(DataType::FixedSizeBinary(size)) => {
Ok(Arc::new(FixedSizeBinaryArray::from(
[<$stat_type_prefix FixedLenByteArrayDataPageStatsIterator>]::new($iterator)
.flat_map(|x| x.into_iter()) // Convert Vec<Option<FixedLenByteArray>> into an iterator of Option<FixedLenByteArray>
.filter_map(|x| x) // Filter out None values, yielding FixedLenByteArray
.map(|x| x.data().to_vec()) // Convert FixedLenByteArray to Vec<u8>
.filter(|x| x.len().try_into() == Ok(*size)) // Ensure the byte vector has the correct size
.collect::<Vec<Vec<u8>>>() // Collect into Vec<Vec<u8>>
.iter() // Iterate over Vec<Vec<u8>>
.map(|x| x.as_slice()) // Convert Vec<u8> to &[u8]
.collect::<Vec<&[u8]>>() // Collect into Vec<&[u8]>
)))
},There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So - it seems like the second collect is necessary to return a final Vec that owns the slices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok - after some further research, the trick here is to use try_from_iter instead. So:
Some(DataType::FixedSizeBinary(size)) => {
Ok(Arc::new(
FixedSizeBinaryArray::try_from_iter(
[<$stat_type_prefix FixedLenByteArrayDataPageStatsIterator>]::new($iterator).map(|x| {
x.into_iter().filter_map(|x| x.and_then(|x| {
if x.len().try_into() == Ok(*size) {
Some(x)
} else {
log::debug!(
"FixedSizeBinary({}) statistics is a binary of size {}, ignoring it.",
size,
x.len(),
);
None
}
}))
})
.flatten()
).unwrap_or_else(|_| {
FixedSizeBinaryArray::new(*size, vec![].into(), None)
})
))
},There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, looking even further - this loop is just replicating the logic in try_from_iter. So this can be further simplified to:
Some(DataType::FixedSizeBinary(size)) => {
Ok(Arc::new(
FixedSizeBinaryArray::try_from_iter(
[<$stat_type_prefix FixedLenByteArrayDataPageStatsIterator>]::new($iterator)
.flat_map(|x| x.into_iter())
.filter_map(|x| x)
).unwrap_or_else(|e| {
log::debug!("FixedSizeBinary statistics is invalid: {}", e);
FixedSizeBinaryArray::new(*size, vec![].into(), None)
})
))
},Which will also be more efficient because it's not performing the same check twice. And we should probably apply this change to the RowGroup statistics accumulator as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are lifesaver @efredine. Today morning i was looking into try_from_iter and you already have solution there.
Thank you so much for helping
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which issue does this PR close?
Closes #11184
Rationale for this change
What changes are included in this PR?
Are these changes tested?
There are existing test cases
Are there any user-facing changes?