Skip to content

Conversation

@dharanad
Copy link
Contributor

@dharanad dharanad commented Jul 1, 2024

Which issue does this PR close?

Closes #11184

Rationale for this change

What changes are included in this PR?

Are these changes tested?

There are existing test cases

cargo test

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Jul 1, 2024
Comment on lines 920 to 934
},
Some(DataType::FixedSizeBinary(size)) => {
Ok(Arc::new(
FixedSizeBinaryArray::from(
[<$stat_type_prefix FixedLenByteArrayDataPageStatsIterator>]::new($iterator).map(|x| {
x.into_iter().filter_map(|x| x).map(|x| {
if x.len().try_into() == Ok(*size) {
Some(x.data().to_vec())
} else {
None
}
})
}).flatten().collect::<Vec<_>>()
)
))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @efredine I'm reaching out because I'm encountering an issue, though you can help me over here.

I have tried something like this to solve this issues, but it is failing to fight with borrow checker. I have refered the docs and tried couple of other approachs but nothing worked.

error[E0515]: cannot return value referencing function parameter `x`
    --> datafusion/core/src/datasource/physical_plan/parquet/statistics.rs:927:41
     |
927  |                                         Some(x.data())
     |                                         ^^^^^-^^^^^^^^
     |                                         |    |
     |                                         |    `x` is borrowed here
     |                                         returns a value referencing data owned by the current function
...
1009 |     get_data_page_statistics!(Max, data_type, iterator)
     |     --------------------------------------------------- in this macro invocation
     |
     = note: this error originates in the macro `get_data_page_statistics` (in Nightly builds, run with -Z macro-backtrace for more info)

I was wondering if you might have any insights or suggestions that could help me navigate this borrow checker error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to work, but feels like it should be possible to simplify it:

                Some(DataType::FixedSizeBinary(size)) => {
                    Ok(Arc::new(FixedSizeBinaryArray::from(
                        [<$stat_type_prefix FixedLenByteArrayDataPageStatsIterator>]::new($iterator)
                            .flat_map(|x| x.into_iter()) // Convert Vec<Option<FixedLenByteArray>> into an iterator of Option<FixedLenByteArray>
                            .filter_map(|x| x) // Filter out None values, yielding FixedLenByteArray
                            .map(|x| x.data().to_vec()) // Convert FixedLenByteArray to Vec<u8>
                            .filter(|x| x.len().try_into() == Ok(*size)) // Ensure the byte vector has the correct size
                            .collect::<Vec<Vec<u8>>>() // Collect into Vec<Vec<u8>>
                            .iter() // Iterate over Vec<Vec<u8>>
                            .map(|x| x.as_slice()) // Convert Vec<u8> to &[u8]
                            .collect::<Vec<&[u8]>>() // Collect into Vec<&[u8]>
                    )))
                },

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So - it seems like the second collect is necessary to return a final Vec that owns the slices.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok - after some further research, the trick here is to use try_from_iter instead. So:

               Some(DataType::FixedSizeBinary(size)) => {
                    Ok(Arc::new(
                        FixedSizeBinaryArray::try_from_iter(
                            [<$stat_type_prefix FixedLenByteArrayDataPageStatsIterator>]::new($iterator).map(|x| {
                                x.into_iter().filter_map(|x| x.and_then(|x| {
                                    if x.len().try_into() == Ok(*size) {
                                        Some(x)
                                    } else {
                                        log::debug!(
                                            "FixedSizeBinary({}) statistics is a binary of size {}, ignoring it.",
                                            size,
                                            x.len(),
                                        );
                                        None
                                    }
                                }))
                            })
                            .flatten()
                        ).unwrap_or_else(|_| {
                            FixedSizeBinaryArray::new(*size, vec![].into(), None)
                        })
                    ))
                },

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, looking even further - this loop is just replicating the logic in try_from_iter. So this can be further simplified to:

               Some(DataType::FixedSizeBinary(size)) => {
                    Ok(Arc::new(
                        FixedSizeBinaryArray::try_from_iter(
                            [<$stat_type_prefix FixedLenByteArrayDataPageStatsIterator>]::new($iterator)
                            .flat_map(|x| x.into_iter())
                            .filter_map(|x| x)
                        ).unwrap_or_else(|e| {
                            log::debug!("FixedSizeBinary statistics is invalid: {}", e);
                            FixedSizeBinaryArray::new(*size, vec![].into(), None)
                        })
                    ))
                },

Which will also be more efficient because it's not performing the same check twice. And we should probably apply this change to the RowGroup statistics accumulator as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are lifesaver @efredine. Today morning i was looking into try_from_iter and you already have solution there.
Thank you so much for helping

@dharanad dharanad marked this pull request as ready for review July 3, 2024 10:11
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome -- thank you @dharanad and @efredine -- we are getting so close

@alamb alamb merged commit dc7535a into apache:main Jul 3, 2024
@dharanad dharanad deleted the fsba-parquet-data-page-statistics branch July 3, 2024 18:45
comphead pushed a commit to comphead/arrow-datafusion that referenced this pull request Jul 8, 2024
)

* failing add data page stats for fixed size binary array

* fix

* failing fix

* fix
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
)

* failing add data page stats for fixed size binary array

* fix

* failing fix

* fix
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support FixedSizedBinaryArray Parquet Data Page Statistics

3 participants