Parquet: read/write f16 for Arrow#5003
Conversation
Jefffrey
left a comment
I hope I've covered all the necessary areas for just reading/writing Arrow.
Won't include changes for the ColumnReader API files in the interest of scope/PR size (still WIP anyway).
```rust
distinct_count: stats.distinct_count().map(|value| value as i64),
max_value: None,
min_value: None,
is_max_value_exact: None,
```
Due to apache/parquet-format@31f92c7.
Unsure if more work is required to support this; that might require a separate issue?
Seems this might be covered by/related to #5037
tustvold
left a comment
I think before we merge this we should get a file containing this type added to parquet-testing, so that we can ensure interoperability with other implementations. Otherwise this looks reasonable to me.
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
There does seem to be a PR for parquet-testing: apache/parquet-testing#40

Just realized I need to fix the statistics handling as well; otherwise I think it might write incorrect stats for files.
Jefffrey
left a comment
PR scope extended to include all support for f16 (including in the ColumnReader API).
Also reverted formatting changes so as not to pollute the PR.
```rust
fn is_nan<T: ParquetValueType>(descr: &ColumnDescriptor, val: &T) -> bool {
    match T::PHYSICAL_TYPE {
        Type::FLOAT | Type::DOUBLE => val != val,
        Type::FIXED_LEN_BYTE_ARRAY if descr.logical_type() == Some(LogicalType::Float16) => {
            let val = val.as_bytes();
            let val = f16::from_le_bytes([val[0], val[1]]);
            val.is_nan()
        }
```
From the type T alone we can't determine whether the FixedLenByteArray represents a Float16 logical type, hence the ColumnDescriptor param to provide this information so we can subsequently check NaN for f16.
```rust
if let Some(LogicalType::Float16) = descr.logical_type() {
    let a = a.as_bytes();
    let a = f16::from_le_bytes([a[0], a[1]]);
    let b = b.as_bytes();
    let b = f16::from_le_bytes([b[0], b[1]]);
    return a > b;
}
```
Don't compare as bytes; compare as f16s.
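A quick illustration of why byte-wise comparison is wrong here (plain std, with a hand-rolled f16-to-f32 decoder that covers normal values only; the real code uses the `half` crate's `f16`):

```rust
// Why the default FIXED_LEN_BYTE_ARRAY byte ordering is wrong for Float16:
// 1.0f16 encodes as 0x3C00 and -1.0f16 as 0xBC00, so an unsigned byte-wise
// comparison of the little-endian bytes puts 1.0 *before* -1.0.
fn main() {
    let one = 0x3C00u16.to_le_bytes(); // 1.0 as f16 (little-endian)
    let neg_one = 0xBC00u16.to_le_bytes(); // -1.0 as f16

    // Lexicographic byte comparison: 1.0 sorts below -1.0, wrong for floats.
    assert!(one < neg_one);

    // Decoding to floats first gives the correct numeric order.
    assert!(f16_to_f32(one) > f16_to_f32(neg_one));
}

// Minimal f16 -> f32 decoder for normal (finite, non-subnormal) values only;
// an illustration, not a full implementation.
fn f16_to_f32(le: [u8; 2]) -> f32 {
    let bits = u16::from_le_bytes(le);
    let sign = ((bits >> 15) as u32) << 31;
    let exp = ((bits >> 10) & 0x1F) as u32; // biased by 15
    let frac = (bits & 0x3FF) as u32;
    f32::from_bits(sign | ((exp + 112) << 23) | (frac << 13)) // rebias to 127
}
```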
```rust
#[test]
fn test_column_writer_check_float16_nan_middle() {
    let input = [f16::ONE, f16::NAN, f16::ONE + f16::ONE]
        .into_iter()
        .map(|s| ByteArray::from(s).into())
        .collect::<Vec<_>>();

    let stats = float16_statistics_roundtrip(&input);
    assert!(stats.has_min_max_set());
    assert!(stats.is_min_max_backwards_compatible());
    assert_eq!(stats.min(), &ByteArray::from(f16::ONE));
    assert_eq!(stats.max(), &ByteArray::from(f16::ONE + f16::ONE));
}

#[test]
fn test_float16_statistics_nan_middle() {
    let input = [f16::ONE, f16::NAN, f16::ONE + f16::ONE]
        .into_iter()
        .map(|s| ByteArray::from(s).into())
        .collect::<Vec<_>>();

    let stats = float16_statistics_roundtrip(&input);
    assert!(stats.has_min_max_set());
    assert!(stats.is_min_max_backwards_compatible());
    assert_eq!(stats.min(), &ByteArray::from(f16::ONE));
    assert_eq!(stats.max(), &ByteArray::from(f16::ONE + f16::ONE));
}

#[test]
fn test_float16_statistics_nan_start() {
    let input = [f16::NAN, f16::ONE, f16::ONE + f16::ONE]
        .into_iter()
        .map(|s| ByteArray::from(s).into())
        .collect::<Vec<_>>();

    let stats = float16_statistics_roundtrip(&input);
    assert!(stats.has_min_max_set());
    assert!(stats.is_min_max_backwards_compatible());
    assert_eq!(stats.min(), &ByteArray::from(f16::ONE));
    assert_eq!(stats.max(), &ByteArray::from(f16::ONE + f16::ONE));
}

#[test]
fn test_float16_statistics_nan_only() {
    let input = [f16::NAN, f16::NAN]
        .into_iter()
        .map(|s| ByteArray::from(s).into())
        .collect::<Vec<_>>();

    let stats = float16_statistics_roundtrip(&input);
    assert!(!stats.has_min_max_set());
    assert!(stats.is_min_max_backwards_compatible());
}
```
Ensuring NaNs are handled correctly in statistics (i.e. ignored).
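The rule these tests pin down can be sketched independently of the writer internals (f32 stands in for f16 here, and `min_max_ignoring_nan` is a made-up helper, not the arrow-rs API):

```rust
// Sketch of the statistics rule exercised by the tests above: NaNs are
// skipped when accumulating min/max, and an all-NaN (or empty) column
// produces no min/max at all.
fn min_max_ignoring_nan(values: &[f32]) -> Option<(f32, f32)> {
    values
        .iter()
        .copied()
        .filter(|v| !v.is_nan())
        .fold(None, |acc, v| match acc {
            None => Some((v, v)),
            Some((lo, hi)) => Some((lo.min(v), hi.max(v))),
        })
}

fn main() {
    // A NaN in the middle is ignored
    assert_eq!(min_max_ignoring_nan(&[1.0, f32::NAN, 2.0]), Some((1.0, 2.0)));
    // All-NaN input yields no min/max, mirroring !stats.has_min_max_set()
    assert_eq!(min_max_ignoring_nan(&[f32::NAN, f32::NAN]), None);
}
```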
Thank you for this; in the interest of saving time I intend to review this after apache/parquet-testing#40 has merged.

@benibus Do you feel like taking a look at this PR?
benibus
left a comment
Thanks for this! I'm not familiar with this codebase and I have very limited Rust knowledge... but at a high level, this seems pretty good to me.
```diff
@@ -1170,6 +1188,7 @@ fn increment_utf8(mut data: Vec<u8>) -> Option<Vec<u8>> {
 mod tests {
     use crate::{file::properties::DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH, format::BoundaryOrder};
```
Are there tests ensuring that BoundaryOrder is deduced correctly (i.e. not like a normal fixed binary)? I'd imagine that would be handled automatically by the Float16-specific comparators included here?
Ah, I don't think I've considered that (nor truncation of statistics). Thanks for pointing it out; I'll check what changes need to be made for those.
Actually, regarding truncation: is it intended to disallow truncation of f16 stats, or should it take place anyway?
The new f16 spec doesn't mention this case anywhere, and I didn't see relevant changes in apache/arrow#36073 either, but perhaps I missed it?
e.g. if the user sets the column index truncation length to 1 byte, should we still truncate an f16 to one byte since its underlying representation is a fixed-length byte array, or leave it at 2 bytes, since truncating an f16 doesn't make sense given it doesn't follow the sort order for fixed-length byte arrays?
Probably better to ignore the truncation limit in this case, IMO.
Ok, so we'll need a special case for Float16 to ensure its stats don't get truncated.
Also, regarding BoundaryOrder: it seems it isn't being derived at all yet, for any type:
arrow-rs/parquet/src/file/metadata.rs
Lines 888 to 889 in 31b5724
I couldn't find where it might be set, so this could be a separate issue.
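A hedged sketch of what such a Float16 special case might look like (the enum and function below are stand-ins for illustration, not the actual arrow-rs types or writer code):

```rust
// Hypothetical sketch: skip statistics truncation for Float16 columns.
// `LogicalType` is a stand-in for the parquet crate's enum, and
// `truncate_stat` is an invented helper, not the real API.
#[derive(Clone, Copy)]
enum LogicalType {
    Float16,
    Other,
}

fn truncate_stat(logical: Option<LogicalType>, value: &[u8], limit: usize) -> Vec<u8> {
    match logical {
        // A truncated f16 no longer follows the column's sort order, so
        // always keep the full 2-byte value regardless of the limit.
        Some(LogicalType::Float16) => value.to_vec(),
        _ => value[..value.len().min(limit)].to_vec(),
    }
}

fn main() {
    let f16_min = [0x00u8, 0x3C]; // 1.0 as a little-endian f16
    // Even with a 1-byte truncation limit, the f16 stat stays 2 bytes.
    assert_eq!(
        truncate_stat(Some(LogicalType::Float16), &f16_min, 1),
        f16_min.to_vec()
    );
    // Other fixed-len byte arrays are truncated to the limit as usual.
    assert_eq!(truncate_stat(Some(LogicalType::Other), &f16_min, 1), vec![0x00u8]);
}
```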
tustvold
left a comment
This looks good and well tested, thank you 👍
I'm going to merge this as I think it moves things forward. Any follow-up from #5003 (comment) can safely be handled in a subsequent PR.
…arquet-testing (#38753)

### Rationale for this change
Validates compatibility between implementations when reading `Float16` columns.

### What changes are included in this PR?
- Bumps `parquet-testing` commit to latest to use the recently-added files
- Adds reader tests for C++ and Go in the same vein as apache/arrow-rs#5003

### Are these changes tested?
Yes

### Are there any user-facing changes?
No

* Closes: #38751

Authored-by: benibus <bpharks@gmx.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
Which issue does this PR close?
Closes #4986
Rationale for this change
What changes are included in this PR?
Allow reading and writing the f16 type from Parquet to/from Arrow record batches.
Also support it in the ColumnReader API.
And handle writing statistics properly.
Are there any user-facing changes?