Parquet: read/write f16 for Arrow#5003
Conversation
Jefffrey
left a comment
I hope I've covered all the necessary areas for just reading/writing Arrow.
Won't include changes for the ColumnReader API files in the interest of scope/PR size (still WIP anyway).
```rust
distinct_count: stats.distinct_count().map(|value| value as i64),
max_value: None,
min_value: None,
is_max_value_exact: None,
```
Due to apache/parquet-format@31f92c7.
Unsure if more work is required to support this; that might require a separate issue?
Seems this might be covered by/related to #5037
tustvold
left a comment
I think before we merge this we should get a file containing this type added to parquet-testing, so that we can ensure interoperability with other implementations. Otherwise this looks reasonable to me.
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
There does seem to be a PR for parquet-testing: apache/parquet-testing#40

Just realized I need to fix the statistics handling as well; otherwise I think it might write incorrect stats for files.
Jefffrey
left a comment
PR scope extended to include all support for f16 (including in the ColumnReader API).
Also reverted formatting changes so as not to pollute the PR.
```rust
fn is_nan<T: ParquetValueType>(descr: &ColumnDescriptor, val: &T) -> bool {
    match T::PHYSICAL_TYPE {
        Type::FLOAT | Type::DOUBLE => val != val,
        Type::FIXED_LEN_BYTE_ARRAY if descr.logical_type() == Some(LogicalType::Float16) => {
            let val = val.as_bytes();
            let val = f16::from_le_bytes([val[0], val[1]]);
            val.is_nan()
        }
```
From the type T alone we can't determine whether the FixedLenByteArray represents a Float16 logical type, hence the ColumnDescriptor param to provide this information so we can subsequently check NaN for f16.
```rust
if let Some(LogicalType::Float16) = descr.logical_type() {
    let a = a.as_bytes();
    let a = f16::from_le_bytes([a[0], a[1]]);
    let b = b.as_bytes();
    let b = f16::from_le_bytes([b[0], b[1]]);
    return a > b;
}
```
Don't compare as bytes; compare as f16s.
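A quick illustration of why byte-wise comparison is wrong here (plain std, with a hand-rolled f16-to-f32 decoder that covers normal values only; the real code uses the `half` crate's `f16`):

```rust
// Why the default FIXED_LEN_BYTE_ARRAY byte ordering is wrong for Float16:
// 1.0f16 encodes as 0x3C00 and -1.0f16 as 0xBC00, so an unsigned byte-wise
// comparison of the little-endian bytes puts 1.0 *before* -1.0.
fn main() {
    let one = 0x3C00u16.to_le_bytes(); // 1.0 as f16 (little-endian)
    let neg_one = 0xBC00u16.to_le_bytes(); // -1.0 as f16

    // Lexicographic byte comparison: 1.0 sorts below -1.0, wrong for floats.
    assert!(one < neg_one);

    // Decoding to floats first gives the correct numeric order.
    assert!(f16_to_f32(one) > f16_to_f32(neg_one));
}

// Minimal f16 -> f32 decoder for normal (finite, non-subnormal) values only;
// an illustration, not a full implementation.
fn f16_to_f32(le: [u8; 2]) -> f32 {
    let bits = u16::from_le_bytes(le);
    let sign = ((bits >> 15) as u32) << 31;
    let exp = ((bits >> 10) & 0x1F) as u32; // biased by 15
    let frac = (bits & 0x3FF) as u32;
    f32::from_bits(sign | ((exp + 112) << 23) | (frac << 13)) // rebias to 127
}
```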
```rust
#[test]
fn test_column_writer_check_float16_nan_middle() {
    let input = [f16::ONE, f16::NAN, f16::ONE + f16::ONE]
        .into_iter()
        .map(|s| ByteArray::from(s).into())
        .collect::<Vec<_>>();

    let stats = float16_statistics_roundtrip(&input);
    assert!(stats.has_min_max_set());
    assert!(stats.is_min_max_backwards_compatible());
    assert_eq!(stats.min(), &ByteArray::from(f16::ONE));
    assert_eq!(stats.max(), &ByteArray::from(f16::ONE + f16::ONE));
}

#[test]
fn test_float16_statistics_nan_middle() {
    let input = [f16::ONE, f16::NAN, f16::ONE + f16::ONE]
        .into_iter()
        .map(|s| ByteArray::from(s).into())
        .collect::<Vec<_>>();

    let stats = float16_statistics_roundtrip(&input);
    assert!(stats.has_min_max_set());
    assert!(stats.is_min_max_backwards_compatible());
    assert_eq!(stats.min(), &ByteArray::from(f16::ONE));
    assert_eq!(stats.max(), &ByteArray::from(f16::ONE + f16::ONE));
}

#[test]
fn test_float16_statistics_nan_start() {
    let input = [f16::NAN, f16::ONE, f16::ONE + f16::ONE]
        .into_iter()
        .map(|s| ByteArray::from(s).into())
        .collect::<Vec<_>>();

    let stats = float16_statistics_roundtrip(&input);
    assert!(stats.has_min_max_set());
    assert!(stats.is_min_max_backwards_compatible());
    assert_eq!(stats.min(), &ByteArray::from(f16::ONE));
    assert_eq!(stats.max(), &ByteArray::from(f16::ONE + f16::ONE));
}

#[test]
fn test_float16_statistics_nan_only() {
    let input = [f16::NAN, f16::NAN]
        .into_iter()
        .map(|s| ByteArray::from(s).into())
        .collect::<Vec<_>>();

    let stats = float16_statistics_roundtrip(&input);
    assert!(!stats.has_min_max_set());
    assert!(stats.is_min_max_backwards_compatible());
}
```
Ensuring NaNs are handled correctly in statistics (i.e. ignored).
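The rule these tests pin down can be sketched independently of the writer internals (f32 stands in for f16 here, and `min_max_ignoring_nan` is a made-up helper, not the arrow-rs API):

```rust
// Sketch of the statistics rule exercised by the tests above: NaNs are
// skipped when accumulating min/max, and an all-NaN (or empty) column
// produces no min/max at all.
fn min_max_ignoring_nan(values: &[f32]) -> Option<(f32, f32)> {
    values
        .iter()
        .copied()
        .filter(|v| !v.is_nan())
        .fold(None, |acc, v| match acc {
            None => Some((v, v)),
            Some((lo, hi)) => Some((lo.min(v), hi.max(v))),
        })
}

fn main() {
    // A NaN in the middle is ignored
    assert_eq!(min_max_ignoring_nan(&[1.0, f32::NAN, 2.0]), Some((1.0, 2.0)));
    // All-NaN input yields no min/max, mirroring !stats.has_min_max_set()
    assert_eq!(min_max_ignoring_nan(&[f32::NAN, f32::NAN]), None);
}
```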
Thank you for this; in the interest of saving time I intend to review this after apache/parquet-testing#40 has merged.

@benibus Do you feel like taking a look at this PR?
benibus
left a comment
Thanks for this! I'm not familiar with this codebase and I have very limited Rust knowledge... but at a high level, this seems pretty good to me.
```diff
@@ -1170,6 +1188,7 @@ fn increment_utf8(mut data: Vec<u8>) -> Option<Vec<u8>> {
 mod tests {
     use crate::{file::properties::DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH, format::BoundaryOrder};
```
Are there tests ensuring that BoundaryOrder is deduced correctly (i.e. not like a normal fixed binary)? I'd imagine that would be handled automatically by the Float16-specific comparators included here?
Ah, I don't think I've considered that (nor truncation of statistics). Thanks for pointing it out; I'll check what changes need to be made for those.
Actually, regarding truncation: is it intended to disallow truncation of f16 stats, or should it take place anyway?
The new f16 spec doesn't mention this case anywhere, and I didn't see relevant changes in apache/arrow#36073 either, but perhaps I missed it?
e.g. if the user sets the column index truncation length to 1 byte, should we still truncate an f16 to one byte since its underlying representation is a fixed-length byte array, or leave it at 2 bytes, since truncating an f16 doesn't make sense given it doesn't follow the sort order for fixed-length byte arrays?
Probably better to ignore the truncation limit in this case, IMO.
Ok, so we'll need a special case for Float16 to ensure its stats don't get truncated.
Also, regarding BoundaryOrder: it seems it isn't being derived at all yet, for any type:
arrow-rs/parquet/src/file/metadata.rs
Lines 888 to 889 in 31b5724
I couldn't find where it might be set, so this could be a separate issue.
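A hedged sketch of what such a Float16 special case might look like (the enum and function below are stand-ins for illustration, not the actual arrow-rs types or writer code):

```rust
// Hypothetical sketch: skip statistics truncation for Float16 columns.
// `LogicalType` is a stand-in for the parquet crate's enum, and
// `truncate_stat` is an invented helper, not the real API.
#[derive(Clone, Copy)]
enum LogicalType {
    Float16,
    Other,
}

fn truncate_stat(logical: Option<LogicalType>, value: &[u8], limit: usize) -> Vec<u8> {
    match logical {
        // A truncated f16 no longer follows the column's sort order, so
        // always keep the full 2-byte value regardless of the limit.
        Some(LogicalType::Float16) => value.to_vec(),
        _ => value[..value.len().min(limit)].to_vec(),
    }
}

fn main() {
    let f16_min = [0x00u8, 0x3C]; // 1.0 as a little-endian f16
    // Even with a 1-byte truncation limit, the f16 stat stays 2 bytes.
    assert_eq!(
        truncate_stat(Some(LogicalType::Float16), &f16_min, 1),
        f16_min.to_vec()
    );
    // Other fixed-len byte arrays are truncated to the limit as usual.
    assert_eq!(truncate_stat(Some(LogicalType::Other), &f16_min, 1), vec![0x00u8]);
}
```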
tustvold
left a comment
This looks good and well tested, thank you 👍
I'm going to merge this as I think it moves things forward. Any follow-up from #5003 (comment) can safely be handled in a subsequent PR.
…arquet-testing (#38753)

### Rationale for this change
Validates compatibility between implementations when reading `Float16` columns.

### What changes are included in this PR?
- Bumps `parquet-testing` commit to latest to use the recently-added files
- Adds reader tests for C++ and Go in the same vein as apache/arrow-rs#5003

### Are these changes tested?
Yes

### Are there any user-facing changes?
No

* Closes: #38751

Authored-by: benibus <bpharks@gmx.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
Which issue does this PR close?
Closes #4986
Rationale for this change
What changes are included in this PR?
Allow reading and writing the f16 type from Parquet to/from Arrow record batches.
Also support it in the ColumnReader API.
And handle writing statistics properly.
Are there any user-facing changes?