Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
See discussion:
#5003 added support for the float16 type, which is a logical type on top of the fixed len byte array physical type.
When writing statistics, min/max values can be truncated for binary physical types:
arrow-rs/parquet/src/column/writer/mod.rs, lines 634 to 643 in 7ba36b0:

    // We only truncate if the data is represented as binary
    match self.descr.physical_type() {
        Type::BYTE_ARRAY | Type::FIXED_LEN_BYTE_ARRAY => {
            self.column_index_builder.append(
                null_page,
                self.truncate_min_value(stat.min_bytes()),
                self.truncate_max_value(stat.max_bytes()),
                self.page_metrics.num_page_nulls as i64,
            );
        }
This can be a problem for the f16 type. If the column_index_truncate_length config is set to 1, a truncated f16 no longer represents the min and max correctly, because f16 is ordered as a floating point number and not byte by byte like a plain fixed len byte array.
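To make the mismatch concrete, here is a minimal std-only sketch. It assumes the Parquet layout for float16 (a 2-byte little-endian fixed len byte array) and models truncation as simply keeping the first N bytes. The helpers `f16_bits_to_f32`, `decode_f16_le`, `truncate_prefix` and `bytewise_cmp` are hypothetical names for illustration, not arrow-rs APIs, and the real truncate_min_value / truncate_max_value may behave differently in detail.

```rust
use std::cmp::Ordering;

/// Decode IEEE 754 half-precision bits into an f32 without external crates.
fn f16_bits_to_f32(bits: u16) -> f32 {
    let sign = if bits & 0x8000 != 0 { -1.0f32 } else { 1.0f32 };
    let exp = ((bits >> 10) & 0x1F) as i32;
    let frac = (bits & 0x03FF) as f32;
    let magnitude = match exp {
        // Subnormal numbers: no implicit leading 1.
        0 => frac * 2f32.powi(-24),
        // All-ones exponent: infinity or NaN.
        0x1F => {
            if frac == 0.0 {
                f32::INFINITY
            } else {
                f32::NAN
            }
        }
        // Normal numbers: implicit leading 1, exponent bias of 15.
        _ => (1.0 + frac / 1024.0) * 2f32.powi(exp - 15),
    };
    sign * magnitude
}

/// Interpret a 2-byte little-endian value as an f16 (assumed Parquet layout).
fn decode_f16_le(bytes: [u8; 2]) -> f32 {
    f16_bits_to_f32(u16::from_le_bytes(bytes))
}

/// Hypothetical stand-in for statistics truncation: keep the first `len` bytes.
fn truncate_prefix(bytes: &[u8], len: usize) -> Vec<u8> {
    bytes[..len.min(bytes.len())].to_vec()
}

/// Unsigned lexicographic comparison, i.e. the ordering of raw binary values.
fn bytewise_cmp(a: &[u8], b: &[u8]) -> Ordering {
    a.cmp(b)
}
```

With these helpers, the bytes [0x00, 0x3E] decode to 1.5 and [0x01, 0x3C] decode to 1.0009765625, yet `bytewise_cmp` orders the first before the second. Truncating [0x00, 0x3C] (1.0) and [0x00, 0x40] (2.0) to one byte gives [0x00] in both cases, which is not a valid 2-byte f16 and cannot serve as a bound for either value.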
Describe the solution you'd like
Ignore truncation for f16 when writing min/max statistics
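One possible shape for that check, as a self-contained sketch only. The enums and the `may_truncate_statistics` function below are hypothetical simplifications invented for illustration; they are not the arrow-rs types, and the actual writer would consult its own column descriptor.

```rust
/// Hypothetical, simplified physical types (not the arrow-rs `Type` enum).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum PhysicalType {
    Int32,
    ByteArray,
    FixedLenByteArray,
}

/// Hypothetical, simplified logical types.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum LogicalType {
    Float16,
    String,
    Decimal,
}

/// Returns true if min/max statistics for this column may be truncated.
/// Truncation is only considered for binary physical types, and is skipped
/// for Float16 because its sort order is not bytewise.
fn may_truncate_statistics(physical: PhysicalType, logical: Option<LogicalType>) -> bool {
    let is_binary = matches!(
        physical,
        PhysicalType::ByteArray | PhysicalType::FixedLenByteArray
    );
    is_binary && logical != Some(LogicalType::Float16)
}
```

In this sketch a fixed len byte array column with the Float16 logical type is never truncated, while plain binary and string columns still are. Decimal is left truncatable here only because the issue leaves that question open.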
Describe alternatives you've considered
Additional context
Do we need to worry about this for other logical types based on binary physical types, e.g. decimal?