Parquet: don't truncate min/max statistics for float16 and decimal when writing file #5075

@Jefffrey

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

See discussion:

#5003 (comment)

#5003 added support for the float16 type, which is a logical type on top of the fixed-length byte array physical type.

When writing statistics, truncation can occur for binary physical types:

// We only truncate if the data is represented as binary
match self.descr.physical_type() {
    Type::BYTE_ARRAY | Type::FIXED_LEN_BYTE_ARRAY => {
        self.column_index_builder.append(
            null_page,
            self.truncate_min_value(stat.min_bytes()),
            self.truncate_max_value(stat.max_bytes()),
            self.page_metrics.num_page_nulls as i64,
        );
    }

This can be a problem for the f16 type. If the column_index_truncate_length config is set to 1, a truncated f16 no longer represents the min or max correctly, because f16 has a different sort order from a plain fixed-length byte array.
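To make the concern concrete, here is a minimal standalone sketch. It does not use the parquet crate; `f16_le_to_f32` and `truncate_prefix` are hypothetical helpers written for illustration, and the sketch assumes Float16 values are stored as 2-byte little-endian IEEE 754 half-precision values.

```rust
/// Decode an IEEE 754 half-precision value from its two little-endian bytes.
/// Hypothetical helper for illustration; not part of the parquet crate.
fn f16_le_to_f32(bytes: [u8; 2]) -> f32 {
    let bits = u16::from_le_bytes(bytes);
    let sign = if bits & 0x8000 != 0 { -1.0f32 } else { 1.0f32 };
    let exp = ((bits >> 10) & 0x1f) as i32;
    let frac = (bits & 0x03ff) as f32;
    match exp {
        // Subnormal: no implicit leading 1.
        0 => sign * frac * 2f32.powi(-24),
        // All-ones exponent: infinity or NaN.
        0x1f => {
            if frac == 0.0 {
                sign * f32::INFINITY
            } else {
                f32::NAN
            }
        }
        // Normal number.
        _ => sign * (1.0 + frac / 1024.0) * 2f32.powi(exp - 15),
    }
}

/// Simplified stand-in for prefix truncation: keep the first `len` bytes.
/// Hypothetical; the real writer's truncation logic may differ.
fn truncate_prefix(bytes: &[u8], len: usize) -> Vec<u8> {
    bytes[..len.min(bytes.len())].to_vec()
}

fn main() {
    let a = 0x3C01u16.to_le_bytes(); // slightly above 1.0, stored as [0x01, 0x3C]
    let b = 0x4000u16.to_le_bytes(); // 2.0, stored as [0x00, 0x40]

    // Numerically, a < b ...
    assert!(f16_le_to_f32(a) < f16_le_to_f32(b));
    // ... but byte-wise (lexicographic) comparison says a > b.
    assert!(a.as_slice() > b.as_slice());

    // Truncating to one byte leaves only the low-order mantissa byte,
    // which is no longer a decodable 2-byte f16 at all.
    let truncated = truncate_prefix(&a, 1);
    assert_eq!(truncated.len(), 1);
    println!("a = {}, b = {}, truncated a = {:?}", f16_le_to_f32(a), f16_le_to_f32(b), truncated);
}
```

The point of the sketch is only that a byte prefix of an f16 is neither a valid f16 nor a bound in the f16 sort order.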

Describe the solution you'd like

Skip truncation for f16 when writing min/max statistics.
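One possible shape for this, as a rough sketch only: gate truncation on the logical type as well as the physical type. The enums and the functions `can_truncate` and `stat_for_index` below are hypothetical stand-ins, not the parquet crate's actual types or API. Including decimal in the exclusion is an assumption taken from the issue title.

```rust
// Hypothetical, simplified stand-ins for the crate's type enums.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PhysicalType {
    Int32,
    ByteArray,
    FixedLenByteArray,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum LogicalType {
    String,
    Float16,
    Decimal,
}

/// Truncation is only considered for binary physical types whose sort
/// order is plain byte-wise comparison. Float16 and (by assumption)
/// Decimal are excluded.
fn can_truncate(physical: PhysicalType, logical: Option<LogicalType>) -> bool {
    let is_binary = matches!(
        physical,
        PhysicalType::ByteArray | PhysicalType::FixedLenByteArray
    );
    let bytewise_order = !matches!(
        logical,
        Some(LogicalType::Float16) | Some(LogicalType::Decimal)
    );
    is_binary && bytewise_order
}

/// Return the bytes to store as a min statistic. This models prefix
/// truncation only; a max bound would also need its last byte incremented.
fn stat_for_index(
    value: &[u8],
    physical: PhysicalType,
    logical: Option<LogicalType>,
    truncate_len: Option<usize>,
) -> Vec<u8> {
    match truncate_len {
        Some(len) if can_truncate(physical, logical) && value.len() > len => {
            value[..len].to_vec()
        }
        _ => value.to_vec(),
    }
}

fn main() {
    let f16_bytes = [0x00u8, 0x3C]; // 1.0 as little-endian f16
    let kept = stat_for_index(
        &f16_bytes,
        PhysicalType::FixedLenByteArray,
        Some(LogicalType::Float16),
        Some(1),
    );
    assert_eq!(kept, f16_bytes.to_vec()); // left untruncated

    let s = stat_for_index(
        b"parquet",
        PhysicalType::ByteArray,
        Some(LogicalType::String),
        Some(3),
    );
    assert_eq!(s, b"par".to_vec()); // plain binary data is still truncated

    assert!(!can_truncate(PhysicalType::Int32, None));
}
```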

Describe alternatives you've considered

Additional context

Do we need to worry about this for other logical types built on binary physical types, e.g. decimal?
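For decimal, a similar sketch suggests the answer is probably yes. This assumes binary-backed decimals store the unscaled value as big-endian two's complement; `be_twos_complement_to_i64` is a hypothetical helper, not part of the parquet crate.

```rust
/// Interpret up to 8 big-endian two's-complement bytes as a signed integer
/// (the unscaled decimal value). Hypothetical helper for illustration.
fn be_twos_complement_to_i64(bytes: &[u8]) -> i64 {
    assert!(bytes.len() <= 8);
    // Start from all-ones if the sign bit is set, so the result is sign-extended.
    let negative = bytes.first().map_or(false, |b| b & 0x80 != 0);
    let mut value: i64 = if negative { -1 } else { 0 };
    for &b in bytes {
        value = (value << 8) | i64::from(b);
    }
    value
}

fn main() {
    // Unscaled value 256 as a 2-byte decimal: [0x01, 0x00].
    let max = 256i16.to_be_bytes();
    let truncated = &max[..1]; // [0x01]
    assert_eq!(be_twos_complement_to_i64(&max), 256);
    // Read back as a decimal, the truncated prefix means 1, and even
    // bumping its last byte to 0x02 gives 2: not an upper bound for 256.
    assert_eq!(be_twos_complement_to_i64(truncated), 1);
    assert_eq!(be_twos_complement_to_i64(&[0x02]), 2);

    // Signed order also disagrees with unsigned byte-wise order.
    let neg = (-1i16).to_be_bytes(); // [0xFF, 0xFF]
    let pos = 1i16.to_be_bytes(); // [0x00, 0x01]
    assert!(neg.as_slice() > pos.as_slice());
    assert!(be_twos_complement_to_i64(&neg) < be_twos_complement_to_i64(&pos));
}
```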

Metadata

Labels: enhancement, parquet
