
Handling Unsupported Arrow Types in Parquet #1666

@tustvold

Description


Problem

Data in parquet is stored as one of a limited number of physical types. There are then three mechanisms that an arrow reader can use to infer the type of the data:

  1. The deprecated ConvertedType enumeration stored within the parquet schema
  2. The LogicalType enumeration stored within the parquet schema
  3. An embedded arrow schema stored within the parquet file metadata

All parquet readers support 1, all v2 readers support 2, and only arrow readers support 3.

In some cases the logical type is not necessary to correctly interpret the data, e.g. String vs LargeString, but in others it fundamentally alters the meaning of the data.
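For concreteness, here is a minimal sketch of how mechanism 3 comes into play on read, assuming a recent version of the `parquet` crate and that `parquet_to_arrow_schema` keeps its current signature (the helper `inferred_schema` is hypothetical, not part of the crate):

```rust
use std::fs::File;

use arrow::datatypes::Schema;
use parquet::arrow::parquet_to_arrow_schema;
use parquet::file::reader::{FileReader, SerializedFileReader};

/// Hypothetical helper: recover the arrow schema a reader would infer for `path`.
fn inferred_schema(path: &str) -> Result<Schema, Box<dyn std::error::Error>> {
    let reader = SerializedFileReader::new(File::open(path)?)?;
    let meta = reader.metadata().file_metadata();
    // If the key-value metadata carries an embedded arrow schema (the
    // "ARROW:SCHEMA" key), it informs the result; otherwise the arrow types
    // are inferred from the parquet converted/logical types alone.
    let schema = parquet_to_arrow_schema(meta.schema_descr(), meta.key_value_metadata())?;
    Ok(schema)
}
```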

Timestamp

A nanosecond time unit is not supported in 1, and a second time unit is only natively supported in 3.

Currently this crate encodes second time units with a logical type of milliseconds, which is likely a bug. It also does not convert nanoseconds to microseconds when using writer version 1, despite nanoseconds only being supported in format versions 2.6 and later.

The python implementation will, depending on the writer version, potentially cast nanoseconds to microseconds, and seconds to milliseconds - see here
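A caller of this crate can perform the same coercion explicitly today. A minimal sketch using arrow's `cast` kernel (the helper name `coerce_to_micros` is purely illustrative):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, TimestampNanosecondArray};
use arrow::compute::cast;
use arrow::datatypes::{DataType, TimeUnit};

/// Illustrative helper: cast a nanosecond timestamp column down to microseconds.
fn coerce_to_micros(col: ArrayRef) -> arrow::error::Result<ArrayRef> {
    // Truncates the nanosecond component; a lossy cast the caller opts into.
    cast(&col, &DataType::Timestamp(TimeUnit::Microsecond, None))
}

fn main() -> arrow::error::Result<()> {
    let nanos: ArrayRef = Arc::new(TimestampNanosecondArray::from(vec![1_000_000_001i64]));
    let micros = coerce_to_micros(nanos)?;
    assert_eq!(
        micros.data_type(),
        &DataType::Timestamp(TimeUnit::Microsecond, None)
    );
    Ok(())
}
```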

There does not appear to be a way to round-trip timezones in 1 or 2. The C++ implementation appears to always normalize timezones to UTC and set is_adjusted_to_utc to true.

Currently this crate sets is_adjusted_to_utc to true if the timezone is set, despite the writer not actually performing the normalisation. I think this is a bug.

Date64

The arrow type Date64 is milliseconds since epoch and does not have an equivalent ConvertedType nor LogicalType.

Currently this crate converts the type to Date32 on write, losing sub-day precision. This is what the C++ implementation does - see here.
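Purely to illustrate where the loss occurs (the constants and values below are examples, not taken from this crate): a Date64 value is milliseconds since epoch, a Date32 value is days since epoch, so the conversion discards everything below the day boundary.

```rust
// Illustrative arithmetic only.
const MILLIS_PER_DAY: i64 = 24 * 60 * 60 * 1000;

fn main() {
    let date64_ms: i64 = 86_400_000 + 123; // one day plus 123 ms past midnight
    let date32_days = (date64_ms / MILLIS_PER_DAY) as i32;
    let lost_ms = date64_ms - (date32_days as i64) * MILLIS_PER_DAY;
    assert_eq!(date32_days, 1);
    assert_eq!(lost_ms, 123); // the sub-day component cannot be recovered
}
```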

Interval

The interval data type has a ConvertedType but not a LogicalType. There is a PR to add LogicalType support, apache/parquet-format#165, but it appears to have stalled somewhat.

An interval of MonthDayNano cannot be represented by the ConvertedType. The C++ implementation does not appear to support Interval types.
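For reference, the INTERVAL ConvertedType annotates a 12-byte FIXED_LEN_BYTE_ARRAY holding three little-endian unsigned 32-bit integers: months, days and milliseconds. A rough sketch (the helper below is hypothetical) of why an arrow MonthDayNano interval, which carries an i64 nanosecond component, has no lossless mapping onto that layout:

```rust
// Illustrative only: pack (months, days, milliseconds) into the 12-byte
// parquet INTERVAL representation.
fn encode_parquet_interval(months: u32, days: u32, millis: u32) -> [u8; 12] {
    let mut buf = [0u8; 12];
    buf[0..4].copy_from_slice(&months.to_le_bytes());
    buf[4..8].copy_from_slice(&days.to_le_bytes());
    buf[8..12].copy_from_slice(&millis.to_le_bytes());
    buf
}

fn main() {
    // A MonthDayNano value of (1 month, 2 days, 1_500_000 ns) would have to be
    // rounded to whole milliseconds (and fit in u32) to use this encoding.
    let bytes = encode_parquet_interval(1, 2, 1);
    assert_eq!(bytes.len(), 12);
}
```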

Proposal

There are, broadly speaking, three ways to handle data types that cannot be represented in the parquet schema:

  • Return an error
  • Cast to a native parquet representation, potentially losing precision in the process
  • Encode the data, only encoding its logical type in the embedded arrow schema

Returning an error is the safest, but is not a great UX. Casting to a native parquet representation is in line with what other arrow implementations do and gives the best ecosystem compatibility, but also doesn't make for a great UX; just search StackOverflow for allow_truncated_timestamps to see the confusion this causes.

The final option is the simplest to implement, the least surprising to users, and what I would propose implementing. It would break ecosystem interoperability in certain cases, but I think it is more important that we faithfully round-trip data than maintain maximal compatibility.

Users who care about interoperability can explicitly cast data as appropriate, similar to python's coerce_timestamps functionality, or just restrict themselves to using a supported subset of arrow types.
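As a sketch of what that explicit cast might look like on the caller's side (the `to_compatible` helper and the choice of Date64 -> Date32 are illustrative, not part of this crate's API):

```rust
use std::sync::Arc;

use arrow::compute::cast;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::Result;
use arrow::record_batch::RecordBatch;

/// Hypothetical helper: rewrite any Date64 column as Date32 so the written
/// file can be read correctly without the embedded arrow schema.
fn to_compatible(batch: &RecordBatch) -> Result<RecordBatch> {
    let schema = batch.schema();
    let mut fields = Vec::new();
    let mut columns = Vec::new();
    for (field, column) in schema.fields().iter().zip(batch.columns()) {
        if field.data_type() == &DataType::Date64 {
            // Lossy below day granularity, but readable by any parquet reader.
            fields.push(Field::new(field.name(), DataType::Date32, field.is_nullable()));
            columns.push(cast(column, &DataType::Date32)?);
        } else {
            fields.push(Field::new(
                field.name(),
                field.data_type().clone(),
                field.is_nullable(),
            ));
            columns.push(column.clone());
        }
    }
    RecordBatch::try_new(Arc::new(Schema::new(fields)), columns)
}
```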

Thoughts @sunchao @nevi-me @alamb @jorisvandenbossche @jorgecarleitao

Additional Context


Labels: parquet (Changes to the parquet crate), question (Further information is requested)
