
Handling Unsupported Arrow Types in Parquet #1666

@tustvold

Description


Problem

Data in parquet is stored as one of a limited number of physical types. There are then three mechanisms that an arrow reader can use to infer the type of the data:

  1. The deprecated ConvertedType enumeration stored within the parquet schema
  2. The LogicalType enumeration stored within the parquet schema
  3. An embedded arrow schema stored within the parquet file metadata

All parquet readers support 1, all v2 readers support 2, and only arrow readers support 3.

In some cases the logical type is not necessary to correctly interpret the data, e.g. String vs LargeString, but in others it fundamentally alters the meaning of the data.
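For concreteness, here is a minimal sketch of how mechanism 3 comes into play on read, assuming a recent version of the `parquet` crate and that `parquet_to_arrow_schema` keeps its current signature (the helper `inferred_schema` is hypothetical, not part of the crate):

```rust
use std::fs::File;

use arrow::datatypes::Schema;
use parquet::arrow::parquet_to_arrow_schema;
use parquet::file::reader::{FileReader, SerializedFileReader};

/// Hypothetical helper: recover the arrow schema a reader would infer for `path`.
fn inferred_schema(path: &str) -> Result<Schema, Box<dyn std::error::Error>> {
    let reader = SerializedFileReader::new(File::open(path)?)?;
    let meta = reader.metadata().file_metadata();
    // If the key-value metadata carries an embedded arrow schema (the
    // "ARROW:SCHEMA" key), it informs the result; otherwise the arrow types
    // are inferred from the parquet converted/logical types alone.
    let schema = parquet_to_arrow_schema(meta.schema_descr(), meta.key_value_metadata())?;
    Ok(schema)
}
```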

Timestamp

A nanosecond time unit is not supported in 1, and a second time unit is only natively supported in 3.

Currently this crate encodes second time units with a logical type of milliseconds, which is likely a bug. It also does not convert nanoseconds to microseconds when using writer version 1, despite nanoseconds only being supported in format versions 2.6 and later.

The python implementation will, depending on the writer version, potentially cast nanoseconds to microseconds, and seconds to milliseconds - see here
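A caller of this crate can perform the same coercion explicitly today. A minimal sketch using arrow's `cast` kernel (the helper name `coerce_to_micros` is purely illustrative):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, TimestampNanosecondArray};
use arrow::compute::cast;
use arrow::datatypes::{DataType, TimeUnit};

/// Illustrative helper: cast a nanosecond timestamp column down to microseconds.
fn coerce_to_micros(col: ArrayRef) -> arrow::error::Result<ArrayRef> {
    // Truncates the nanosecond component; a lossy cast the caller opts into.
    cast(&col, &DataType::Timestamp(TimeUnit::Microsecond, None))
}

fn main() -> arrow::error::Result<()> {
    let nanos: ArrayRef = Arc::new(TimestampNanosecondArray::from(vec![1_000_000_001i64]));
    let micros = coerce_to_micros(nanos)?;
    assert_eq!(
        micros.data_type(),
        &DataType::Timestamp(TimeUnit::Microsecond, None)
    );
    Ok(())
}
```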

There does not appear to be a way to round-trip timezones in 1 or 2. The C++ implementation appears to always normalize timezones to UTC and set is_adjusted_to_utc to true.

Currently this crate sets is_adjusted_to_utc to true if the timezone is set, despite the writer not actually performing the normalisation. I think this is a bug.

Date64

The arrow type Date64 is milliseconds since epoch and does not have an equivalent ConvertedType nor LogicalType.

Currently this crate converts the type to Date32 on write, losing sub-day precision. This is what the C++ implementation does - see here.
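Purely to illustrate where the loss occurs (the constants and values below are examples, not taken from this crate): a Date64 value is milliseconds since epoch, a Date32 value is days since epoch, so the conversion discards everything below the day boundary.

```rust
// Illustrative arithmetic only.
const MILLIS_PER_DAY: i64 = 24 * 60 * 60 * 1000;

fn main() {
    let date64_ms: i64 = 86_400_000 + 123; // one day plus 123 ms past midnight
    let date32_days = (date64_ms / MILLIS_PER_DAY) as i32;
    let lost_ms = date64_ms - (date32_days as i64) * MILLIS_PER_DAY;
    assert_eq!(date32_days, 1);
    assert_eq!(lost_ms, 123); // the sub-day component cannot be recovered
}
```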

Interval

The interval data type has a ConvertedType but not a LogicalType. There is a PR to add LogicalType support, apache/parquet-format#165, but it appears to have stalled somewhat.

An interval of MonthDayNano cannot be represented by the ConvertedType. The C++ implementation does not appear to support Interval types.
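For reference, the INTERVAL ConvertedType annotates a 12-byte FIXED_LEN_BYTE_ARRAY holding three little-endian unsigned 32-bit integers: months, days and milliseconds. A rough sketch (the helper below is hypothetical) of why an arrow MonthDayNano interval, which carries an i64 nanosecond component, has no lossless mapping onto that layout:

```rust
// Illustrative only: pack (months, days, milliseconds) into the 12-byte
// parquet INTERVAL representation.
fn encode_parquet_interval(months: u32, days: u32, millis: u32) -> [u8; 12] {
    let mut buf = [0u8; 12];
    buf[0..4].copy_from_slice(&months.to_le_bytes());
    buf[4..8].copy_from_slice(&days.to_le_bytes());
    buf[8..12].copy_from_slice(&millis.to_le_bytes());
    buf
}

fn main() {
    // A MonthDayNano value of (1 month, 2 days, 1_500_000 ns) would have to be
    // rounded to whole milliseconds (and fit in u32) to use this encoding.
    let bytes = encode_parquet_interval(1, 2, 1);
    assert_eq!(bytes.len(), 12);
}
```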

Proposal

There are, broadly speaking, three ways to handle data types that cannot be represented in the parquet schema:

  • Return an error
  • Cast to a native parquet representation, potentially losing precision in the process
  • Encode the data, only encoding its logical type in the embedded arrow schema

Returning an error is the safest, but is not a great UX. Casting to a native parquet representation is in line with what other arrow implementations do and gives the best ecosystem compatibility, but also doesn't make for a great UX; just search StackOverflow for allow_truncated_timestamps to see the confusion this causes.

The final option is the simplest to implement, the least surprising to users, and what I would propose implementing. It would break ecosystem interoperability in certain cases, but I think it is more important that we faithfully round-trip data than maintain maximal compatibility.

Users who care about interoperability can explicitly cast data as appropriate, similar to python's coerce_timestamps functionality, or just restrict themselves to using a supported subset of arrow types.
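As a sketch of what that explicit cast might look like on the caller's side (the `to_compatible` helper and the choice of Date64 -> Date32 are illustrative, not part of this crate's API):

```rust
use std::sync::Arc;

use arrow::compute::cast;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::Result;
use arrow::record_batch::RecordBatch;

/// Hypothetical helper: rewrite any Date64 column as Date32 so the written
/// file can be read correctly without the embedded arrow schema.
fn to_compatible(batch: &RecordBatch) -> Result<RecordBatch> {
    let schema = batch.schema();
    let mut fields = Vec::new();
    let mut columns = Vec::new();
    for (field, column) in schema.fields().iter().zip(batch.columns()) {
        if field.data_type() == &DataType::Date64 {
            // Lossy below day granularity, but readable by any parquet reader.
            fields.push(Field::new(field.name(), DataType::Date32, field.is_nullable()));
            columns.push(cast(column, &DataType::Date32)?);
        } else {
            fields.push(Field::new(
                field.name(),
                field.data_type().clone(),
                field.is_nullable(),
            ));
            columns.push(column.clone());
        }
    }
    RecordBatch::try_new(Arc::new(Schema::new(fields)), columns)
}
```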

Thoughts @sunchao @nevi-me @alamb @jorisvandenbossche @jorgecarleitao

Additional Context


Labels: parquet (Changes to the parquet crate), question (Further information is requested)
