
[Python] Support converting to non-nano datetime64 for pandas >= 2.0 #33321

Description

Pandas is adding capabilities to store non-nanosecond datetime64 data. At the moment, however, we always convert to nanoseconds, regardless of the timestamp resolution of the Arrow table (and regardless of the pandas metadata).

Using the development version of pandas:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: import pyarrow as pa

In [4]: df = pd.DataFrame({"col": np.arange("2012-01-01", 10, dtype="datetime64[s]")})

In [5]: df.dtypes
Out[5]:
col    datetime64[s]
dtype: object

In [6]: table = pa.table(df)

In [7]: table.schema
Out[7]:
col: timestamp[s]
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 423

In [8]: table.to_pandas().dtypes
Out[8]:
col    datetime64[ns]
dtype: object

This is because we have a coerce_temporal_nanoseconds conversion option that we hardcode to True for top-level columns (and to False for nested data).
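For illustration, here is a minimal sketch of what preserving the resolution could look like if that internal option were exposed as a keyword on Table.to_pandas (the coerce_temporal_nanoseconds keyword in the commented lines is hypothetical at this point; today the option is hardcoded internally):

import numpy as np
import pyarrow as pa

# A second-resolution timestamp column (the integers are epoch seconds).
table = pa.table({"col": pa.array(np.arange(10), type=pa.timestamp("s"))})

# Current behavior: the conversion always coerces to nanoseconds.
assert str(table.to_pandas().dtypes["col"]) == "datetime64[ns]"

# Hypothetical: exposing the internal option would let callers opt out of
# the coercion with pandas >= 2.0, keeping the second resolution:
# df = table.to_pandas(coerce_temporal_nanoseconds=False)
# assert str(df.dtypes["col"]) == "datetime64[s]"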

When users have pandas >= 2.0, we should support converting while preserving the resolution. We should certainly do so when the pandas metadata indicates which resolution was originally used, to ensure a correct roundtrip.
We could (and at some point should) also do this by default when there is no pandas metadata, though perhaps only later, depending on how stable this new feature is in pandas, since it is potentially a breaking change for users who, for example, use pyarrow to read a Parquet file.
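To make the roundtrip concern concrete, here is a small sketch using Parquet, assuming pandas >= 2.0 (restoring datetime64[s] on read is the desired outcome; the current behavior yields datetime64[ns]):

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"col": np.array(["2012-01-01"] * 3, dtype="datetime64[s]")})

# The pandas metadata written to the file records the original
# datetime64[s] dtype of the column.
pq.write_table(pa.table(df), "data.parquet")

# Honoring that metadata on read would restore the second resolution;
# today the column comes back as datetime64[ns] instead.
print(pq.read_table("data.parquet").to_pandas().dtypes)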

Reporter: Joris Van den Bossche / @jorisvandenbossche

Note: This issue was originally created as ARROW-18124. Please see the migration documentation for further details.
