Description
Pandas is adding capabilities to store non-nanosecond datetime64 data. At the moment, however, we always convert to nanoseconds, regardless of the timestamp resolution of the Arrow table (and regardless of the pandas metadata).
Using the development version of pandas:
In [1]: df = pd.DataFrame({"col": np.arange("2012-01-01", 10, dtype="datetime64[s]")})
In [2]: df.dtypes
Out[2]:
col    datetime64[s]
dtype: object
In [3]: table = pa.table(df)
In [4]: table.schema
Out[4]:
col: timestamp[s]
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 423
In [6]: table.to_pandas().dtypes
Out[6]:
col    datetime64[ns]
dtype: object
This is because we have a coerce_temporal_nanoseconds conversion option, which is hardcoded to True for top-level columns (and to False for nested data).
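For reference, pandas >= 2 stores non-nanosecond datetime64 data natively, so "preserving the resolution" would mean mapping timestamp[s] to datetime64[s] rather than datetime64[ns]. A minimal sketch (assuming pandas >= 2; on older pandas the dtype comes back as datetime64[ns]):

```python
import numpy as np
import pandas as pd

# pandas >= 2 keeps the second resolution of the numpy array
arr = np.array(["2012-01-01", "2012-01-02"], dtype="datetime64[s]")
ser = pd.Series(arr)
print(ser.dtype)  # datetime64[s] on pandas >= 2

# the resolution can also be changed explicitly
ser_ms = ser.dt.as_unit("ms")
print(ser_ms.dtype)  # datetime64[ms]
```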
When users have pandas >= 2, we should support converting while preserving the resolution. We should certainly do so when the pandas metadata indicates which resolution was originally used (to ensure a correct roundtrip).
We could (and at some point should) also do that by default when there is no pandas metadata, but maybe only later, depending on how stable this new feature is in pandas, as it is potentially a breaking change for our users (e.g. when using pyarrow to read a Parquet file).
Reporter: Joris Van den Bossche / @jorisvandenbossche
Related issues:
- [Python][CI] Build with pandas master/nightly failure related to timedelta64 resolution (is related to)
Note: This issue was originally created as ARROW-18124. Please see the migration documentation for further details.