
BUG: Out of bound dates can be saved to feather but not loaded #47832

Open
3 tasks done
Tracked by #55564
adrienpacifico opened this issue Jul 23, 2022 · 9 comments
Labels
Bug, IO Parquet (parquet, feather), Needs Tests (Unit test(s) needed to prevent regressions)

Comments

@adrienpacifico
Contributor

adrienpacifico commented Jul 23, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandas.
  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
from datetime import datetime
df = pd.DataFrame({"date": [
    datetime.fromisoformat("1654-01-01"),
    datetime.fromisoformat("1920-01-01"),
],})
df.to_feather("to_trash.feather")
pd.read_feather("to_trash.feather")

Issue Description

It does not return the original dataframe but raises an exception instead.

ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: -9971942400000000

Expected Behavior

Return the original dataframe df.

Installed Versions

INSTALLED VERSIONS
------------------
commit : e8093ba
python : 3.9.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.0-7630-generic
Version : #32~1609193707~20.10~781bb80-Ubuntu SMP Tue Jan 5 21:29:56 UTC 2
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.3
numpy : 1.23.1
pytz : 2022.1
dateutil : 2.8.2
setuptools : 58.1.0
pip : 22.0.4
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.4.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 8.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

@adrienpacifico added the Bug and Needs Triage labels Jul 23, 2022
@adrienpacifico
Contributor Author

I think the issue is due to the arrow package. I have created this issue:
https://issues.apache.org/jira/browse/ARROW-17192

@datapythonista
Member

This is a bit tricky.

The column in your DataFrame is not a pandas datetime, but a Python object. Those are saved differently internally. If you try to convert the column to a pandas datetime it'll fail:

>>> pd.to_datetime(df['date'])
Traceback (most recent call last):
  File "/home/mgarcia/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/pandas/core/arrays/datetimes.py", line 2211, in objects_to_datetime64ns
    values, tz_parsed = conversion.datetime_to_datetime64(data.ravel("K"))
  File "pandas/_libs/tslibs/conversion.pyx", line 358, in pandas._libs.tslibs.conversion.datetime_to_datetime64
  File "pandas/_libs/tslibs/np_datetime.pyx", line 120, in pandas._libs.tslibs.np_datetime.check_dts_bounds
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1654-01-01 00:00:00

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mgarcia/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/pandas/core/tools/datetimes.py", line 1051, in to_datetime
    values = convert_listlike(arg._values, format)
  File "/home/mgarcia/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/pandas/core/tools/datetimes.py", line 402, in _convert_listlike_datetimes
    result, tz_parsed = objects_to_datetime64ns(
  File "/home/mgarcia/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/pandas/core/arrays/datetimes.py", line 2217, in objects_to_datetime64ns
    raise err
  File "/home/mgarcia/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/pandas/core/arrays/datetimes.py", line 2199, in objects_to_datetime64ns
    result, tz_parsed = tslib.array_to_datetime(
  File "pandas/_libs/tslib.pyx", line 381, in pandas._libs.tslib.array_to_datetime
  File "pandas/_libs/tslib.pyx", line 608, in pandas._libs.tslib.array_to_datetime
  File "pandas/_libs/tslib.pyx", line 604, in pandas._libs.tslib.array_to_datetime
  File "pandas/_libs/tslib.pyx", line 476, in pandas._libs.tslib.array_to_datetime
  File "pandas/_libs/tslibs/np_datetime.pyx", line 120, in pandas._libs.tslibs.np_datetime.check_dts_bounds
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1654-01-01 00:00:00

This is because the 1654 date is out of the range supported by pandas dates. See https://pandas.pydata.org/docs/user_guide/timeseries.html#timestamp-limitations
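For reference, the nanosecond-resolution limits can be checked directly:

>>> pd.Timestamp.min, pd.Timestamp.max
(Timestamp('1677-09-21 00:12:43.145224193'), Timestamp('2262-04-11 23:47:16.854775807'))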

While your expected behavior makes total sense, I think each individual current behavior is reasonable:

  • When saving a datetime Python column to feather, save it as a date
  • When loading a feather datetime column, try to load it as a pandas datetime, and raise if a date is out of bounds

We could consider raising an exception when a column of Python datetime objects is being saved, and force the user to cast it to a pandas datetime first, but I'm not sure about the implications.

In any case, in pandas 1.5, to be released soon, we should start having support for a much wider range of dates, so your specific case will work with the new release.
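As a sketch of what that could look like once non-nanosecond support is available (this assumes pandas >= 2.0, where datetime64[us] exists; it is an illustration, not the exact implementation), the column can be given a microsecond-resolution dtype instead of staying as Python objects:

import pandas as pd
from datetime import datetime

# Assumes pandas >= 2.0 (non-nanosecond datetime64 units).
dates = [datetime.fromisoformat("1654-01-01"), datetime.fromisoformat("1920-01-01")]
df = pd.DataFrame({"date": pd.Series(dates, dtype="datetime64[us]")})
df.to_feather("to_trash.feather")

Whether the read side round-trips as well depends on the pandas/pyarrow versions in use; see the later comments in this thread.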

@datapythonista changed the title to BUG: Out of bound dates can be saved to feather but not loaded Jul 24, 2022
@datapythonista added the Needs Discussion and IO Parquet labels and removed the Needs Triage label Jul 24, 2022
@adrienpacifico
Contributor Author

adrienpacifico commented Jul 24, 2022

EDIT: I wrote some incorrect things in the first version of this message.

We could consider raising an exception if a Python datetime column is being saved, and force the user to cast it to a pandas datetime. But not sure about the implications.

How would one save the above dataframe then? Convert the Python datetime objects to strings?

In any case, in pandas 1.5 released soon, we should start having support for a much wider range of dates. So, your specific case will work with the new release.

Yes, I'm quite excited about non-ns datetimes! I understand this issue might be solved in the next pandas release. Do you want me to close it?


If anyone faces this issue, you can access your feather file with this little hack:

import pandas as pd
import pyarrow.feather as feather

# Read as a pyarrow Table (no cast to datetime64[ns]), then build the DataFrame.
table = feather.read_table("to_trash.feather")
df = pd.DataFrame(table.to_pylist())

You can answer this SO question if you have a better solution.
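Another hack along the same lines (a sketch relying on pyarrow's timestamp_as_object option of Table.to_pandas) keeps the timestamps as Python datetime objects instead of casting them to datetime64[ns]:

import pyarrow.feather as feather

# Convert timestamps to Python datetime objects rather than datetime64[ns].
df = feather.read_table("to_trash.feather").to_pandas(timestamp_as_object=True)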

@adrienpacifico
Contributor Author

I just checked with pandas 1.5.0, and the issue is still there. Will this be solved in pandas 2.0, or in a future 1.5.x release?

@adrienpacifico
Contributor Author

Pandas 2.0 is out, and the problem still exists. However, if the feather file is opened with dtype_backend="pyarrow", it can be read without issue.

import pandas as pd
from datetime import datetime
df = pd.DataFrame({"date": [
    datetime.fromisoformat("1654-01-01"),
    datetime.fromisoformat("1920-01-01"),
],})
df.to_feather("test.feather")
pd.read_feather("test.feather", dtype_backend="pyarrow")

@datapythonista should we close the issue?

@datapythonista
Member

@jbrockmendel do you want to have a look here? Looks like when a date outside the nanosecond bounds is loaded from feather, it'll be cast to ns precision and fail, unless the pyarrow backend is used. I guess this is a bug, but I may be missing something.

@jbrockmendel
Member

Looks like the read_feather call is raising from within pyarrow. Not clear to me what we can do at that level.

On our end we might make it easier to solve by having df['date'] have dtype datetime64[us] rather than object. That requires getting unit inference working in array_to_datetime/maybe_convert_objects/infer_dtype, which is high on my todo list but not easy.
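To illustrate the inference gap (behavior as of the pandas versions discussed in this thread; the explicit datetime64[us] dtype needs pandas >= 2.0):

>>> from datetime import datetime
>>> import pandas as pd
>>> pd.Series([datetime.fromisoformat("1654-01-01")]).dtype
dtype('O')
>>> pd.Series([datetime.fromisoformat("1654-01-01")], dtype="datetime64[us]").dtype
dtype('<M8[us]')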

@jbrockmendel
Member

This is fixed by #55901, but will need a test.
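A rough sketch of what such a regression test could look like (names, location, and the expected resolution are illustrative assumptions, not the actual test that was added):

import pandas as pd
import pandas._testing as tm
from datetime import datetime


def test_feather_roundtrip_out_of_ns_bounds_dates(tmp_path):
    # GH#47832: dates outside the datetime64[ns] range could be written to
    # feather but raised on read.
    df = pd.DataFrame({"date": [datetime(1654, 1, 1), datetime(1920, 1, 1)]})
    path = tmp_path / "out_of_bounds.feather"
    df.to_feather(path)
    result = pd.read_feather(path)
    # Assumption: the stored timestamp[us] column is read back as datetime64[us].
    expected = pd.DataFrame(
        {"date": pd.Series([datetime(1654, 1, 1), datetime(1920, 1, 1)], dtype="datetime64[us]")}
    )
    tm.assert_frame_equal(result, expected)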

@jorisvandenbossche added the Needs Tests label and removed the Needs Discussion label Jun 17, 2024
@jasonmokk
Contributor

take

@jasonmokk removed their assignment Jul 1, 2024