BUG: DataFrame.attrs are lost when writing to HDF5 #34596

Open
2 of 3 tasks
ulijh opened this issue Jun 5, 2020 · 5 comments
Labels
Bug · IO HDF5 (read_hdf, HDFStore) · metadata (_metadata, .attrs)

Comments

ulijh commented Jun 5, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas. Version 1.0.3

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

In [9]: df = pd.DataFrame(index=[1, 2, 3], columns=list("abcde"), data=np.ones((3,5)))                                                  

In [10]: df.attrs["foo"] = "bar"  

In [11]: df.to_hdf("test_df.h5", key="key")                         

In [12]: df_from_h5 = pd.read_hdf("test_df.h5")                     

In [13]: assert df.attrs == df_from_h5.attrs, "attrs have gone"     
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-17-1fab1bc115de> in <module>
----> 1 assert df.attrs == df_from_h5.attrs, "attrs have gone"

AssertionError: attrs have gone

Problem description

The metadata stored in attributes is gone after the DataFrame was read back from disk. I understand the attrs dict is WIP. I hope this issue will help to move this forward! Thanks!

Related: #29062

Expected Output

The attrs should be the same as in the DataFrame written to disk.
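Until `to_hdf` handles this itself, one possible workaround is to store the attrs by hand on the HDF5 node that holds the frame. The sketch below is not part of pandas: the helper names `to_hdf_with_attrs` and `read_hdf_with_attrs` and the node attribute name `user_attrs` are made up for illustration. It assumes PyTables is installed and that the values in `df.attrs` are plain Python objects that PyTables can store as a node attribute.

```python
import pandas as pd

# Hypothetical attribute name on the HDF5 node; not a pandas convention.
ATTRS_NODE_NAME = "user_attrs"


def to_hdf_with_attrs(df: pd.DataFrame, path: str, key: str) -> None:
    """Write df to an HDF5 file and keep a copy of df.attrs on its node."""
    with pd.HDFStore(path, mode="w") as store:
        store.put(key, df)
        # get_storer(key).attrs exposes the attributes of the underlying node.
        setattr(store.get_storer(key).attrs, ATTRS_NODE_NAME, dict(df.attrs))


def read_hdf_with_attrs(path: str, key: str) -> pd.DataFrame:
    """Read df back and restore attrs if the node carries them."""
    with pd.HDFStore(path, mode="r") as store:
        df = store.get(key)
        saved = getattr(store.get_storer(key).attrs, ATTRS_NODE_NAME, {})
    df.attrs.update(saved)
    return df
```

With these helpers, writing the frame from the code sample via `to_hdf_with_attrs(df, "test_df.h5", "key")` and reading it with `read_hdf_with_attrs("test_df.h5", "key")` should give back a frame whose `attrs` contains `"foo"` again. Files written this way stay readable with plain `pd.read_hdf`, which simply ignores the extra node attribute.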

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 5.6.15-arch1-1
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : de_DE.utf8
LOCALE : de_DE.UTF-8

pandas : 1.0.3
numpy : 1.18.4
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 47.1.1
Cython : 0.29.19
pytest : 5.4.2
hypothesis : None
sphinx : 3.0.4
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : None
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.5.1
matplotlib : 3.2.1
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.4.2
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.17
tables : 3.6.1
tabulate : None
xarray : 0.15.2.dev47+g33a66d63
xlrd : None
xlwt : None
xlsxwriter : None
numba : 0.49.1

ulijh added the Bug and Needs Triage labels Jun 5, 2020
jreback (Contributor) commented Jun 5, 2020

Pull requests would move this issue forward.

ulijh (Author) commented Jun 5, 2020

If I find the time, I'll try to come up with something.

TomAugspurger (Contributor) commented

It might be worth looking at how xarray handles these. Ideally we would be compatible with how / where they store metadata.

jbrockmendel added the IO HDF5 and metadata labels and removed the Needs Triage label Jun 5, 2020
snowman2 commented

Related: pydata/xarray#3497

janosh (Contributor) commented Aug 25, 2022

The same is true for to_json btw (and probably all serialization methods?).

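from datetime import datetime

import pandas as pd

# pd.util.testing has been deprecated since pandas 1.0, so build a small
# mixed-dtype frame by hand instead of calling makeMixedDataFrame().
df = pd.DataFrame({"A": [0.0, 1.0, 2.0], "B": ["foo1", "foo2", "foo3"]})

today = f"{datetime.now():%Y-%m-%d}"
df.attrs["created_at"] = today
df.to_json("test.json")

df_from_json = pd.read_json("test.json")

# Fails because attrs are not written to the JSON file.
assert df.attrs == df_from_json.attrs, f"{df_from_json.attrs = }"
# AssertionError: df_from_json.attrs = {}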

attrs will be a tremendously useful feature once it gets better permanence.
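In the meantime, a possible stopgap for the JSON case is to wrap the frame and its attrs in one JSON document. This is only a sketch under assumptions: the helper names `to_json_with_attrs` and `read_json_with_attrs` and the `"attrs"`/`"data"` envelope keys are invented here and are not a pandas API, and it only works when every value in `df.attrs` is JSON-serializable.

```python
import io
import json

import pandas as pd


def to_json_with_attrs(df: pd.DataFrame) -> str:
    """Serialize the data and attrs of df into a single JSON string."""
    payload = {
        # Hypothetical envelope keys, chosen for this example only.
        "attrs": df.attrs,
        "data": json.loads(df.to_json(orient="split")),
    }
    return json.dumps(payload)


def read_json_with_attrs(text: str) -> pd.DataFrame:
    """Rebuild the frame from the envelope and restore its attrs."""
    payload = json.loads(text)
    data_json = json.dumps(payload["data"])
    df = pd.read_json(io.StringIO(data_json), orient="split")
    df.attrs.update(payload["attrs"])
    return df


df = pd.DataFrame({"A": [0.0, 1.0, 2.0], "B": ["foo1", "foo2", "foo3"]})
df.attrs["created_at"] = "2022-08-25"

restored = read_json_with_attrs(to_json_with_attrs(df))
assert restored.attrs == df.attrs
```

The drawback is that the resulting file is no longer something plain `pd.read_json` understands, which is part of why support inside the serializers themselves would be preferable.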
