Series .attrs is not correctly maintained/propagated in to_frame #31452

Closed
buhrmann opened this issue Jan 30, 2020 · 3 comments
Labels
Bug metadata _metadata, .attrs

Comments

@buhrmann

Hi, it seems that the .attrs dict (for storing metadata) is not always propagated correctly. E.g.

Code Sample

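import pandas as pd

# Attach metadata to a Series via .attrs
s = pd.Series([0, 1, 2], name="x")
s.attrs["mydata"] = "test"
s.attrs
# >> {'mydata': 'test'}

# Converting to a DataFrame loses the metadata
df = s.to_frame()
df.x.attrs
# >> {}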

Problem description

As @jorisvandenbossche commented here, this seems to be because calls to __finalize__ are missing in a few places in pandas.

I ran into it mainly in the "expanddim" direction (from Series to DataFrame). In contrast, copying, subsetting and manipulating a DataFrame column that has attrs explicitly set seems to work without losing the metadata. I haven't tried more complicated cases, like aggregations. I don't know much about pandas internals, but from previous attempts to implement metadata in my own subclass I seem to remember that these have special code paths in the BlockManager parts, where I suspect the same or a related problem may occur.

Expected Output

df.x.attrs
# >> {'mydata': 'test'}

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.8.0.final.0
python-bits : 64
OS : Darwin
OS-release : 19.2.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.0.0
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.1.0.post20200119
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@TomAugspurger
Contributor

#28394 is exploring this a bit (stalled for now though). Specifically https://github.com/pandas-dev/pandas/pull/28394/files#diff-03b380f521c43cf003207b0711bac67fR5275

Basically we need to call __finalize__ in a bunch more places when we return NDFrames. In many places where the only NDFrame that's an argument is self, that's straightforward. Just call result.__finalize__(self).
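
To illustrate the single-argument case, here is a minimal sketch of what calling __finalize__ does. It builds the frame by hand from the raw values (so nothing is propagated implicitly) and then finalizes it against the source Series. This only shows frame-level attrs being copied; it is not the actual to_frame implementation, and the variable names are just for illustration.

```python
import pandas as pd

s = pd.Series([0, 1, 2], name="x")
s.attrs["mydata"] = "test"

# Build the result from raw values so no metadata comes along implicitly.
result = pd.DataFrame({s.name: s.to_numpy()}, index=s.index)
attrs_before = dict(result.attrs)

# __finalize__ copies metadata (including .attrs) from `s` onto `result`
# and returns `result` itself.
result = result.__finalize__(s)
attrs_after = dict(result.attrs)

print(attrs_before)
print(attrs_after)
```

Whether column access such as result.x.attrs then reflects the metadata is a separate question and may depend on the pandas version.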

In other cases, like concat, binops, etc. it's less straightforward since there are multiple NDFrames whose metadata needs to be finalized. We would need a policy for resolving differing attributes (including cases where they are both present but differ, or present in just some and missing in others). Or we need an API for library authors / users to choose how attributes are resolved.

Finally, we need to consider the performance overhead of calling __finalize__.


Concretely, I think we can keep calling .__finalize__ in the obvious places like Series.to_frame / expanddim.

TomAugspurger added the "metadata _metadata, .attrs" label on Jan 30, 2020
TomAugspurger added this to the Contributions Welcome milestone on Jan 30, 2020
@buhrmann
Author

As a default strategy for methods with multiple NDFrames, like concat, merge etc., it would make sense to me if the metadata was simply updated from left to right (i.e. in user-specified order), such that consecutive NDFrame .attrs potentially overwrite earlier ones, at least as a first step.
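
A rough sketch of the left-to-right idea, using plain dicts as stand-ins for each object's .attrs. The function name merge_attrs_left_to_right is hypothetical and not part of pandas; it only demonstrates the resolution policy, not how concat or merge would be wired up.

```python
from typing import Any, Dict, Iterable, Mapping


def merge_attrs_left_to_right(
    attrs_list: Iterable[Mapping[str, Any]]
) -> Dict[str, Any]:
    """Combine attrs dicts in order; later dicts overwrite earlier keys."""
    merged: Dict[str, Any] = {}
    for attrs in attrs_list:
        merged.update(attrs)
    return merged


# Stand-ins for left.attrs and right.attrs (hypothetical example data).
left_attrs = {"source": "a.csv", "units": "m"}
right_attrs = {"source": "b.csv", "author": "x"}

merged = merge_attrs_left_to_right([left_attrs, right_attrs])
print(merged)
```

Keys that appear only on one side are kept, and for keys present on both sides the rightmost value wins.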

As an alternative, perhaps one could replicate the suffixes strategy for column names in pd.merge(), but applied to the attrs.keys?
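
The suffix idea could look roughly like this, again on plain dicts standing in for .attrs. The function name and the default suffixes are assumptions borrowed from the column-name convention in pd.merge(). This sketch only renames keys whose values actually differ, which is one possible design choice, and it assumes the values can be compared with ==.

```python
from typing import Any, Dict, Mapping, Tuple


def merge_attrs_with_suffixes(
    left: Mapping[str, Any],
    right: Mapping[str, Any],
    suffixes: Tuple[str, str] = ("_x", "_y"),
) -> Dict[str, Any]:
    """Keep both values for conflicting keys by suffixing the key names."""
    lsuffix, rsuffix = suffixes
    merged: Dict[str, Any] = {}
    for key, value in left.items():
        conflict = key in right and right[key] != value
        merged[key + lsuffix if conflict else key] = value
    for key, value in right.items():
        conflict = key in left and left[key] != value
        merged[key + rsuffix if conflict else key] = value
    return merged


# Hypothetical example data.
left_attrs = {"source": "a.csv", "units": "m"}
right_attrs = {"source": "b.csv", "units": "m", "author": "x"}

suffixed = merge_attrs_with_suffixes(left_attrs, right_attrs)
print(suffixed)
```

Compared with left-to-right overwriting, nothing is lost here, at the cost of key names changing after the operation.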

mroeschke added the Bug label on Apr 3, 2020
mroeschke changed the title from "Series .attrs is not correctly maintained/propagated in all cases" to "Series .attrs is not correctly maintained/propagated in to_frame" on Apr 3, 2020
@TomAugspurger
Contributor

Closing this as a duplicate of #28283
