Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: Store metadata in an attrs dict #29062

Merged
merged 5 commits into from
Oct 22, 2019

Conversation

TomAugspurger
Copy link
Contributor

This aids in the implementation of
#28394. Over there, I'm having
issues with using NDFrame.__finalize__ to copy attributes, in part
because getattribute on NDFrame is so complicated.

This simplifies that PR because we only need to look in NDFrame.attrs (name borrowed from xarray),
which is just a plain dictionary.

Aside from the addition of a public NDFrame.attrs dictionary, there
aren't any user-facing API changes.

This aids in the implementation of
pandas-dev#28394. Over there, I'm having
issues with using `NDFrame.__finalize__` to copy attributes, in part
because getattribute on NDFrame is so complicated.

This simplifies this because we only need to look in NDFrame.attrs,
which is just a plain dictionary.

Aside from the addition of a public NDFrame.attrs dictionary, there
aren't any user-facing API changes.
@TomAugspurger TomAugspurger added Refactor Internal refactoring of code metadata _metadata, .attrs labels Oct 17, 2019
@TomAugspurger TomAugspurger added this to the 1.0 milestone Oct 17, 2019
return dict(
_data=self._data,
_typ=self._typ,
_metadata=self._metadata,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are attrs linked to _metadata in any way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet, though perhaps eventually if we can find a suitable way to deprecate _metadata for subclasses. With this PR, pandas doesn't use _metadata itself anymore (previously, we just used it for Series.name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait, then we should deprecate _metadata right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most likely? I have no idea who is using it or what they're using it for though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok can you open an issue to deprecate that

pandas/core/generic.py Show resolved Hide resolved
pandas/core/generic.py Show resolved Hide resolved
pandas/core/generic.py Show resolved Hide resolved
return dict(
_data=self._data,
_typ=self._typ,
_metadata=self._metadata,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait, then we should deprecate _metadata right?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm.

return dict(
_data=self._data,
_typ=self._typ,
_metadata=self._metadata,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok can you open an issue to deprecate that

@@ -5213,6 +5245,9 @@ def __finalize__(self, other, method=None, **kwargs):

"""
if isinstance(other, NDFrame):
for name in other.attrs:
self.attrs[name] = other.attrs[name]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.attrs.update(other) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That'd work for now, but we'll need a for loop in my other PR since we'll have per-attr logic to resolve these.

@jreback
Copy link
Contributor

jreback commented Oct 19, 2019

do you want to add docs anywhere / whatsnew? or just keep private mention for now?

@TomAugspurger
Copy link
Contributor Author

do you want to add docs anywhere / whatsnew? or just keep private mention for now?

I pushed some basics API docs. Will want a larger section once things settle I think.

@jreback jreback merged commit 0a2c418 into pandas-dev:master Oct 22, 2019
@jreback
Copy link
Contributor

jreback commented Oct 22, 2019

thanks @TomAugspurger very nice!

@shoyer
Copy link
Member

shoyer commented Oct 24, 2019

In xarray, we currently use duck-typing to check for .attrs attributes in the xarray.DataArray constructor. So this PR broke our tests (see pydata/xarray#3440), but that's probably our own fault.

My bigger concern is that I noticed that attrs is used in different way in pandas than xarray: name is a member of attrs vs. in xarray where DataArray.name is a separate attribute. This is not an unreasonable thing to do, but it does introduce some minor incompatible: in xarray (and h5py) attrs really is an arbitrary mapping, but that isn't quite true in pandas (because name needs to be hashable). Also, what happens if you try to delete name from the attrs dict?

HawkinsBA pushed a commit to HawkinsBA/pandas that referenced this pull request Oct 29, 2019
Reksbril pushed a commit to Reksbril/pandas that referenced this pull request Nov 18, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
bongolegend pushed a commit to bongolegend/pandas that referenced this pull request Jan 1, 2020
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jan 7, 2020

Also, what happens if you try to delete name from the attrs dict?

From a quick test, it seems this deletes the name (s.name becomes None); which I suppose is the expected behaviour?

But you can indeed set it to a non-hashable value that way, which doesn't directly raise an error, but can confusingly give errors in later operations:

In [9]: s = pd.Series([1, 2, 3], name='a')  

In [10]: s.attrs  
Out[10]: {'name': 'a'}

In [12]: s.attrs['name'] = [1, 2] 

In [13]: s 
Out[13]: 
0    1
1    2
2    3
Name: [1, 2], dtype: int64

In [14]: s.to_frame()  
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-78ea8628cc3c> in <module>
----> 1 s.to_frame()

~/scipy/pandas/pandas/core/series.py in to_frame(self, name)
   1533         """
   1534         if name is None:
-> 1535             df = self._constructor_expanddim(self)
   1536         else:
   1537             df = self._constructor_expanddim({name: self})

~/scipy/pandas/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    456                 data_columns = list(data.dtype.names)
    457                 data = {k: data[k] for k in data_columns}
--> 458                 if columns is None:
    459                     columns = data_columns
    460                 mgr = init_dict(data, index, columns, dtype=dtype)

TypeError: unhashable type: 'list'

@jorisvandenbossche
Copy link
Member

Do we need to further discuss @shoyer's concern?

I was planning to try this out in GeoPandas (to see if it can replace our usage of _metadata), to be able to give feedback on this feature that way, but didn't get to it yet.

@TomAugspurger
Copy link
Contributor Author

Sorry I missed @shoyer's comment. I'll take a look.

@TomAugspurger TomAugspurger deleted the meta-attrs branch January 7, 2020 22:07
TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Jan 7, 2020
TomAugspurger added a commit that referenced this pull request Jan 8, 2020
* API: Store name outside attrs

This aligns with xarray and h5py:
#29062 (comment)
@buhrmann
Copy link

buhrmann commented Jan 30, 2020

Hi, not sure this is the best issue for my comments below, or too early, but I've been wanting for a while to store metadata along with Series or Dataframes. I've tried the subclassing approach, with _constructor_expanddim, _constructor etc., but eventually there was always some method that lead to the loss of my metadata. It seems that this new approach doesn't really solve the problem either (not sure if it's supposed to work like this):

In [17]: s = pd.Series([0,1,2], name="x")                                                                                                                                                                                                                                       
In [18]: s.attrs["mydata"] = "test"                                                                                                                                                                                                                                             
In [19]: s.attrs                                                                                                                                                                                                                                                                
Out[19]: {'mydata': 'test'}

In [20]: df = s.to_frame()                                                                                                                                                                                                                                                      
In [21]: df.x.attrs                                                                                                                                                                                                                                                             
Out[21]: {}

Is this supposed to work? Or do I misinterpret the purpose of attrs?

Edit:
For reference, it seems it may only be the "expanddim" direction from Series to DF doesn't seem to work (using s.to_frame() or the Dataframe constructor directly), while setting a column's attrs that already exists in a DF, and indexing or manipulating etc. seems to maintain the metadata correctly (though I haven't tested more complicated cases like aggregations etc.).

@TomAugspurger
Copy link
Contributor Author

Pandas will need to call finalize in more places to ensure the metadata is propagated.

@jorisvandenbossche
Copy link
Member

@buhrmann can you open a new issue with this specific use case?

@dotsdl
Copy link

dotsdl commented Feb 13, 2020

Is it appearing likely that DataFrame.attrs is here to stay? I'm considering using this for metadata used to differentiate DataFrames of two forms in alchemlyb, but wanted to check the temperature of the pandas devs on this question before relying on it. So far it works as expected (e.g. it survives such things as DataFrame.copy())!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
metadata _metadata, .attrs Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants