Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transfer _metadata from Subclassed DataFrame to Subclassed Series #19850

Open
jaumebonet opened this issue Feb 22, 2018 · 11 comments
Open

Transfer _metadata from Subclassed DataFrame to Subclassed Series #19850

jaumebonet opened this issue Feb 22, 2018 · 11 comments
Labels
Enhancement metadata _metadata, .attrs Subclassing Subclassing pandas objects

Comments

@jaumebonet
Copy link
Contributor

Code Sample, a copy-pastable example if possible

Let's assume the following subclassing case:

import pandas as pd

# Define a subclass of Series
class ExtendedSeries( pd.Series ):
     _metadata = ['_reference']

     def __init__( self, *args, **kwargs ):
        reference = kwargs.pop('reference', {})
        super(ExtendedSeries, self).__init__(*args, **kwargs)
        self._reference = reference

    @property
    def _constructor(self):
        return ExtendedSeries

# Define a subclass of DataFrame that slices into the ExtendedSeries class
class ExtendedFrame( pd.DataFrame ):

     _metadata = ['_reference']

    def __init__( self, *args, **kwargs ):
        reference = kwargs.pop('reference', {})
        super(ExtendedFrame, self).__init__(*args, **kwargs)
        self._reference = reference

    @property
    def _constructor(self):
        return ExtendedFrame
    @property
    def _constructor_sliced(self):
        return ExtendedSeries

Problem description

This works fine, but does not allow to transfer extended metadata from the ExtendedFrame to the ExtendedSeries.
As far as I understand, writing the _constructor_sliced as follows should work:

import pandas as pd

# Define a subclass of DataFrame that slices into the ExtendedSeries class
class ExtendedFrame( pd.DataFrame ):

     _metadata = ['_reference']

    def __init__( self, *args, **kwargs ):
        reference = kwargs.pop('reference', {})
        super(ExtendedFrame, self).__init__(*args, **kwargs)
        self._reference = reference

    @property
    def _constructor(self):
        return ExtendedFrame
    @property
    def _constructor_sliced(self):
        a = ExtendedSeries([], reference=self._reference)
        return a.__init__

this would allow to first set the metadata and then return the object to initialice its data. Isn't it?
But defining it this way gives errors in core/frame:2166 and core/frame:2563. In both cases this is due to the call self._constructor_sliced._from_array().

Seeing that Series.from_array has been labeled as deprecated and that Series._from_array calls the class' constructor. Couldn't it be possible to just change the two instances of self._constructor_sliced._from_array() to self._constructor_sliced()?
If I'm seeing this correctly, wouldn't this change allow for this level of flexibility in subclassing without affecting the regular functionality?

Output of pd.show_versions()

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Darwin
OS-release: 17.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.21.1
pytest: None
pip: 9.0.1
setuptools: 38.4.0
Cython: None
numpy: 1.14.0
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: 5.4.1
sphinx: 1.6.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.1
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jreback
Copy link
Contributor

jreback commented Feb 23, 2018

this is not implemented in any way. It IS possible, but slightly non-trivial to actually make this work properly. it is similar to #13208. We don't have the support for this in indexing. If you have interest in making this happen, then please submit a PR.

@jreback jreback added this to the Someday milestone Feb 23, 2018
@jaumebonet
Copy link
Contributor Author

I see.
I guess adding it to indexing is a bit out of my knowledge of pandas.
Regardless, seeing the issue you point out, making the two-line change I'm proposing would allow users to apply the workaround of #13208 to _constructor_sliced. Something like this:

    @property
    def _constructor_sliced(self):
        def f(*args, **kwargs):
        # adapted from https://github.com/pandas-dev/pandas/issues/13208#issuecomment-326556232
            return DesignSeries(*args, **kwargs).__finalize__(self, method='inherit')
        return f

which right now won't work, allows to process the inheritance in __finalize__. It is not a full-fledge automatic solution but would facilitate working with derived classes.
I've tried locally and it seems to work for me. I would be happy to send a PR with this if it makes sense for you guys.

@jorisvandenbossche
Copy link
Member

See #18258 where from_array was deprecated. The main problem in replacing _constructor_sliced._from_array with just _constructor_sliced is that that handles sparse arrays a bit differently. So that is something we need to solve.

@jaumebonet
Copy link
Contributor Author

jaumebonet commented Feb 23, 2018

oh! I see!
I don't get how I missed it before.
Couldn't this be added to the Series' __new__?
Kind of like:

def __new__( cls, *args, **kwargs ):
    # arr is mandatory, first argument or key `arr`.
    if isinstance(kwargs.get('arr', args[0]), ABCSparseArray):
        from pandas.core.sparse.series import SparseSeries
        cls = SparseSeries
    obj = object.__new__(cls)
    obj.__init__(*args, **kwargs)
    return obj

This way the check is kept in the constructor itself.

@jorisvandenbossche
Copy link
Member

That's a possibility indeed, but a change in API that we need to discuss (I am not familiar enough with sparse to really understand the possible consequences).
Eg currently you can actually hold spars array data in a normal Series:

In [27]: s = pd.Series(pd.SparseArray([1, 0, 0, 2, 0]))

In [28]: s
Out[28]: 
0    1
1    0
2    0
3    2
4    0
dtype: int64

In [29]: s.dtype
Out[29]: dtype('int64')

In [30]: s.ftype
Out[30]: 'int64:sparse'

In [31]: s.values
Out[31]: 
[1, 0, 0, 2, 0]
Fill: 0
IntIndex
Indices: array([0, 3], dtype=int32)

but I am not fully sure what the difference / (dis)advantage of that is compared to SparseSeries.

Probably best to open a separate issue about that first.

@theOehrly
Copy link
Contributor

Hi, I ran into this problem, and I've been thinking about it for a bit. I think, since this issue was opened, quite a few things have changed in Pandas, so maybe it's worth taking another look at this.

Returning a function in ._constructor_sliced as proposed by @jaumebonet above is now possible, as far as I can tell. And the difficulty with SparseSeries seems solved too, as these don't exist as separate objects any more?
But what actually speaks against following up every call of .constructor_sliced with a call to .__finalize__ of the returned object? I just gave it a quick test and on the first glance it seems like it is possible? I guess there are some more difficulties involved though and it is not that easy really?

@jorisvandenbossche
Copy link
Member

Returning a function in ._constructor_sliced as proposed by @jaumebonet above is now possible, as far as I can tell.

That's indeed possible, as far as I know. In the GeoPandas project we are actually using a function as the return value in _constructor_sliced.
(and SparseSeries indeed doesn't exist anymore)

But what actually speaks against following up every call of .constructor_sliced with a call to .__finalize__ of the returned object? I

What do you mean exactly with this sentence?

@theOehrly
Copy link
Contributor

I meant calling .__finalize__ internally in pandas instead of wrapping this up in some kind of constructor function.

For example here:

pandas/pandas/core/frame.py

Lines 3425 to 3449 in 08104e8

def _ixs(self, i: int, axis: int = 0):
"""
Parameters
----------
i : int
axis : int
Notes
-----
If slice passed, the resulting data will be a view.
"""
# irow
if axis == 0:
new_values = self._mgr.fast_xs(i)
# if we are a copy, mark as such
copy = isinstance(new_values, np.ndarray) and new_values.base is None
result = self._constructor_sliced(
new_values,
index=self.columns,
name=self.index[i],
dtype=new_values.dtype,
)
result._set_is_copy(self, copy=copy)
return result

What about adding result.__finalize__(self) before the return statement?
This would of course need to be done at any point where a new object is created using ._constructor_sliced.

@jorisvandenbossche
Copy link
Member

Ah, OK, I see. In principle, it should not be needed to call __finalize__ yourself inside _constructor(_sliced/expanddim), as in theory we should call it in pandas.
(eg in GeoPandas we don't call __finalize__ in our constructor functions)

There will be places in pandas where we don't (yet) call __finalize__, like the ixs case you link, but I think we should consider that as case we should fix in pandas.

@jorisvandenbossche
Copy link
Member

We have a general issue about improve the coverage in pandas where we call __finalize__: #28283

@theOehrly
Copy link
Contributor

So, the propagation of metadata from a DataFrame to a Series is generally the same problem as all the other missing calls of __finalize__?
After all, __finalize__ copies attrs and flags as well. Should these "attributes" (or whatever you would call them) be transfered from a DataFrame to a Series as well?

@mroeschke mroeschke removed this from the Someday milestone Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement metadata _metadata, .attrs Subclassing Subclassing pandas objects
Projects
None yet
Development

No branches or pull requests

7 participants