BUG: Assigning extension array value to series of dtype object fails if element type is array-like #42437

Open
@frreiss

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
from pandas.api.extensions import ExtensionArray, ExtensionDtype

class StubDtype(ExtensionDtype):
    """Extension dtype whose elements are something that numpy.asarray() 
    will turn into an array (in this case a tuple)"""
    def __init__(self):
        pass
    
    @property
    def type(self):
        return tuple
    
    @property
    def name(self) -> str:
        return "StubDtype"
        
    @classmethod
    def construct_array_type(cls):
        # construct_array_type() must return the array class, not an instance
        return StubExtensionArray

class StubExtensionArray(ExtensionArray):
    """Just enough of an extension array to run the four lines of code 
    that follow."""
    @property
    def dtype(self):
        return StubDtype()
    
    def copy(self):
        return StubExtensionArray()
    
    def __len__(self):
        return 5
    
    def __getitem__(self, key):
        # Every position in the array has the tuple (1, 2, 3)
        return (1, 2, 3)
    
    

data = StubExtensionArray()
series1 = pd.Series(data, name="data")
series2 = pd.Series(index=series1.index, dtype=object, name="data")
series2.loc[series1.index] = data

Output:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-168-cfd127e7fa55> in <module>
     41 series1 = pd.Series(data, name="data")
     42 series2 = pd.Series(index=series1.index, dtype=object, name="data")
---> 43 series2.loc[series1.index] = data

~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
    721 
    722         iloc = self if self.name == "iloc" else self.obj.iloc
--> 723         iloc._setitem_with_indexer(indexer, value, self.name)
    724 
    725     def _validate_key(self, key, axis: int):

~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value, name)
   1730             self._setitem_with_indexer_split_path(indexer, value, name)
   1731         else:
-> 1732             self._setitem_single_block(indexer, value, name)
   1733 
   1734     def _setitem_with_indexer_split_path(self, indexer, value, name: str):

~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/core/indexing.py in _setitem_single_block(self, indexer, value, name)
   1966 
   1967         # actually do the set
-> 1968         self.obj._mgr = self.obj._mgr.setitem(indexer=indexer, value=value)
   1969         self.obj._maybe_update_cacher(clear=True)
   1970 

~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/core/internals/managers.py in setitem(self, indexer, value)
    353 
    354     def setitem(self: T, indexer, value) -> T:
--> 355         return self.apply("setitem", indexer=indexer, value=value)
    356 
    357     def putmask(self, mask, new, align: bool = True):

~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
    325                     applied = b.apply(f, **kwargs)
    326                 else:
--> 327                     applied = getattr(b, f)(**kwargs)
    328             except (TypeError, NotImplementedError):
    329                 if not ignore_failures:

~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/core/internals/blocks.py in setitem(self, indexer, value)
    965                 values[indexer] = value.to_numpy(value.dtype.numpy_dtype)
    966             else:
--> 967                 values[indexer] = np.asarray(value)
    968 
    969         # if we are an exact match (ex-broadcasting),

ValueError: shape mismatch: value array of shape (5,3) could not be broadcast to indexing result of shape (5,)

Problem description

If the user creates a Series of dtype object and attempts to set values in that Series from an extension array, the block manager first passes the extension array through np.asarray() and then assigns the resulting ndarray into the block's values (see setitem in pandas/core/internals/blocks.py, line 967 in the traceback above).

This logic assumes that np.asarray() always returns a 1D array. That is not guaranteed: if the items of the argument are array-like, np.asarray() iterates over them and produces an array with two or more dimensions. Such an array cannot be assigned to the Series, and pandas raises "ValueError: shape mismatch...".
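The shape problem can be reproduced with NumPy alone. The following is a minimal sketch, not the pandas code path itself: the list `items` stands in for the extension array, and `values` and `indexer` are hypothetical stand-ins for the object block's values and the indexer seen in the traceback.

```python
import numpy as np

# Stand-in for the extension array: five elements, each the tuple (1, 2, 3).
items = [(1, 2, 3)] * 5

# np.asarray() descends into the tuples and builds a 2D integer array.
converted = np.asarray(items)
print(converted.shape)

# Stand-ins for the object block's values and the positional indexer.
values = np.empty(5, dtype=object)
indexer = np.arange(5)

# Assigning the 2D array into five slots fails, as in blocks.py.
try:
    values[indexer] = converted
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

The conversion yields five rows of three integers instead of five tuple objects, which is why the assignment into five slots cannot succeed.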

This problem is affecting the TensorArray extension type in Text Extensions for Pandas, because the elements of a TensorArray are tensors. The example code above shows a simpler case where the items of the extension array are Python tuples. In general, any item type that np.asarray() converts to an array of one or more dimensions will have this problem.

Expected Output

The above code should fill series2 with the individual objects at each of the positions in the extension array. In the case of the example code above, that means that each element of series2 should contain the Python tuple (1, 2, 3).
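One way to get this behavior is to build the 1D object array element by element instead of calling np.asarray() on the whole container. The sketch below uses NumPy only; `to_object_1d` is a hypothetical helper named for illustration and is not part of pandas, and the list `items` again stands in for the extension array.

```python
import numpy as np


def to_object_1d(items):
    """Hypothetical helper: copy each element into a 1D object array.

    Scalar assignment into an object array stores the element itself,
    so array-like elements such as tuples are not unpacked.
    """
    out = np.empty(len(items), dtype=object)
    for i in range(len(items)):
        out[i] = items[i]
    return out


# Stand-in for the extension array: five elements, each the tuple (1, 2, 3).
items = [(1, 2, 3)] * 5
obj = to_object_1d(items)

# Stand-ins for the object block's values and the positional indexer.
values = np.empty(5, dtype=object)
indexer = np.arange(5)
values[indexer] = obj  # shapes match: (5,) into (5,)

print(obj.shape, values[0])
```

Whether pandas should do this conversion internally, or delegate it to the extension array, is a design question for the maintainers; the sketch only shows that a 1D object array with one tuple per position is achievable.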

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : f00ed8f
python : 3.7.10.final.0
python-bits : 64
OS : Darwin
OS-release : 20.5.0
Version : Darwin Kernel Version 20.5.0: Sat May 8 05:10:33 PDT 2021; root:xnu-7195.121.3~9/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.0
numpy : 1.21.0
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.3
setuptools : 52.0.0.post20210125
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : 4.0.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.25.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 4.0.1
pyxlsb : None
s3fs : None
scipy : 1.7.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

    Labels

    Bug
    ExtensionArray: Extending pandas with custom dtypes or arrays.
    Indexing: Related to indexing on series/frames, not to indexes themselves
    Needs Tests: Unit test(s) needed to prevent regressions
    Regression: Functionality that used to work in a prior pandas version