Description
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import pandas as pd
from pandas.api.extensions import ExtensionArray, ExtensionDtype


class StubDtype(ExtensionDtype):
    """Extension dtype whose elements are something that numpy.asarray()
    will turn into an array (in this case a tuple)"""

    def __init__(self):
        pass

    @property
    def type(self):
        return tuple

    @property
    def name(self) -> str:
        return "StubDtype"

    @classmethod
    def construct_array_type(cls):
        return StubExtensionArray()


class StubExtensionArray(ExtensionArray):
    """Just enough of an extension array to run the four lines of code
    that follow."""

    @property
    def dtype(self):
        return StubDtype()

    def copy(self):
        return StubExtensionArray()

    def __len__(self):
        return 5

    def __getitem__(self, key):
        # Every position in the array has the tuple (1, 2, 3)
        return (1, 2, 3)


data = StubExtensionArray()
series1 = pd.Series(data, name="data")
series2 = pd.Series(index=series1.index, dtype=object, name="data")
series2.loc[series1.index] = data
Output:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-168-cfd127e7fa55> in <module>
41 series1 = pd.Series(data, name="data")
42 series2 = pd.Series(index=series1.index, dtype=object, name="data")
---> 43 series2.loc[series1.index] = data
~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
721
722 iloc = self if self.name == "iloc" else self.obj.iloc
--> 723 iloc._setitem_with_indexer(indexer, value, self.name)
724
725 def _validate_key(self, key, axis: int):
~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value, name)
1730 self._setitem_with_indexer_split_path(indexer, value, name)
1731 else:
-> 1732 self._setitem_single_block(indexer, value, name)
1733
1734 def _setitem_with_indexer_split_path(self, indexer, value, name: str):
~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/core/indexing.py in _setitem_single_block(self, indexer, value, name)
1966
1967 # actually do the set
-> 1968 self.obj._mgr = self.obj._mgr.setitem(indexer=indexer, value=value)
1969 self.obj._maybe_update_cacher(clear=True)
1970
~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/core/internals/managers.py in setitem(self, indexer, value)
353
354 def setitem(self: T, indexer, value) -> T:
--> 355 return self.apply("setitem", indexer=indexer, value=value)
356
357 def putmask(self, mask, new, align: bool = True):
~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
325 applied = b.apply(f, **kwargs)
326 else:
--> 327 applied = getattr(b, f)(**kwargs)
328 except (TypeError, NotImplementedError):
329 if not ignore_failures:
~/opt/miniconda3/envs/pd/lib/python3.7/site-packages/pandas/core/internals/blocks.py in setitem(self, indexer, value)
965 values[indexer] = value.to_numpy(value.dtype.numpy_dtype)
966 else:
--> 967 values[indexer] = np.asarray(value)
968
969 # if we are an exact match (ex-broadcasting),
ValueError: shape mismatch: value array of shape (5,3) could not be broadcast to indexing result of shape (5,)
Problem description
If the user creates a Series of dtype object and attempts to set the values of that Series with an extension array, the block manager first passes the extension array through np.asarray() and then assigns the resulting ndarray to the block's values (see the setitem frame in pandas/core/internals/blocks.py at the bottom of the traceback above).

This logic assumes that np.asarray() will always return a 1D array. That is not guaranteed: if the items of the argument are themselves array-like, np.asarray() iterates over them and produces an array with two or more dimensions. Such an array can't be assigned to the Series, and pandas raises "ValueError: shape mismatch...".

This problem affects the TensorArray extension type in Text Extensions for Pandas, because the elements of a TensorArray are tensors. The example code above shows a simpler case where the items of the extension array are Python tuples. In general, any item type that np.asarray() converts to an array of one or more dimensions will have this problem.
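The shape problem can be reproduced with NumPy alone. The following is a minimal sketch, not the pandas code path itself: the names items, target, as_2d and as_1d are made up for illustration, and a plain list of tuples stands in for the extension array.

```python
import numpy as np

# Stand-in for the extension array: five items, each the tuple (1, 2, 3).
items = [(1, 2, 3)] * 5

# np.asarray() treats each tuple as a nested sequence, so the result is 2D.
as_2d = np.asarray(items)
print(as_2d.shape)

# Assigning the 2D array into five slots of a 1D object array fails,
# which mirrors the ValueError in the traceback above.
target = np.empty(5, dtype=object)
try:
    target[np.arange(5)] = as_2d
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# Preallocating a 1D object array and filling it position by position
# keeps one tuple per slot, so the same assignment succeeds.
as_1d = np.empty(len(items), dtype=object)
for i, item in enumerate(items):
    as_1d[i] = item
target[np.arange(5)] = as_1d
print(as_1d.shape)
```

Filling the object array element by element is only one possible way to keep the result one-dimensional; whether pandas should do this internally is a design question for the maintainers.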
Expected Output
The above code should fill series2 with the individual objects at each position in the extension array. For the example code above, that means each element of series2 should contain the Python tuple (1, 2, 3).
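For reference, the expected end state can be built directly, without the failing assignment. This is a sketch under the assumption that a plain list of tuples is an acceptable stand-in for the extension array's elements; the names items and expected are illustrative only.

```python
import pandas as pd

# Stand-in for the five elements of the extension array.
items = [(1, 2, 3)] * 5

# Constructing an object Series from a list of tuples keeps each tuple
# as a single element, which is the state series2 is expected to reach.
expected = pd.Series(items, dtype=object, name="data")
print(expected.tolist())
```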
Output of pd.show_versions()
pandas : 1.3.0
numpy : 1.21.0
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.3
setuptools : 52.0.0.post20210125
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : 4.0.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.25.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 4.0.1
pyxlsb : None
s3fs : None
scipy : 1.7.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None