BUG: list-like objects are broadcast to each row (1.3 regression) #42549
Description
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
This requires both pandas and grpcio-tools.
import subprocess
import sys
import pandas as pd
proto = """
syntax="proto3";
message Data {
repeated float values = 1;
}
"""
with open('data.proto', 'w') as f:
    f.write(proto)
subprocess.run([
    sys.executable, '-m', 'grpc_tools.protoc', '-I.',
    '--python_out=.', '--grpc_python_out=.', 'data.proto'
])
from data_pb2 import Data
proto_data = Data(values=range(3))
print(proto_data.values)
print(type(proto_data.values))
df = pd.DataFrame(index=range(3), data={'a': proto_data.values})
print(df)
Problem description
On 1.3.0 and the master branch this code prints:
[0.0, 1.0, 2.0]
<class 'google.protobuf.pyext._message.RepeatedScalarContainer'>
                 a
0  [0.0, 1.0, 2.0]
1  [0.0, 1.0, 2.0]
2  [0.0, 1.0, 2.0]
The issue seems to arise from #41592. In 1.2.x this object was handled by _try_cast in this else clause. After that change, it is handled by construct_1d_arraylike_from_scalar because is_list_like(data) returns False. Note that RepeatedScalarContainer implements PyTypeObject.tp_as_sequence but not PyTypeObject.tp_iter, so list(proto_data.values) works fine, but the hasattr(obj, "__iter__") check in is_list_like evaluates to False.
Based on all this, I suspect that the same issue will occur with any object that implements PyTypeObject.tp_as_sequence but not PyTypeObject.tp_iter. However, this protobuf object is the only example I have right now, so I can't test further.
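As a rough illustration of the mismatch described above, a pure-Python class that defines __getitem__ and __len__ but no __iter__ appears to show the same split: list() succeeds through the sequence protocol while the __iter__ attribute check fails. SeqOnly is a hypothetical stand-in, not the protobuf type, and this sketch does not call pandas, so whether pandas treats it the same way as RepeatedScalarContainer is an assumption that would still need to be tested.

```python
from collections.abc import Iterable


class SeqOnly:
    """Hypothetical stand-in: supports indexing and len(), but defines no __iter__."""

    def __init__(self, values):
        self._values = list(values)

    def __getitem__(self, index):
        # Raises IndexError past the end, which ends sequence-protocol iteration.
        return self._values[index]

    def __len__(self):
        return len(self._values)


obj = SeqOnly([0.0, 1.0, 2.0])

# list() falls back to calling __getitem__ with 0, 1, 2, ... until IndexError.
as_list = list(obj)
# The attribute check used to detect iterables does not see that fallback.
has_iter = hasattr(obj, "__iter__")
is_iterable = isinstance(obj, Iterable)

print(as_list, has_iter, is_iterable)
```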
I'm not familiar enough with Cython to provide a full fix, but is there some way to examine the struct fields of obj in is_list_like? If so, that function could be amended to check hasattr(obj, "__iter__") or hasattr(obj, "tp_as_sequence"). If not, I think the logic of sanitize_array needs to be amended.
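One caveat on the check suggested above: tp_as_sequence is a C-level struct field rather than a Python attribute, so hasattr(obj, "tp_as_sequence") would not find it from Python code. A possible Python-level approximation is to ask whether iter(obj) succeeds, since iter() also accepts objects that only support the sequence protocol. The sketch below is only an illustration of that idea; iterates and SeqOnly are hypothetical names, and this is not how pandas implements is_list_like.

```python
class SeqOnly:
    """Hypothetical stand-in: indexable, but defines no __iter__."""

    def __init__(self, values):
        self._values = list(values)

    def __getitem__(self, index):
        return self._values[index]


def iterates(obj):
    """Hypothetical helper: True if iter(obj) succeeds, excluding str and bytes."""
    if isinstance(obj, (str, bytes)):
        return False
    try:
        # iter() accepts both __iter__ and the __getitem__ sequence fallback.
        iter(obj)
    except TypeError:
        return False
    return True


results = {
    "seq_only": iterates(SeqOnly([0.0, 1.0, 2.0])),
    "list": iterates([1, 2, 3]),
    "int": iterates(5),
    "str": iterates("abc"),
}
print(results)
```

A trade-off of this approach is that it is more permissive than an attribute check: any object with a __getitem__ method passes, including ones that are not meant to be treated as sequences.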
Expected Output
On 1.2.x this prints:
[0.0, 1.0, 2.0]
<class 'google.protobuf.pyext._message.RepeatedScalarContainer'>
     a
0  0.0
1  1.0
2  2.0
Output of pd.show_versions()
pandas : 1.3.0
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 53.0.0
Cython : 0.29.22
pytest : 6.2.2
hypothesis : 6.3.4
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.20.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 3.0.0
pyxlsb : None
s3fs : None
scipy : 1.6.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None