Description
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): MacOS 10.15.7
- Modin version (modin.__version__): 0.8.2
- Python version: 3.8.5
- Code we can use to reproduce:
import pandas as pd
print("===PANDAS===")
s = pd.Series(['green'])
print(s)
print(type(s))
su = s.unique()
# Leads to same error as modin's unique()
# su = s.unique().squeeze()
print(su)
print(type(su))
print(len(su))
import modin.pandas as md
print("\n===MODIN===")
s = md.Series(['green'])
print(s)
print(type(s))
su = s.unique()
print(su)
print(type(su))
print(len(su))
Describe the problem
Whenever unique is called on a Series that contains only one unique value, Modin returns a zero-dimensional NumPy array (effectively a scalar), whereas pandas returns a NumPy array of length 1. As a result, calling len on Modin's result raises TypeError: len() of unsized object, while the same call on the pandas result succeeds. This is likely because Modin's implementation calls squeeze, which turns an array of length 1 into a zero-dimensional array.
This error does not occur when there are two or more unique values. The solution could be to remove squeeze from Modin's unique implementation. I will do more testing and try to follow up with a PR.
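The squeeze behaviour described above can be reproduced with NumPy alone, without Modin. This is a minimal sketch; it assumes only that the intermediate result is a one-element object array, and the variable names are illustrative:

```python
import numpy as np

# One-element array, standing in for the unique() result before squeeze
one_value = np.array(["green"], dtype=object)
squeezed = one_value.squeeze()

print(one_value.shape)  # shape of the original 1-D array
print(squeezed.shape)   # shape after squeeze (no dimensions left)
print(type(squeezed))   # still an ndarray, but zero-dimensional

try:
    len(squeezed)
except TypeError as exc:
    # Zero-dimensional arrays have no length
    print("TypeError:", exc)

# With two or more values, squeeze leaves the 1-D shape untouched,
# which is why the problem only shows up for a single unique value
two_values = np.array(["green", "red"], dtype=object)
print(two_values.squeeze().shape)
```

This matches the log below: the type is still numpy.ndarray, but len fails.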
Source code / logs
Log from above code to reproduce:
===PANDAS===
0 green
dtype: object
<class 'pandas.core.series.Series'>
['green']
<class 'numpy.ndarray'>
1
===MODIN===
0 green
dtype: object
<class 'modin.pandas.series.Series'>
green
<class 'numpy.ndarray'>
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-1-c4f0aa247643> in <module>
19 print(su)
20 print(type(su))
---> 21 print(len(su))
TypeError: len() of unsized object
Source code for Modin's unique (calls squeeze after to_numpy):
Lines 1347 to 1348 in c86422a
def unique(self):
    return self._query_compiler.unique().to_numpy().squeeze()
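Until the squeeze call is removed, callers could restore the one-dimensional shape themselves. The sketch below uses np.atleast_1d; the helper name as_1d is hypothetical and not part of Modin, and the squeezed array is a NumPy stand-in for what Modin 0.8.2 returns:

```python
import numpy as np

def as_1d(values):
    """Return values as an array with at least one dimension.

    Hypothetical helper for illustration; not a Modin or pandas API.
    """
    return np.atleast_1d(values)

# Stand-in for the zero-dimensional result of Series.unique() in Modin 0.8.2
squeezed = np.array(["green"], dtype=object).squeeze()

restored = as_1d(squeezed)
print(restored.shape)
print(len(restored))

# Arrays that are already 1-D pass through with their shape unchanged
already_1d = np.array(["green", "red"], dtype=object)
print(as_1d(already_1d).shape)
```

In user code this would be applied as as_1d(s.unique()), which should behave the same whether or not the underlying result was squeezed.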