BUG: Calling df._repartition(axis=1) on updated df will raise IndexError #7170
Description
Modin version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest released version of Modin.
- I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)
Reproducible Example
import time
import modin.pandas as pd
import modin.config as cfg
import numpy as np
import ray
from modin.distributed.dataframe.pandas import unwrap_partitions, from_partitions
from sklearn.preprocessing import RobustScaler
from sklearn.tree import DecisionTreeClassifier
ray.init()
# Configure Modin to split the dataframe into 5 row partitions and not to split it along columns
cfg.MinPartitionSize.put(102)
cfg.NPartitions.put(5)
# Generate samples
data = np.random.rand(10000, 100)
label = [i for i in range(1, 9)] * 1250
features = ['feature' + str(i) for i in range(1, 101)]
df = pd.DataFrame(data=data, columns=features)
df['label'] = label
# Scale samples
scaler = RobustScaler()
res = scaler.fit_transform(df[[column for column in df.columns if column != 'label']].to_numpy())
frame = pd.DataFrame(res, columns=[column for column in df.columns if column != 'label'])
# Update dataframe
df.update(frame)
# Repartition so that the dataframe has only 1 partition along the column axis
# This works
partitions = unwrap_partitions(df, axis=0)
df = from_partitions(partitions, axis=0)
# Using this instead leads to the IndexError below (it surfaces later, in to_numpy)
# df = df._repartition(axis=1)
# Fit a DTC model of sklearn
clf = DecisionTreeClassifier()
features = df[df.columns.drop(['label'])].to_numpy()
clf.fit(features, label)
Issue Description
I created a dataframe of shape (10000, 101).
To keep the df in a single partition along the column axis, I followed @YarShev's suggestion to set MinPartitionSize.
I then scaled the df with sklearn's RobustScaler and tried to fit a DecisionTreeClassifier (DTC) model.
However, I found that the updated df was partitioned along columns again, which made fitting take about twice as long.
So I tried to repartition the df along the column axis only by calling df = df._repartition(axis=1), but that resulted in an IndexError.
I managed to work around the problem by calling unwrap_partitions and from_partitions instead.
Expected Behavior
df._repartition(axis=1) should leave the updated df with only 1 partition along the column axis, and the repartitioned df should be usable as input to the DTC.
Error Logs
Traceback (most recent call last):
File "D:\Work\Python\RayDemo3.8\aaaa.py", line 41, in <module>
features = df[df.columns.drop(['label'])].to_numpy()
File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
return obj(*args, **kwargs)
File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\pandas\base.py", line 3138, in to_numpy
return self._to_bare_numpy(
File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
return obj(*args, **kwargs)
File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\pandas\base.py", line 3119, in _to_bare_numpy
return self._query_compiler.to_numpy(
File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
return obj(*args, **kwargs)
File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\core\storage_formats\pandas\query_compiler.py", line 376, in to_numpy
arr = self._modin_frame.to_numpy(**kwargs)
File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
return obj(*args, **kwargs)
File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\core\dataframe\pandas\dataframe\dataframe.py", line 3882, in to_numpy
return self._partition_mgr_cls.to_numpy(self._partitions, **kwargs)
File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
return obj(*args, **kwargs)
File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\core\execution\ray\generic\partitioning\partition_manager.py", line 43, in to_numpy
parts = RayWrapper.materialize(
File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\core\execution\ray\common\engine_wrapper.py", line 92, in materialize
return ray.get(obj_id)
File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\ray\_private\auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\ray\_private\client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\ray\_private\worker.py", line 2667, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\ray\_private\worker.py", line 864, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(IndexError): ray::_apply_list_of_funcs() (pid=10084, ip=127.0.0.1)
File "python\ray\_raylet.pyx", line 1889, in ray._raylet.execute_task
File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\core\execution\ray\implementations\pandas_on_ray\partitioning\partition.py", line 440, in _apply_list_of_funcs
partition = func(partition, *args, **kwargs)
File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\core\dataframe\pandas\partitioning\partition.py", line 217, in _iloc
return df.iloc[row_labels, col_labels]
File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\pandas\core\indexing.py", line 1097, in __getitem__
return self._getitem_tuple(key)
File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\pandas\core\indexing.py", line 1594, in _getitem_tuple
tup = self._validate_tuple_indexer(tup)
File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\pandas\core\indexing.py", line 904, in _validate_tuple_indexer
self._validate_key(k, i)
File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\pandas\core\indexing.py", line 1516, in _validate_key
raise IndexError("positional indexers are out-of-bounds")
IndexError: positional indexers are out-of-bounds
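The last frames of the traceback come down to a plain pandas `df.iloc[row_labels, col_labels]` call on a single partition. As a minimal pandas-only sketch (no Modin or Ray involved), the same message appears when the requested column positions exceed the width of the frame being indexed. The frame size and the positions below are made up for illustration, and whether `_repartition(axis=1)` actually ends up with mismatched positions in this way is an assumption, not something confirmed here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one column partition: 4 rows, 3 columns.
part = pd.DataFrame(np.zeros((4, 3)), columns=["a", "b", "c"])

# Hypothetical column positions that assume a wider partition than this one.
col_positions = [0, 1, 5]

message = None
try:
    # Same call shape as in the traceback: iloc[row_labels, col_labels].
    part.iloc[slice(None), col_positions]
except IndexError as err:
    message = str(err)
    print(message)
```

If this is the mechanism, it would suggest that the column positions passed to the partition were computed for a different column layout than the one the partition really holds.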
Installed Versions
INSTALLED VERSIONS
commit : 0c3746b
python : 3.8.10.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22000
machine : AMD64
processor : Intel64 Family 6 Model 151 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Chinese (Simplified)_China.936
Modin dependencies
modin : 0.23.1.post0
ray : 2.10.0
dask : 2023.5.0
distributed : None
hdk : None
pandas dependencies
pandas : 2.0.3
numpy : 1.24.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.0
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : 1.4.6
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2023.10.0
gcsfs : None
matplotlib : 3.7.4
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 15.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : 2.0.25
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
None