BUG: Calling df._repartition(axis=1) on updated df will raise IndexError #7170

Closed
@Taurus-Le

Description

Modin version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest released version of Modin.

  • I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

import time

import modin.pandas as pd
import modin.config as cfg
import numpy as np
import ray
from modin.distributed.dataframe.pandas import unwrap_partitions, from_partitions
from sklearn.preprocessing import RobustScaler
from sklearn.tree import DecisionTreeClassifier

ray.init()

# Configure Modin to split the dataframe into 5 row partitions and to keep all columns in a single partition
cfg.MinPartitionSize.put(102)
cfg.NPartitions.put(5)

# Generate samples
data = np.random.rand(10000, 100)
label = [i for i in range(1, 9)] * 1250
features = ['feature' + str(i) for i in range(1, 101)]
df = pd.DataFrame(data=data, columns=features)
df['label'] = label

# Scale samples
scaler = RobustScaler()
res = scaler.fit_transform(df[[column for column in df.columns if column != 'label']].to_numpy())
frame = pd.DataFrame(res, columns=[column for column in df.columns if column != 'label'])

# Update dataframe
df.update(frame)

# Repartition so that the dataframe has only one partition along the column axis
# This workaround works
partitions = unwrap_partitions(df, axis=0)
df = from_partitions(partitions, axis=0)
# Using this instead of the two lines above leads to the IndexError
# df = df._repartition(axis=1)

# Fit a DTC model of sklearn
clf = DecisionTreeClassifier()
features = df[df.columns.drop(['label'])].to_numpy()
clf.fit(features, label)

Issue Description

I created a dataframe of shape (10000, 101).
To keep the df in a single partition along the column axis, I followed @YarShev's suggestion to set MinPartitionSize.
I then scaled the df with sklearn's RobustScaler and tried to fit a DTC model.
However, I found that the updated df was partitioned along the columns again, which made fitting take about twice as long.
So I tried to repartition the df along the columns only by calling df = df._repartition(axis=1), but that led to an IndexError.
I worked around the problem by calling unwrap_partitions and from_partitions instead.
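
The reasoning behind the MinPartitionSize setting can be sketched in plain Python. This is an illustration only: it assumes Modin sizes each partition as the ceiling of axis length / NPartitions, bounded below by MinPartitionSize. The helper names `chunk_size` and `partition_count` are hypothetical and are not part of Modin's API.

```python
import math


def chunk_size(axis_len: int, n_partitions: int, min_partition_size: int) -> int:
    """Assumed rule: ceil(axis_len / n_partitions), but never below the minimum."""
    return max(math.ceil(axis_len / n_partitions), min_partition_size)


def partition_count(axis_len: int, n_partitions: int, min_partition_size: int) -> int:
    """Number of partitions obtained when an axis is split by chunk_size."""
    size = chunk_size(axis_len, n_partitions, min_partition_size)
    return math.ceil(axis_len / size)


# Settings from the reproducer: NPartitions=5, MinPartitionSize=102.
row_parts = partition_count(10000, 5, 102)  # rows of the (10000, 101) frame
col_parts = partition_count(101, 5, 102)    # columns of the (10000, 101) frame
print(row_parts, col_parts)
```

Under that assumption the sketch gives 5 row partitions and 1 column partition for the (10000, 101) frame, which is the layout I expected to keep after the update.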

Expected Behavior

df._repartition(axis=1) should leave the updated df with only one partition along the column axis, and the repartitioned df should be usable as input to the DTC.

Error Logs

Traceback (most recent call last):
  File "D:\Work\Python\RayDemo3.8\aaaa.py", line 41, in <module>
    features = df[df.columns.drop(['label'])].to_numpy()
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\pandas\base.py", line 3138, in to_numpy
    return self._to_bare_numpy(
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\pandas\base.py", line 3119, in _to_bare_numpy
    return self._query_compiler.to_numpy(
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\core\storage_formats\pandas\query_compiler.py", line 376, in to_numpy
    arr = self._modin_frame.to_numpy(**kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\core\dataframe\pandas\dataframe\dataframe.py", line 3882, in to_numpy
    return self._partition_mgr_cls.to_numpy(self._partitions, **kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\core\execution\ray\generic\partitioning\partition_manager.py", line 43, in to_numpy
    parts = RayWrapper.materialize(
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\core\execution\ray\common\engine_wrapper.py", line 92, in materialize
    return ray.get(obj_id)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\ray\_private\auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\ray\_private\client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\ray\_private\worker.py", line 2667, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\ray\_private\worker.py", line 864, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(IndexError): ray::_apply_list_of_funcs() (pid=10084, ip=127.0.0.1)
  File "python\ray\_raylet.pyx", line 1889, in ray._raylet.execute_task
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\core\execution\ray\implementations\pandas_on_ray\partitioning\partition.py", line 440, in _apply_list_of_funcs
    partition = func(partition, *args, **kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\core\dataframe\pandas\partitioning\partition.py", line 217, in _iloc
    return df.iloc[row_labels, col_labels]
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\pandas\core\indexing.py", line 1097, in __getitem__
    return self._getitem_tuple(key)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\pandas\core\indexing.py", line 1594, in _getitem_tuple
    tup = self._validate_tuple_indexer(tup)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\pandas\core\indexing.py", line 904, in _validate_tuple_indexer
    self._validate_key(k, i)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\pandas\core\indexing.py", line 1516, in _validate_key
    raise IndexError("positional indexers are out-of-bounds")
IndexError: positional indexers are out-of-bounds
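
The last frames of the traceback show the failure coming from `df.iloc[row_labels, col_labels]` inside a partition. As a minimal sketch of that final step only, using plain pandas and a made-up 2x2 frame standing in for a partition (not the actual Modin partition contents), the same message appears when a positional list indexer points past the end of an axis:

```python
import pandas as pd

# Stand-in for a single partition: 2 rows x 2 columns (illustrative data).
part = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])

try:
    # Column position 2 does not exist in a 2-column frame.
    part.iloc[[0, 1], [0, 2]]
except IndexError as err:
    message = str(err)
else:
    message = None

print(message)
```

This is consistent with the positions passed to a partition no longer matching its actual shape after df._repartition(axis=1), although the sketch does not show where in Modin the mismatch is introduced.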

Installed Versions

INSTALLED VERSIONS

commit : 0c3746b
python : 3.8.10.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22000
machine : AMD64
processor : Intel64 Family 6 Model 151 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Chinese (Simplified)_China.936

Modin dependencies

modin : 0.23.1.post0
ray : 2.10.0
dask : 2023.5.0
distributed : None
hdk : None

pandas dependencies

pandas : 2.0.3
numpy : 1.24.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.0
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : 1.4.6
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2023.10.0
gcsfs : None
matplotlib : 3.7.4
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 15.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : 2.0.25
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
None

Labels: External, bug 🦗
