Internal Error when calling DataFrame.to_numpy on DataFrame.apply() result #1154

agardelein · 2020-03-22T08:23:09Z

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian bullseye/sid
Modin version (modin.__version__): 0.7.2
Python version: 3.7.5

Describe the problem

When a function is applied to a DataFrame, calling to_numpy() on the result leads to the error reported below. Calling _to_pandas() on the result first avoids the error.

In the applied function func, the result is gathered from the current DataFrame (d0) and another one (d1) passed as an argument. The label used to look up data in d1 is derived from the current label of d0.

Note 1: a first call to func() is made with an empty DataFrame, so it is discarded. I can't find the reason for this first call.
Note 2: in the call to apply(), csv3 is passed via _to_pandas() to prevent another error. I can file another bug report if needed.

Source code / logs

The datafile test.dat

p0,p1,data0,data1,data2
A,0,10,11,12
B,1,20,21,22
C,2,30,31,32
D,3,40,41,42
E,4,50,51,52

The datafile test2.dat

p0,p1,data0,data1,data2
A,10,110,111,112
B,11,120,121,122
C,12,130,131,132
D,13,140,141,142
E,14,150,151,152

The source code:

import modin.pandas
import pandas as pd  # assumed: the report does not show what `pd` refers to

csv2 = modin.pandas.read_csv('test.dat', header=0, index_col=[0, 1])
csv3 = modin.pandas.read_csv('test2.dat', header=0, index_col=[0, 1])

def func(d0, d1=None):
    if d0.empty:
        # Needed to skip first call with no data
        print('empty !')
        return None
    # Derive the matching label in d1 from the current label of d0
    l = list(d0.name)
    l[1] = l[1] + 10
    return pd.Series([d0.values, (d1.loc[tuple(l)]).values],
                     index=['a', 'b'], name='pouet')

r = csv2.apply(func, axis=1, result_type='expand', d1=csv3._to_pandas())
rn = r.to_numpy()  # raises the Internal Error shown below
# Workaround: rn = r._to_pandas().to_numpy()

The output:

FutureWarning: pandas.core.index is deprecated and will be removed in a future version.  The public classes are available in the top-level namespace.
UserWarning: User-defined function verification is still under development in Modin. The function provided is not verified.
empty !
Traceback (most recent call last):
  File "./test2.py", line 24, in <module>
    rn = r.to_numpy()
  File "~/.local/lib/python3.7/site-packages/modin/pandas/base.py", line 2962, in to_numpy
    arr = self._query_compiler.to_numpy()
  File "~/.local/lib/python3.7/site-packages/modin/backends/pandas/query_compiler.py", line 169, in to_numpy
    len(arr) != len(self.index) or len(arr[0]) != len(self.columns)
  File "~/.local/lib/python3.7/site-packages/modin/error_message.py", line 40, in catch_bugs_and_request_email
    " caused this error.\n{}".format(extra_log)
Exception: Internal Error. Please email bug_reports@modin.org with the traceback and command that caused this error.
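For comparison, the same apply can be sketched with plain pandas only, which is roughly the behavior the Modin call is expected to match. This is a hedged baseline, not part of the original report: the two data files are inlined as strings, the names TEST_DAT, TEST2_DAT, and combine_rows are made up for the sketch, and the lookup uses the explicit `.loc[label, :]` form.

```python
import io

import pandas as pd

# Contents of test.dat and test2.dat from the report, inlined for the sketch.
TEST_DAT = """p0,p1,data0,data1,data2
A,0,10,11,12
B,1,20,21,22
C,2,30,31,32
D,3,40,41,42
E,4,50,51,52
"""
TEST2_DAT = """p0,p1,data0,data1,data2
A,10,110,111,112
B,11,120,121,122
C,12,130,131,132
D,13,140,141,142
E,14,150,151,152
"""

d0 = pd.read_csv(io.StringIO(TEST_DAT), header=0, index_col=[0, 1])
d1 = pd.read_csv(io.StringIO(TEST2_DAT), header=0, index_col=[0, 1])


def combine_rows(row, other=None):
    # Hypothetical stand-in for func() in the report: pair each row of d0
    # with the row of `other` whose second index level is shifted by 10.
    label = list(row.name)
    label[1] = label[1] + 10
    return pd.Series([row.values, other.loc[tuple(label), :].values],
                     index=['a', 'b'])


r = d0.apply(combine_rows, axis=1, result_type='expand', other=d1)
rn = r.to_numpy()  # object array; each cell holds a NumPy array
```

With plain pandas, `r` has one row per input row and the two columns 'a' and 'b', so `to_numpy()` is expected to return a 5 x 2 object array.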


devin-petersohn · 2020-03-23T15:45:11Z

Thanks for the report @agardelein! Complex apply functions that change the size/shape of the dataframe can be a real challenge internally. The error is thrown when the internal metadata and external metadata do not match.
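The condition visible in the traceback (`len(arr) != len(self.index) or len(arr[0]) != len(self.columns)`) suggests the kind of consistency check involved. The following is a minimal pure-Python sketch of such a check, not Modin's actual implementation: the helper name shape_matches_metadata is hypothetical, and the "stale" column labels are only an illustration of how a mismatch could look, since the thread does not say which side was out of date.

```python
def shape_matches_metadata(arr, index, columns):
    """Return True when the 2-D data agrees with the row and column labels."""
    if len(arr) != len(index):
        return False
    # An empty array has no first row to inspect.
    return len(arr) == 0 or len(arr[0]) == len(columns)


index = ["A", "B", "C", "D", "E"]
data = [[0, 1] for _ in index]  # 5 rows x 2 columns, like the apply() result

# Labels that describe the data correctly.
consistent = shape_matches_metadata(data, index, ["a", "b"])

# Hypothetical stale labels left over from the 3-column input frame.
stale = shape_matches_metadata(data, index, ["data0", "data1", "data2"])

print(consistent, stale)  # prints: True False
```

When a check of this kind fails, the data and its labels disagree, which is the situation that triggers the "Internal Error" message in the report.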

I am going to put this on 0.7.4 because of the effort involved. It will take some time to do the code analysis component of the UDF to see what the user needs.

We also do not have a capability yet to pass an entire other dataframe to a UDF in apply as you said. Feel free to open a feature request for that as we do want to support it in the future.


agardelein added the bug 🦗 Something isn't working label Mar 22, 2020

devin-petersohn added this to the 0.7.4 milestone Mar 23, 2020

dchigarev self-assigned this Jul 28, 2020

dchigarev mentioned this issue Jul 28, 2020

FIX-#1154: properly process UDFs #1845

Merged

6 tasks

dchigarev added a commit to dchigarev/modin that referenced this issue Jul 29, 2020

FIX-modin-project#1154: properly process UDFs

f2e7956

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

devin-petersohn closed this as completed in #1845 Jul 29, 2020

devin-petersohn pushed a commit that referenced this issue Jul 29, 2020

FIX-#1154: properly process UDFs (#1845)

9a65e77

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

aregm pushed a commit to aregm/modin that referenced this issue Sep 16, 2020

FIX-modin-project#1154: properly process UDFs (modin-project#1845)

0e826ac

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
