OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian bullseye/sid
Modin version (modin.__version__): 0.7.2
Python version: 3.7.5
Describe the problem
When applying a function to a DataFrame, calling to_numpy() on the result leads to the message reported below. Calling _to_pandas() first avoids the error.
In the applied function func, the result is gathered from the current DataFrame (d0) and from another one (d1) passed as an argument. The label of the data taken from d1 is derived from the current label of d0.
Note 1: a first call to func() is made with an empty DataFrame and is therefore skipped. I can't find the reason for this first call.
Note 2: in the call to apply(), csv3 is passed through _to_pandas() to prevent another error. I can file another bug report if needed.
import modin.pandas
import pandas as pd  # assumed: the report does not show which module pd is bound to

csv2 = modin.pandas.read_csv('test.dat', header=0, index_col=[0, 1])
csv3 = modin.pandas.read_csv('test2.dat', header=0, index_col=[0, 1])

def func(d0, d1=None):
    if d0.empty:
        # Needed to skip first call with no data
        print('empty !')
        return None
    # Shift the second index level of the current label by 10 to find the row in d1
    l = list(d0.name)
    l[1] = l[1] + 10
    return pd.Series([d0.values, (d1.loc[tuple(l)]).values],
                     index=['a', 'b'], name='pouet')

r = csv2.apply(func, axis=1, result_type='expand', d1=csv3._to_pandas())
rn = r.to_numpy()
# Workaround: rn = r._to_pandas().to_numpy()
The output:
FutureWarning: pandas.core.index is deprecated and will be removed in a future version. The public classes are available in the top-level namespace.
UserWarning: User-defined function verification is still under development in Modin. The function provided is not verified.
empty !
Traceback (most recent call last):
  File "./test2.py", line 24, in <module>
    rn = r.to_numpy()
  File "~/.local/lib/python3.7/site-packages/modin/pandas/base.py", line 2962, in to_numpy
    arr = self._query_compiler.to_numpy()
  File "~/.local/lib/python3.7/site-packages/modin/backends/pandas/query_compiler.py", line 169, in to_numpy
    len(arr) != len(self.index) or len(arr[0]) != len(self.columns)
  File "~/.local/lib/python3.7/site-packages/modin/error_message.py", line 40, in catch_bugs_and_request_email
    " caused this error.\n{}".format(extra_log)
Exception: Internal Error. Please email bug_reports@modin.org with the traceback and command that caused this error.
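For comparison, here is a minimal sketch of the same apply() pattern under plain pandas. The contents of test.dat and test2.dat are not shown in this report, so the two small frames below (d0_frame, d1_frame) are made-up stand-ins, chosen only so that every label in the first frame has a counterpart in the second with its second index level shifted by 10.

```python
import pandas as pd

# Hypothetical stand-ins for test.dat and test2.dat (two-level index, two columns).
idx0 = pd.MultiIndex.from_tuples([(0, 1), (0, 2)], names=["i", "j"])
idx1 = pd.MultiIndex.from_tuples([(0, 11), (0, 12)], names=["i", "j"])
d0_frame = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]}, index=idx0)
d1_frame = pd.DataFrame({"x": [10.0, 20.0], "y": [30.0, 40.0]}, index=idx1)


def func(d0, d1=None):
    # d0 is one row of d0_frame; its name is the (i, j) label of that row.
    label = list(d0.name)
    label[1] = label[1] + 10  # matching row in d1 has j shifted by 10
    return pd.Series([d0.values, d1.loc[tuple(label)].values],
                     index=["a", "b"], name="pouet")


# Each returned Series becomes one row with columns 'a' and 'b';
# every cell holds a NumPy array.
r = d0_frame.apply(func, axis=1, result_type="expand", d1=d1_frame)
rn = r.to_numpy()
```

With these stand-in frames the result has one row per input row and the two columns a and b, which is the shape that to_numpy() is expected to return in the Modin case as well.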
Thanks for the report @agardelein! Complex apply functions that change the size/shape of the dataframe can be a real challenge internally. The error is raised when the internal metadata and the external metadata do not match.
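To illustrate the kind of mismatch involved, the condition visible in the traceback compares the dimensions of the materialized array with the lengths of the index and column labels. The sketch below restates that comparison as a standalone helper. The name labels_match_data is hypothetical and this is not Modin's code, only a reading of the line shown in the traceback.

```python
import numpy as np


def labels_match_data(arr, index, columns):
    """Return True when arr has one row per index label and one column per column label.

    Hypothetical helper mirroring the condition in the traceback:
    len(arr) != len(index) or len(arr[0]) != len(columns)
    """
    if len(arr) != len(index):
        return False
    # Guard against an empty array before looking at its first row.
    return len(arr) == 0 or len(arr[0]) == len(columns)


# Labels and data agree: 2 rows, 3 columns.
consistent = labels_match_data(np.zeros((2, 3)), [0, 1], ["a", "b", "c"])
# Labels claim 2 columns but the data has 3, as can happen when a UDF
# changes the shape of the result without the labels being updated.
inconsistent = labels_match_data(np.zeros((2, 3)), [0, 1], ["a", "b"])
```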
I am going to put this on 0.7.4 because of the effort involved. It will take some time to do the code analysis component of the UDF to see what the user needs.
We also do not have a capability yet to pass an entire other dataframe to a UDF in apply as you said. Feel free to open a feature request for that as we do want to support it in the future.
Source code / logs
The datafile test.dat
The datafile test2.dat