Description
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Any
- Modin version (modin.__version__): 3e32d02
- Python version: 3.7.5
- Code we can use to reproduce:
import modin.pandas as pd
import pandas
import numpy as np
data = {"a": np.arange(4)}
# Index with duplicated timestamp
index = pd.to_datetime(["2018-06-05", "2018-06-05", "2018-06-07", "2018-06-08"])
md_df, pd_df = pd.DataFrame(data, index=index), pandas.DataFrame(data, index=index)
pd_df["b"] = pandas.Series(np.zeros(len(pd_df))) # Works fine
md_df["c"] = np.zeros(len(md_df)) # Works fine
md_df["b"] = pd.Series(np.zeros(len(md_df))) # TypeError
Traceback
Traceback (most recent call last):
File "test_outer.py", line 13, in <module>
md_df["b"] = pd.Series(np.zeros(len(md_df))) # TypeError
File "/localdisk/dchigare/repos/modin_bp/modin/pandas/dataframe.py", line 1978, in __setitem__
join="left",
File "/localdisk/dchigare/repos/modin_bp/modin/backends/pandas/query_compiler.py", line 303, in concat
new_modin_frame = self._modin_frame._concat(axis, other_modin_frame, join, sort)
File "/localdisk/dchigare/repos/modin_bp/modin/engines/base/frame/data.py", line 1850, in _concat
axis ^ 1, others, how, sort, force_repartition=True
File "/localdisk/dchigare/repos/modin_bp/modin/engines/base/frame/data.py", line 1701, in _copartition
joined_index = self._join_index_objects(axis, index_other_obj, how, sort)
File "/localdisk/dchigare/repos/modin_bp/modin/engines/base/frame/data.py", line 980, in _join_index_objects
joined_obj = merge_index(joined_obj, obj)
File "/localdisk/dchigare/repos/modin_bp/modin/engines/base/frame/data.py", line 974, in merge_index
return obj1.join(obj2, how=how, sort=sort)
File "/localdisk/dchigare/miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/indexes/datetimelike.py", line 893, in join
sort=sort,
File "/localdisk/dchigare/miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3483, in join
return this.join(other, how=how, return_indexers=return_indexers)
File "/localdisk/dchigare/miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3494, in join
other, how=how, return_indexers=return_indexers
File "/localdisk/dchigare/miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3815, in _join_monotonic
join_index, lidx, ridx = self._left_indexer(sv, ov)
File "/localdisk/dchigare/miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 257, in _left_indexer
return libjoin.left_join_indexer(left, right)
File "pandas/_libs/join.pyx", line 357, in pandas._libs.join.left_join_indexer
TypeError: '<' not supported between instances of 'Timestamp' and 'int'
Describe the problem
The problem appears only when the assigned value is a Modin Series. If the value has a query compiler, __setitem__ takes the branch that concatenates the frame with the passed Series:
modin/modin/pandas/dataframe.py
Lines 1970 to 1981 in 3e32d02
if isinstance(value, Series):
    if len(self.columns) == 0:
        self._query_compiler = value._query_compiler.copy()
    else:
        self._create_or_update_from_compiler(
            self._query_compiler.concat(
                1,
                value._query_compiler,
                join="left",
            ),
            inplace=True,
        )
The call chain qc::concat -> modin_frame::_concat -> modin_frame::_copartition joins the indices of both frames:
modin/modin/engines/base/frame/data.py
Line 1750 in 3e32d02
joined_index = self._join_index_objects(axis, index_other_obj, how, sort)
This join cannot be performed between a non-unique Timestamp index and an integer index, so the assignment fails here.
How does pandas handle this case?
In pandas, insert is separate from the concat/join/union functions, so it does not join indices at all. The call goes insert -> _sanitize_column, which handles the index mismatch by reindexing the value to the frame's own index.
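The reindexing behaviour described above can be observed with plain pandas. The snippet below is a minimal sketch that reuses the data from the reproducer; the variable names `value` and `explicit` are introduced here only for illustration. Because the Series has a default integer index and the frame has a datetime index, no labels match, so the new column should come out as all NaN rather than raising.

```python
import numpy as np
import pandas

data = {"a": np.arange(4)}
# Same index as in the reproducer: the first timestamp is duplicated
index = pandas.to_datetime(["2018-06-05", "2018-06-05", "2018-06-07", "2018-06-08"])
pd_df = pandas.DataFrame(data, index=index)

# Series with a default integer index (0..3), unrelated to the frame's index
value = pandas.Series(np.zeros(len(pd_df)))

# Column assignment aligns `value` to the frame's index instead of joining indices
pd_df["b"] = value

# Reindexing explicitly shows the same alignment: no label matches, so every entry is missing
explicit = value.reindex(pd_df.index)

all_nan_after_setitem = bool(pd_df["b"].isna().all())
all_nan_after_reindex = bool(explicit.isna().all())
print(all_nan_after_setitem, all_nan_after_reindex)
```

Reindexing only requires the index of the value to be unique, which is why the duplicated timestamp in the frame's index is not a problem on this path.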
Proposal to fix
Replace the call to query_compiler.concat with DataFrame.insert, which would be responsible for matching the indices without joining them.
Temporary workarounds
Since the problem appears only with values that have a query compiler, the workaround is to pass a non-Modin object as the value to assign:
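try:
    df["new_col1"] = pd.Series(np.arange(5))
except TypeError:  # the error raised in the traceback above
    # workaround for #2442: assign a NumPy array instead of a Modin Series
    df["new_col1"] = np.arange(5)
try:
    df["new_col2"] = some_series
except TypeError:
    # workaround for #2442: assign the underlying values instead of the Modin Series
    df["new_col2"] = some_series.values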
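One caveat with the workaround, sketched below in plain pandas so it runs without Modin: assigning raw values is positional, while assigning a Series aligns on the index. The column names `aligned` and `positional` and the sample `some_series` are hypothetical and chosen only for this illustration. The two forms should give different results whenever the Series index does not match the frame index.

```python
import numpy as np
import pandas

index = pandas.to_datetime(["2018-06-05", "2018-06-05", "2018-06-07", "2018-06-08"])
df = pandas.DataFrame({"a": np.arange(4)}, index=index)

# Hypothetical Series with a default integer index (0..3)
some_series = pandas.Series(np.arange(4) * 10)

# Series assignment aligns on index labels; none match, so the column is all NaN
df["aligned"] = some_series

# Assigning the raw values ignores the index and fills the column row by row
df["positional"] = some_series.values

aligned_is_nan = bool(df["aligned"].isna().all())
positional_values = df["positional"].tolist()
print(aligned_is_nan, positional_values)
```

If index alignment is what the calling code relies on, the positional workaround changes the result, so it is worth checking which behaviour is intended before applying it.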