TypeError in case of attempt of appending new column when frame has duplicated timestamp indices #2442

@dchigarev

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Any
  • Modin version (modin.__version__): 3e32d02
  • Python version: 3.7.5
  • Code we can use to reproduce:
import modin.pandas as pd
import pandas
import numpy as np

data = {"a": np.arange(4)}

# Index with duplicated timestamp
index = pd.to_datetime(["2018-06-05", "2018-06-05", "2018-06-07", "2018-06-08"])

md_df, pd_df = pd.DataFrame(data, index=index), pandas.DataFrame(data, index=index)

pd_df["b"] = pandas.Series(np.zeros(len(pd_df))) # Works fine
md_df["c"] = np.zeros(len(md_df)) # Works fine
md_df["b"] = pd.Series(np.zeros(len(md_df))) # TypeError
Traceback
Traceback (most recent call last):
  File "test_outer.py", line 13, in <module>
    md_df["b"] = pd.Series(np.zeros(len(md_df))) # TypeError
  File "/localdisk/dchigare/repos/modin_bp/modin/pandas/dataframe.py", line 1978, in __setitem__
    join="left",
  File "/localdisk/dchigare/repos/modin_bp/modin/backends/pandas/query_compiler.py", line 303, in concat
    new_modin_frame = self._modin_frame._concat(axis, other_modin_frame, join, sort)
  File "/localdisk/dchigare/repos/modin_bp/modin/engines/base/frame/data.py", line 1850, in _concat
    axis ^ 1, others, how, sort, force_repartition=True
  File "/localdisk/dchigare/repos/modin_bp/modin/engines/base/frame/data.py", line 1701, in _copartition
    joined_index = self._join_index_objects(axis, index_other_obj, how, sort)
  File "/localdisk/dchigare/repos/modin_bp/modin/engines/base/frame/data.py", line 980, in _join_index_objects
    joined_obj = merge_index(joined_obj, obj)
  File "/localdisk/dchigare/repos/modin_bp/modin/engines/base/frame/data.py", line 974, in merge_index
    return obj1.join(obj2, how=how, sort=sort)
  File "/localdisk/dchigare/miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/indexes/datetimelike.py", line 893, in join
    sort=sort,
  File "/localdisk/dchigare/miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3483, in join
    return this.join(other, how=how, return_indexers=return_indexers)
  File "/localdisk/dchigare/miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3494, in join
    other, how=how, return_indexers=return_indexers
  File "/localdisk/dchigare/miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3815, in _join_monotonic
    join_index, lidx, ridx = self._left_indexer(sv, ov)
  File "/localdisk/dchigare/miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 257, in _left_indexer
    return libjoin.left_join_indexer(left, right)
  File "pandas/_libs/join.pyx", line 357, in pandas._libs.join.left_join_indexer
TypeError: '<' not supported between instances of 'Timestamp' and 'int'

Describe the problem

The problem appears only when assigning a Modin Series. If the value being assigned has a query compiler, `__setitem__` takes the branch that concatenates the frame with the passed Series:

if isinstance(value, Series):
    if len(self.columns) == 0:
        self._query_compiler = value._query_compiler.copy()
    else:
        self._create_or_update_from_compiler(
            self._query_compiler.concat(
                1,
                value._query_compiler,
                join="left",
            ),
            inplace=True,
        )

The call chain `query_compiler.concat` -> `modin_frame._concat` -> `modin_frame._copartition` joins the two indices:

joined_index = self._join_index_objects(axis, index_other_obj, how, sort)

This join cannot be performed between a non-unique Timestamp index and an integer index, so it fails here with the TypeError shown in the traceback.

How does pandas handle this case?

In pandas, insert is separate from the concat/join/union functions, so it does not join indices at all. It goes from insert -> _sanitize_column, which handles an index mismatch by reindexing the value to the frame's own index.
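The reindexing behavior described above can be illustrated with plain pandas. This is a minimal sketch and not Modin code: the variable names `int_indexed` and `ts_indexed` and the sample values are assumptions made for the illustration. It shows that labels missing from the value become NaN, and that a duplicated label in the frame's index simply receives the same value twice.

```python
import numpy as np
import pandas

# Same frame as in the reproducer: the first timestamp is duplicated
index = pandas.to_datetime(["2018-06-05", "2018-06-05", "2018-06-07", "2018-06-08"])
pd_df = pandas.DataFrame({"a": np.arange(4)}, index=index)

# Integer-indexed value: none of its labels exist in the timestamp index,
# so reindexing to pd_df.index leaves every row as NaN
int_indexed = pandas.Series(np.zeros(len(pd_df)))
pd_df["b"] = int_indexed

# Timestamp-indexed value with unique labels (hypothetical sample data):
# DataFrame.insert aligns it to pd_df.index, repeating the value for the
# duplicated label
ts_indexed = pandas.Series(
    [10.0, 20.0, 30.0],
    index=pandas.to_datetime(["2018-06-05", "2018-06-07", "2018-06-08"]),
)
pd_df.insert(len(pd_df.columns), "c", ts_indexed)

print(pd_df)
```

Neither assignment joins the two indices. The value is only looked up by the frame's labels, which is why the duplicated timestamp does not cause a problem in pandas.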

Proposal to fix

Replace the call to `query_compiler.concat` with `DataFrame.insert`, which will be responsible for matching the indices without joining them.

Temporary workarounds

Since the problem appears only with values that have a query compiler, all we need to do is pass a non-Modin object as the value to assign:

try:
    df["new_col1"] = pd.Series(np.arange(5))
except TypeError:  # the error raised in the traceback above
    # workaround for #2442
    df["new_col1"] = np.arange(5)

try:
    df["new_col2"] = some_series
except TypeError:  # the error raised in the traceback above
    # workaround for #2442
    df["new_col2"] = some_series.values
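One caveat worth keeping in mind: passing `.values` assigns positionally, while assigning a Series aligns by index label, so the two paths can produce different columns. The sketch below uses plain pandas as a stand-in so that it runs without Modin; the column names `aligned` and `positional` are hypothetical, and it assumes Modin follows pandas for array values, as the `np.zeros` assignment in the reproducer suggests.

```python
import numpy as np
import pandas

index = pandas.to_datetime(["2018-06-05", "2018-06-05", "2018-06-07", "2018-06-08"])
df = pandas.DataFrame({"a": np.arange(4)}, index=index)

# Integer-indexed Series, as in the reproducer
some_series = pandas.Series(np.arange(4, dtype="float64"))

# Label-aligned assignment: no labels match, so the column is all NaN
df["aligned"] = some_series

# Workaround-style assignment: the raw array is written row by row
df["positional"] = some_series.values

print(df)
```

If label alignment is what the calling code relies on, the positional workaround changes the result, so it may be safer to reindex the value to the frame's index before extracting the array.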

Labels

bug 🦗Something isn't working
