REGR: AssertionError when subtracting Timestamp-valued DataFrames with non-indentical column index #31623

Closed
DomKennedy opened this issue Feb 3, 2020 · 8 comments · Fixed by #31679
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Numeric Operations Arithmetic, Comparison, and Logical operations Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@DomKennedy

DomKennedy commented Feb 3, 2020

import pandas as pd

df = pd.DataFrame(
    {
        "foo": [pd.Timestamp("2019"), pd.Timestamp("2020")],
        "bar": [pd.Timestamp("2018"), pd.Timestamp("2021")],
    }
)

df2 = df[["foo"]]

print(df - df2)

Problem description

The above snippet raises the following exception:

Traceback (most recent call last):
  File ".venv/lib/python3.6/site-packages/pandas/core/ops/array_ops.py", line 149, in na_arithmetic_op
    result = expressions.evaluate(op, str_rep, left, right)
  File ".venv/lib/python3.6/site-packages/pandas/core/computation/expressions.py", line 208, in evaluate
    return _evaluate(op, op_str, a, b)
  File ".venv/lib/python3.6/site-packages/pandas/core/computation/expressions.py", line 70, in _evaluate_standard
    return op(a, b)
  File ".venv/lib/python3.6/site-packages/pandas/core/ops/common.py", line 64, in new_method
    return method(self, other)
  File ".venv/lib/python3.6/site-packages/pandas/core/ops/__init__.py", line 500, in wrapper
    result = arithmetic_op(lvalues, rvalues, op, str_rep)
  File ".venv/lib/python3.6/site-packages/pandas/core/ops/array_ops.py", line 192, in arithmetic_op
    res_values = dispatch_to_extension_op(op, lvalues, rvalues)
  File ".venv/lib/python3.6/site-packages/pandas/core/ops/dispatch.py", line 125, in dispatch_to_extension_op
    res_values = op(left, right)
  File ".venv/lib/python3.6/site-packages/pandas/core/arrays/datetimelike.py", line 1390, in __rsub__
    f"cannot subtract {type(self).__name__} from {type(other).__name__}"
TypeError: cannot subtract DatetimeArray from ndarray

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "pandas_bug.py", line 36, in <module>
    print(df2 - df)
  File ".venv/lib/python3.6/site-packages/pandas/core/ops/__init__.py", line 703, in f
    new_data = left._combine_frame(right, pass_op, fill_value)
  File ".venv/lib/python3.6/site-packages/pandas/core/frame.py", line 5297, in _combine_frame
    new_data = ops.dispatch_to_series(self, other, _arith_op)
  File ".venv/lib/python3.6/site-packages/pandas/core/ops/__init__.py", line 416, in dispatch_to_series
    new_data = expressions.evaluate(column_op, str_rep, left, right)
  File ".venv/lib/python3.6/site-packages/pandas/core/computation/expressions.py", line 208, in evaluate
    return _evaluate(op, op_str, a, b)
  File ".venv/lib/python3.6/site-packages/pandas/core/computation/expressions.py", line 70, in _evaluate_standard
    return op(a, b)
  File ".venv/lib/python3.6/site-packages/pandas/core/ops/__init__.py", line 385, in column_op
    return {i: func(a.iloc[:, i], b.iloc[:, i]) for i in range(len(a.columns))}
  File ".venv/lib/python3.6/site-packages/pandas/core/ops/__init__.py", line 385, in <dictcomp>
    return {i: func(a.iloc[:, i], b.iloc[:, i]) for i in range(len(a.columns))}
  File ".venv/lib/python3.6/site-packages/pandas/core/ops/array_ops.py", line 121, in na_op
    return na_arithmetic_op(x, y, op, str_rep)
  File ".venv/lib/python3.6/site-packages/pandas/core/ops/array_ops.py", line 151, in na_arithmetic_op
    result = masked_arith_op(left, right, op)
  File ".venv/lib/python3.6/site-packages/pandas/core/ops/array_ops.py", line 75, in masked_arith_op
    assert isinstance(x, np.ndarray), type(x)

This is a 1.0.0 regression; in 0.25.3, the operation succeeds and the unmatched bar column is filled with NaN in the output.

The same error occurs with:

  • Any combination of incompatible columns (strict subset, strict superset, overlapping, disjoint)
  • Calling the subtract method instead of using the subtraction operator
  • Timezone-aware Timestamps as well as timezone-naive

It does not seem to occur with:

  • Mismatches on the row index; transposing the dataframes in the above example prevents the error from occurring.
  • pd.Series objects with mismatched indexes (e.g. calling the above on the first row of each dataframe works fine)
  • Other dtypes: bool, float, and int seem to work fine. Similarly, if the dataframes are explicitly cast to dtype object, the operation succeeds.
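
For comparison, here is a minimal sketch of the Series case from the list above, reusing the df and df2 from the report. It only illustrates the shape of the operation (first row of each frame, mismatched indexes); the variable name row_diff is an illustrative choice, not something from the report.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "foo": [pd.Timestamp("2019"), pd.Timestamp("2020")],
        "bar": [pd.Timestamp("2018"), pd.Timestamp("2021")],
    }
)
df2 = df[["foo"]]

# First row of each frame: the Series indexes are ["foo", "bar"] and ["foo"].
# Series arithmetic aligns on the index, so "bar" has no partner on the right.
row_diff = df.iloc[0] - df2.iloc[0]
print(row_diff)
```

The unmatched "bar" entry comes back as a missing value, while "foo" is a zero timedelta.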

Expected Output

   bar    foo
0  NaN 0 days
1  NaN 0 days

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.6.8.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-74-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.0.0
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 41.6.0
Cython : None
pytest : 5.3.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.5
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None


@DomKennedy DomKennedy changed the title REGR: AssertionError when when subtracting Timestamp-valued DataFrames with non-indentical column index REGR: AssertionError when subtracting Timestamp-valued DataFrames with non-indentical column index Feb 3, 2020
@TomAugspurger
Contributor

Thanks for the report. The NaNs are introduced in

self, other = _align_method_FRAME(self, other, axis, flex=True, level=level)

which calls DataFrame.align.

I wonder, should this be changed?

In [6]: df.align(df2)[1]
Out[6]:
   bar        foo
0  NaN 2019-01-01
1  NaN 2020-01-01

to have bar be datetime64[ns] dtype, to match the left?
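
A small sketch of how to inspect the dtype mismatch being discussed, assuming the df and df2 from the original report. The exact dtype of the filled "bar" column on the right-hand side may differ between pandas versions, so the snippet just prints it instead of assuming a value.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "foo": [pd.Timestamp("2019"), pd.Timestamp("2020")],
        "bar": [pd.Timestamp("2018"), pd.Timestamp("2021")],
    }
)
df2 = df[["foo"]]

# DataFrame.align returns both frames reindexed to the union of their labels.
left, right = df.align(df2)

# On the left, "bar" keeps its datetime dtype; on the right, it is a newly
# created all-missing column whose dtype is whatever the fill value implies.
print(left.dtypes)
print(right.dtypes)
```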

@TomAugspurger TomAugspurger added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Numeric Operations Arithmetic, Comparison, and Logical operations labels Feb 3, 2020
@TomAugspurger
Contributor

cc @jbrockmendel.

@jbrockmendel
Member

I'll look at this today

@jorisvandenbossche jorisvandenbossche added this to the 1.0.1 milestone Feb 4, 2020
@jorisvandenbossche jorisvandenbossche added the Regression Functionality that used to work in a prior pandas version label Feb 4, 2020
@jbrockmendel
Member

So this is pretty ugly, but one option that tentatively works is to patch ops._arith_method_FRAME so that we only operate on shared columns, then reindex the result.

@jbrockmendel
Member

might actually improve perf for cases where we have very few shared columns
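
To make the idea concrete, here is a rough user-level sketch of "operate on shared columns, then reindex". It is not the actual patch to ops._arith_method_FRAME; the helper name sub_on_shared_columns is hypothetical, and it assumes both frames already share the same row index.

```python
import pandas as pd


def sub_on_shared_columns(left, right):
    """Subtract only the columns both frames have, then restore the rest.

    Hypothetical helper for illustration; assumes identical row indexes.
    """
    shared = left.columns.intersection(right.columns)
    all_columns = left.columns.union(right.columns)
    # The arithmetic never sees the unmatched columns, so no NaN-filled
    # placeholder column is ever subtracted from a datetime column.
    result = left[shared] - right[shared]
    # Reindexing adds the unmatched columns back as all-missing columns.
    return result.reindex(columns=all_columns)


df = pd.DataFrame(
    {
        "foo": [pd.Timestamp("2019"), pd.Timestamp("2020")],
        "bar": [pd.Timestamp("2018"), pd.Timestamp("2021")],
    }
)
df2 = df[["foo"]]

result = sub_on_shared_columns(df, df2)
print(result)
```

With few shared columns, the subtraction only touches those columns, which is where the possible performance benefit would come from.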

@TomAugspurger
Contributor

TomAugspurger commented Feb 4, 2020

That seems reasonable. The alternative is to ensure that the correct fill_value is used in align, which seems difficult since we'd potentially have different fill values for different columns / dtypes.

Is that likely to cause issues with methods like DataFrame.add? I forget whether the fill_value from add is done before or after the op.

@jbrockmendel
Member

The alternative is to ensure that the correct fill_value is used in align, which seems difficult since we'd potentially have different fill values for different columns / dtypes.

yah, it would also depend on op, which would become a nightmare.

I'll put up a proof of concept in a bit

@TomAugspurger
Contributor

@DomKennedy in the meantime, here's a workaround

In [14]: import operator

In [15]: operator.sub(*df.align(df2, fill_value=pd.NaT))
Out[15]:
  bar    foo
0 NaT 0 days
1 NaT 0 days

There are lots of issues with that (if you have other columns that don't align, NaT won't be the right fill value) but hopefully not too bad for now.

We'll try to get this fixed properly for 1.0.2.
