
BUG: [REGRESSION] concat fails when concating two objects with overlapping MultiIndex IntervalIndex levels #54934

Closed
@johannes-mueller

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

ivl1 = pd.IntervalIndex.from_breaks([0.0, 1.0, 2.0])
ivl2 = pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5])

mi1 = pd.MultiIndex.from_product([ivl1, ivl1])
mi2 = pd.MultiIndex.from_product([ivl2, ivl2])
s1 = pd.Series(1, index=mi1)
s2 = pd.Series(2, index=mi2)

expected_idx = pd.MultiIndex.from_tuples(
    [
        (pd.Interval(0.0, 1.0), pd.Interval(0.0, 1.0)),
        (pd.Interval(0.0, 1.0), pd.Interval(1.0, 2.0)),
        (pd.Interval(1.0, 2.0), pd.Interval(0.0, 1.0)),
        (pd.Interval(1.0, 2.0), pd.Interval(1.0, 2.0)),
        (pd.Interval(0.5, 1.5), pd.Interval(0.5, 1.5)),
        (pd.Interval(0.5, 1.5), pd.Interval(1.5, 2.5)),
        (pd.Interval(1.5, 2.5), pd.Interval(0.5, 1.5)),
        (pd.Interval(1.5, 2.5), pd.Interval(1.5, 2.5))
    ]
)
expected = pd.Series([1, 1, 1, 1, 2, 2, 2, 2], index=expected_idx)

result = pd.concat([s1, s2])

pd.testing.assert_series_equal(result, expected)

Issue Description

The code crashes with the following traceback:

Traceback (most recent call last):
  File "/tmp/concattest.py", line 26, in <module>
    result = pd.concat([s1, s2])
  File "/home/jmu3si/Devel/pandas/pandas/core/reshape/concat.py", line 393, in concat
    return op.get_result()
  File "/home/jmu3si/Devel/pandas/pandas/core/reshape/concat.py", line 640, in get_result
    new_index = self.new_axes[0]
  File "pandas/_libs/properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
    val = self.fget(obj)
  File "/home/jmu3si/Devel/pandas/pandas/core/reshape/concat.py", line 698, in new_axes
    return [
  File "/home/jmu3si/Devel/pandas/pandas/core/reshape/concat.py", line 699, in <listcomp>
    self._get_concat_axis if i == self.bm_axis else self._get_comb_axis(i)
  File "pandas/_libs/properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
    val = self.fget(obj)
  File "/home/jmu3si/Devel/pandas/pandas/core/reshape/concat.py", line 756, in _get_concat_axis
    concat_axis = _concat_indexes(indexes)
  File "/home/jmu3si/Devel/pandas/pandas/core/reshape/concat.py", line 774, in _concat_indexes
    return indexes[0].append(indexes[1:])
  File "/home/jmu3si/Devel/pandas/pandas/core/indexes/multi.py", line 2184, in append
    level_codes = [
  File "/home/jmu3si/Devel/pandas/pandas/core/indexes/multi.py", line 2185, in <listcomp>
    recode_for_categories(
  File "/home/jmu3si/Devel/pandas/pandas/core/arrays/categorical.py", line 2951, in recode_for_categories
    new_categories.get_indexer(old_categories), new_categories
  File "/home/jmu3si/Devel/pandas/pandas/core/indexes/base.py", line 3845, in get_indexer
    raise InvalidIndexError(self._requires_unique_msg)
pandas.errors.InvalidIndexError: cannot handle overlapping indices; use IntervalIndex.get_indexer_non_unique

This used to work in 2.0.3. Bisecting shows that the performance optimization in f989e1b introduced the regression. @lukemanley: any ideas on how to fix this reasonably?
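The last frames of the traceback can be reproduced in isolation. The sketch below assumes that stacking the two level indexes with `append` is a fair stand-in for the combined level that `MultiIndex.append` recodes against; the names `combined` and `raised` are only illustrative. It shows that `get_indexer` refuses an overlapping `IntervalIndex`, while `get_indexer_non_unique`, which the error message suggests, accepts it.

```python
import pandas as pd

ivl1 = pd.IntervalIndex.from_breaks([0.0, 1.0, 2.0])
ivl2 = pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5])

# Assumption: this stacked index stands in for the combined level
# that the recoding step in the traceback looks values up in.
combined = ivl1.append(ivl2)
print(combined.is_overlapping)  # the intervals of ivl1 and ivl2 overlap

# get_indexer requires a non-overlapping IntervalIndex and raises otherwise.
try:
    combined.get_indexer(ivl1)
    raised = False
except pd.errors.InvalidIndexError as err:
    raised = True
    print(err)

# The non-unique variant returns an (indexer, missing) pair instead of raising.
indexer, missing = combined.get_indexer_non_unique(ivl1)
print(indexer, missing)
```

Neither `ivl1` nor `ivl2` overlaps on its own; only the combination of the two does, which may be why the problem shows up in `concat` and not when building either index.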

Expected Behavior

The code should finish without error.
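Until `pd.concat` handles this case again, one possible workaround is to rebuild the index from tuples, the same way `expected_idx` is constructed in the example above. This is only a sketch and not the fix for the regression: it assumes two Series whose values can simply be stacked in order, and `workaround_idx` and `workaround` are hypothetical names.

```python
import numpy as np
import pandas as pd

ivl1 = pd.IntervalIndex.from_breaks([0.0, 1.0, 2.0])
ivl2 = pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5])

s1 = pd.Series(1, index=pd.MultiIndex.from_product([ivl1, ivl1]))
s2 = pd.Series(2, index=pd.MultiIndex.from_product([ivl2, ivl2]))

# Iterating a MultiIndex yields one tuple per row, so the two indexes can be
# joined as plain lists and rebuilt without going through MultiIndex.append.
workaround_idx = pd.MultiIndex.from_tuples(list(s1.index) + list(s2.index))

# Stack the values in the same order as the tuples.
workaround = pd.Series(
    np.concatenate([s1.to_numpy(), s2.to_numpy()]),
    index=workaround_idx,
)
print(workaround)
```

Going through Python tuples is likely to be much slower than `concat` on large inputs, so this is only reasonable for small objects.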

Installed Versions

INSTALLED VERSIONS
------------------
commit : c7325d7
python : 3.10.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-159-generic
Version : #176-Ubuntu SMP Mon Aug 14 12:04:20 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : de_DE.UTF-8
LOCALE : de_DE.UTF-8

pandas : 2.2.0dev0+155.gc7325d7e7e
numpy : 1.24.4
pytz : 2023.3
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.2.1
Cython : 0.29.33
pytest : 7.4.0
hypothesis : 6.83.0
sphinx : 6.2.1
blosc : 1.11.1
feather : None
xlsxwriter : 3.1.2
lxml.etree : 4.9.3
html5lib : 1.1
pymysql : 1.4.6
psycopg2 : 2.9.7
jinja2 : 3.1.2
IPython : 8.15.0
pandas_datareader : None
bs4 : 4.12.2
bottleneck : 1.3.7
dataframe-api-compat: None
fastparquet : 2023.8.0
fsspec : 2023.6.0
gcsfs : 2023.6.0
matplotlib : 3.7.2
numba : 0.57.1
numexpr : 2.8.5
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 13.0.0
pyreadstat : 1.2.3
pyxlsb : 1.0.10
s3fs : 2023.6.0
scipy : 1.11.2
sqlalchemy : 2.0.20
tables : 3.8.0
tabulate : 0.9.0
xarray : 2023.8.0
xlrd : 2.0.1
zstandard : 0.21.0
tzdata : 2023.3
qtpy : None
pyqt5 : None

Labels

Bug, Interval, MultiIndex, Regression
