Skip to content

ENH: IntervalArray/IntervalIndex Range Joins #43031

Open
@sterlinm

Description

@sterlinm

Is your feature request related to a problem?

I'm interested in adding additional options that would modify the behavior of merge, join, and concatenate operations when performed between IntervalArrays or IntervalIndexes.

There has been a fair amount of related discussion in a number of somewhat stale issues. I'd be happy to continue the discussion on one of those if it makes more sense.

Describe the solution you'd like

I'd like to be able to do merges on joins using Intervals where instead of finding exact matches the overlapping ranges are matched and split up as needed. Suppose we have the following two DataFrames.

idx_A = pd.IntervalIndex.from_tuples([(1, 4), (4, 6), (8, 9)])
df_A = pd.DataFrame(data={'A':['A1', 'A2', 'A3']}, index=idx_A)

idx_B = pd.IntervalIndex.from_tuples([(2, 3), (3, 5), (5, 10)])
df_B = pd.DataFrame(data={'B':['B1', 'B2', 'B3']}, index=idx_B)

Screen Shot 2021-08-13 at 8 57 04 PM

Currently, an outer join would not match any of the rows at all because it only joins Intervals that are equal.

df_A.join(df_B, how='outer')

Screen Shot 2021-08-13 at 9 00 15 PM

I'm interested in having an option that would change the behavior so that the overlapping portions of the intervals would be aligned. So the result would look like this.

df_a.join(df_B, how='overlapping') # not at all wedded to that parameter
idx_expected = pd.IntervalIndex.from_tuples([
    (1,2), (2,3), (3,4), (4,5),
    (5,6), (6,9),(9,10)])

df_expected = pd.DataFrame(
    index=idx_expected, data={
        'A': ['A1', 'A1', 'A1', 'A1', 'A2', 'A2', np.nan],
        'B': [np.nan, 'B1', 'B2', 'B2', 'B3', 'B3', 'B3']
    }
    
)

Screen Shot 2021-08-13 at 9 06 22 PM

Ideally, the behavior would extend to multiple DataFrames or Series. So concat would work as well.

idx_C = pd.IntervalIndex.from_tuples([(3, 5), (8, 12), (14, 20)])
df_C = pd.DataFrame(data={'C':['C1', 'C2', 'C3']}, index=idx_C)

concat_df = pd.concat([df_A, df_B, df_C], axis=1)

Screen Shot 2021-08-13 at 9 14 15 PM

Screen Shot 2021-08-13 at 9 14 23 PM

API breaking implications

I imagine it would be best to avoid breaking the existing behavior of merge, join, concat, etc. on IntervalArrays. I could see either creating new functions (like pd.merge_asof) or adding either additional parameters to the various methods or additional options to existing parameters (such as how).

Describe alternatives you've considered

I've considered just writing my own functions on that use IntervalArrays to achieve the following and looked at a number of other libraries.

Additional context

Some other related concepts or implementations:

Metadata

Metadata

Assignees

No one assigned

    Labels

    EnhancementIndexRelated to the Index class or subclassesIntervalInterval data typeReshapingConcat, Merge/Join, Stack/Unstack, Explodesetopsunion, intersection, difference, symmetric_difference

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions