Description
Is your feature request related to a problem?
I'm interested in adding additional options that would modify the behavior of merge, join, and concatenate operations when performed between IntervalArrays or IntervalIndexes.
There has been a fair amount of related discussion in a number of somewhat stale issues. I'd be happy to continue the discussion on one of those if it makes more sense.
- Features which Interval / IntervalIndex should probably have #19480
- Add spec for new Interval / IntervalIndex methods: .overlaps(), .covers() #18975
- ENH: Implement overlaps method for Interval-like #22939
- ENH: Interval type should support intersection, union & overlaps & difference #21998
Describe the solution you'd like
I'd like to be able to do merges on joins using Intervals where instead of finding exact matches the overlapping ranges are matched and split up as needed. Suppose we have the following two DataFrames.
idx_A = pd.IntervalIndex.from_tuples([(1, 4), (4, 6), (8, 9)])
df_A = pd.DataFrame(data={'A':['A1', 'A2', 'A3']}, index=idx_A)
idx_B = pd.IntervalIndex.from_tuples([(2, 3), (3, 5), (5, 10)])
df_B = pd.DataFrame(data={'B':['B1', 'B2', 'B3']}, index=idx_B)
Currently, an outer join would not match any of the rows at all because it only joins Intervals that are equal.
df_A.join(df_B, how='outer')
I'm interested in having an option that would change the behavior so that the overlapping portions of the intervals would be aligned. So the result would look like this.
df_a.join(df_B, how='overlapping') # not at all wedded to that parameter
idx_expected = pd.IntervalIndex.from_tuples([
(1,2), (2,3), (3,4), (4,5),
(5,6), (6,9),(9,10)])
df_expected = pd.DataFrame(
index=idx_expected, data={
'A': ['A1', 'A1', 'A1', 'A1', 'A2', 'A2', np.nan],
'B': [np.nan, 'B1', 'B2', 'B2', 'B3', 'B3', 'B3']
}
)
Ideally, the behavior would extend to multiple DataFrames or Series. So concat would work as well.
idx_C = pd.IntervalIndex.from_tuples([(3, 5), (8, 12), (14, 20)])
df_C = pd.DataFrame(data={'C':['C1', 'C2', 'C3']}, index=idx_C)
concat_df = pd.concat([df_A, df_B, df_C], axis=1)
API breaking implications
I imagine it would be best to avoid breaking the existing behavior of merge, join, concat, etc. on IntervalArrays. I could see either creating new functions (like pd.merge_asof
) or adding either additional parameters to the various methods or additional options to existing parameters (such as how
).
Describe alternatives you've considered
I've considered just writing my own functions on that use IntervalArrays to achieve the following and looked at a number of other libraries.
Additional context
Some other related concepts or implementations: