
ENH: Allow join based on conditional statement #34543

Open

wpdonders opened this issue Jun 3, 2020 · 5 comments
Labels
Enhancement, Reshaping

Comments

@wpdonders

In pandas, it is only possible to perform joins based on conditional statements that check whether the provided keys are equal. It would be great if pandas provided a way to perform joins based on some other logical condition. This is possible in many SQL engines and allows for elegant solutions to specific use cases.

Examples

Setup

df1 = pd.DataFrame({'col_a': [1,2,3], 'col_b': ["A", "B", "C"]})
print(df1)
   col_a col_b
0      1     A
1      2     B
2      3     C
df2 = pd.DataFrame({'col_a': [0, 2, 3], 'col_c': ["Z", "X", "Y"]})
print(df2)
   col_a col_c
0      0     Z
1      2     X
2      3     Y

Current behavior (equal joins)

Based on the conditional df1.col_a == df2.col_a

df_equal = pd.merge(left=df1, right=df2, on="col_a")
print(df_equal)
   col_a col_b col_c
0      2     B     X
1      3     C     Y

Desired behavior

Not-equal join

Based on the conditional df1.col_a != df2.col_a

   col_a col_b col_c
0      1     A     Z
1      1     A     X
2      1     A     Y
3      2     B     Z
4      2     B     Y
5      3     C     Z
6      3     C     X

Greater-than join

Based on the conditional df1.col_a > df2.col_a

   col_a col_b col_c
0      1     A     Z
1      2     B     Z
2      3     C     Z
3      3     C     X

And any other conditional based on logical operators.
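
For reference, these results can currently be emulated (inefficiently) with a cross join followed by a boolean filter. A minimal sketch using only the existing pandas API (the _right suffix is just a disambiguating choice):

# materialize the full Cartesian product, then filter
# (only viable for small frames)
crossed = pd.merge(left=df1, right=df2, how="cross", suffixes=("", "_right"))
df_not_equal = crossed[crossed["col_a"] != crossed["col_a_right"]].drop(columns="col_a_right")
df_greater = crossed[crossed["col_a"] > crossed["col_a_right"]].drop(columns="col_a_right")

This reproduces the tables above (up to row order); a native implementation could avoid building the full product.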

Some considerations

Currently, it is possible with pandas to join on multiple keys. It would be nice if one could provide multiple logical conditions when joining on multiple keys, and allow "mixing" of the logical conditions (i.e. join on df1.col_a != df2.col_a & df1.col_b == df2.col_c). Note also that the second conditional is applied to columns of different names (i.e. col_b of df1 and col_c of df2).

Real-world scenario

Such joins can be very useful for dataframes that store datetime-like objects. Consider a dataframe df_process that contains process_id, process_start_date and process_end_date columns for some business process, and a second dataframe df_events that contains event_id and event_date columns for particular events. If you want to find all events that occur during some business process, you can easily obtain them by applying the conditional join statement:
df_events.event_date >= df_process.process_start_date & df_events.event_date <= df_process.process_end_date
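
A workaround that exists today is to build a pd.IntervalIndex over the process windows and look each event up in it. A minimal sketch with made-up data, assuming the process windows do not overlap (IntervalIndex.get_indexer requires unique, non-overlapping intervals):

import pandas as pd

df_process = pd.DataFrame({
    "process_id": [1, 2],
    "process_start_date": pd.to_datetime(["2020-01-01", "2020-02-01"]),
    "process_end_date": pd.to_datetime(["2020-01-10", "2020-02-10"]),
})
df_events = pd.DataFrame({
    "event_id": [10, 11, 12],
    "event_date": pd.to_datetime(["2020-01-05", "2020-02-03", "2020-03-01"]),
})

# one interval per process window, inclusive on both ends
windows = pd.IntervalIndex.from_arrays(
    df_process["process_start_date"], df_process["process_end_date"], closed="both"
)
# position of the window containing each event, or -1 if none
pos = windows.get_indexer(df_events["event_date"])
matched = df_events[pos != -1].assign(
    process_id=df_process["process_id"].to_numpy()[pos[pos != -1]]
)

This only covers the containment case with at most one match per event; a general conditional join in pd.merge would cover the rest.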

Implementation suggestion

I'm not familiar with how joins are performed in pandas at the moment, but it seems to me that some equality check must be performed to decide whether to join particular records from the dataframes involved. That means (if this is correct) that there is already some conditional built in that evaluates to True or False based on the equality (==) comparison. My request would amount to allowing a different conditional to be passed to this part of the pandas join machinery.

Hopefully it's clear what I am requesting!

@wpdonders added the Enhancement and Needs Triage labels Jun 3, 2020
@jreback
Contributor

jreback commented Jun 3, 2020

this is a duplicate of #7480

though I like your examples more

PRs to implement features are always welcome

@jreback added the Reshaping label and removed the Needs Triage label Jun 3, 2020
@jreback added this to the Contributions Welcome milestone Jun 3, 2020
@samukweku
Contributor

@wpdonders pyjanitor has a conditional_join function that covers non-equi joins. There is also a pending PR, implemented with numba, that brings a significant performance improvement; some benchmark results are listed in the PR.
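
A minimal sketch of the API on the OP's frames, where each condition is a (left_column, right_column, operator) tuple (see the benchmarks later in this thread for the full signature):

import pandas as pd
import janitor  # registers the conditional_join DataFrame method

df1 = pd.DataFrame({'col_a': [1, 2, 3], 'col_b': ["A", "B", "C"]})
df2 = pd.DataFrame({'col_a': [0, 2, 3], 'col_c': ["Z", "X", "Y"]})

# all row pairs where df1.col_a > df2.col_a
out = df1.conditional_join(df2, ('col_a', 'col_a', '>'))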

@mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@erfannariman
Member

erfannariman commented Jan 22, 2023

I don't think these examples make sense, since in real-world scenarios they would always be combined with an actual match on some key (id) and additionally with a condition.

To state in SQL terms:

select
    A.id,
    A.date_from,
    A.date_to,
    B.date_registry,
    B.value
from
    A
join
    B
    on A.id = B.id
    and B.date_registry >= A.date_from
    and B.date_registry < A.date_to;

Notice the join clause.

The example in the OP would only work on small datasets; otherwise the result would quickly explode. So I think for this issue (and potentially a PR), it still makes sense to have an on clause, with an additional condition parameter maybe?
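
For comparison, the closest equivalent with today's pandas API is an equi-join on the key followed by a filter (A and B here are hypothetical frames matching the SQL above):

out = (
    A.merge(B, on="id")                              # on A.id = B.id
     .query("date_from <= date_registry < date_to")  # the extra range conditions
)

The equi-join on id keeps the intermediate result bounded; only the range conditions are applied as a post-filter.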

@samukweku
Contributor

@erfannariman not all joins have an equality condition; there are scenarios where you don't have one, especially range joins. Admittedly, it would be faster if there was an equality condition. In fact, the OP provided a real-world scenario of range joins where there is no equality clause. This article from DuckDB goes a lot further to highlight the use cases, especially for time-series joins. So an implementation that is strictly non-equi is valuable as well.

I implemented the conditional_join function in pyjanitor, which covers non-equi joins and is more efficient than a naive Cartesian join. Depending on the data size, you can get more performance with numba. It would be great if this was implemented directly in pandas, with an API that integrates better with pd.merge, and an even faster implementation from the pandas developers.

Performance-wise it doesn't do badly compared with DuckDB:

In [1]: import pandas as pd; import numpy as np; import janitor as jn; import duckdb

# adapted from DuckDB's repo
In [2]: url = 'https://github.com/samukweku/data-wrangling-blog/raw/master/_notebooks/Data_files/results.csv'
   ...: events = pd.read_csv(url, parse_dates=['start', 'end']).iloc[:, 1:]
   ...: events.head()
Out[2]: 
   id     name  audience               start     sponsor                 end
0   1  Event 1      1178 2022-11-19 10:00:00   Sponsor 2 2022-11-19 10:15:00
1   2  Event 2      1446 2015-09-27 15:00:00  Sponsor 11 2015-09-27 15:11:00
2   3  Event 3      2261 2019-11-12 18:00:00  Sponsor 10 2019-11-12 18:53:00
3   4  Event 4      1471 2019-12-24 22:00:00   Sponsor 6 2019-12-24 22:11:00
4   5  Event 5      2605 2028-06-20 12:00:00   Sponsor 8 2028-06-20 12:31:00

In [3]: out = (events
   ...: .conditional_join(
   ...:     events,
   ...:     ('start', 'end', '<='),
   ...:     ('end', 'start', '>='),
   ...:     ('id', 'id', '!='),
   ...:     use_numba = False,
   ...:     df_columns = ['id', 'start', 'end'],
   ...:     right_columns = ['id', 'start', 'end'])
   ...: )

In [4]: out.head()
Out[4]: 
  left                                          right                                        
    id               start                 end     id               start                 end
0   10 1993-11-27 12:00:00 1993-11-27 12:37:00   2345 1993-11-27 10:00:00 1993-11-27 12:00:00
1   15 1993-04-04 16:00:00 1993-04-04 18:00:00  11178 1993-04-04 17:00:00 1993-04-04 17:22:00
2   17 2030-10-25 07:00:00 2030-10-25 07:27:00  19605 2030-10-25 06:00:00 2030-10-25 08:00:00
3   26 2005-10-04 17:00:00 2005-10-04 17:18:00   8218 2005-10-04 17:00:00 2005-10-04 17:27:00
4   35 2024-05-02 15:00:00 2024-05-02 15:35:00   6916 2024-05-02 15:00:00 2024-05-02 15:36:00

In [5]: %%timeit
   ...: (events
   ...: .conditional_join(
   ...:     events,
   ...:     ('start', 'end', '<='),
   ...:     ('end', 'start', '>='),
   ...:     ('id', 'id', '!='),
   ...:     use_numba = False,
   ...:     df_columns = ['id', 'start', 'end'],
   ...:     right_columns = ['id', 'start', 'end'])
   ...: )
   ...: 
   ...: 
843 ms ± 25.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %%timeit
   ...: (events
   ...: .conditional_join(
   ...:     events,
   ...:     ('start', 'end', '<='),
   ...:     ('end', 'start', '>='),
   ...:     ('id', 'id', '!='),
   ...:     use_numba = True,
   ...:     df_columns = ['id', 'start', 'end'],
   ...:     right_columns = ['id', 'start', 'end'])
   ...: )
   ...: 
   ...: 
16.5 ms ± 2.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [8]: con = duckdb.connect()
   ...: 

In [9]: event_count = """SELECT COUNT(*) FROM (
   ...:     SELECT r.id, s.id
   ...:     FROM events r, events s
   ...:     WHERE r.start <= s.end AND r.end >= s.start
   ...:       AND r.id <> s.id
   ...: ) q2;"""

In [10]: %timeit con.execute(event_count)
    ...: 
17.2 ms ± 531 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: con.execute(event_count)
Out[11]: <duckdb.DuckDBPyConnection at 0x7ffb884539b0>

In [12]: con.execute(event_count).fetchone()
Out[12]: (3702,)

In [13]: out.shape
Out[13]: (3702, 6)

We can go larger with a 10-million-row table, again adapted from DuckDB:

In [18]: ##### Tax audits
    ...: query = """CREATE TYPE surname_t AS ENUM (
    ...:     'Smith',
    ...:     'Johnson',
    ...:     'Williams',
    ...:     'Jones',
    ...:     'Brown',
    ...:     'Davis',
    ...:     'Miller',
    ...:     'Wilson',
    ...:     'Moore',
    ...:     'Taylor',
    ...:     'Anderson',
    ...:     'Thomas',
    ...:     'Jackson',
    ...:     'White',
    ...:     'Harris',
    ...:     'Martin',
    ...:     'Thompson',
    ...:     'Garcia',
    ...:     'Martinez',
    ...:     'Robinson'
    ...: );
    ...: SELECT SETSEED(0.8675309);
    ...: CREATE TABLE employees AS
    ...: SELECT
    ...: facts.id AS id,
    ...: surname AS "name",
    ...: dept,
    ...: salary,
    ...: (salary / 10 - CASE WHEN random() <= 0.01 THEN (10 + 1) ELSE 0 END)::INTEGER AS tax
    ...: FROM (
    ...: SELECT
    ...: id,
    ...: enum_range(NULL::surname_t)[(round(random() * 19))::INTEGER] AS surname,
    ...: round(random() * 5)::INTEGER  AS dept,
    ...: 100 * id AS salary
    ...: FROM (SELECT UNNEST(range(1, 10000000))) tbl(id)
    ...: ) facts
    ...: ;
    ...: """

In [19]: con = duckdb.connect()
    ...: con.execute(query)
    ...: counts = """SELECT COUNT(*) FROM (
    ...:     SELECT r.id, s.id
    ...:     FROM employees r, employees s
    ...:     WHERE r.salary < s.salary AND r.tax > s.tax
    ...: ) q1;"""

In [20]: %timeit con.execute(counts)
6.02 s ± 716 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [21]: # create a pandas DataFrame
    ...: employees = con.execute("select * from employees").df()

In [22]: # long-running process, using numba here

In [23]: emp = (employees
    ...: .conditional_join(
    ...:     employees,
    ...:     ('salary', 'salary', '<'),
    ...:     ('tax', 'tax', '>'),
    ...:     use_numba = True)
    ...: )

In [24]: %%timeit
    ...: (employees
    ...: .conditional_join(
    ...:     employees,
    ...:     ('salary', 'salary', '<'),
    ...:     ('tax', 'tax', '>'),
    ...:     use_numba = True)
    ...: )
    ...: 
    ...: 
3.08 s ± 31.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [25]: con.execute(counts).fetchone()
Out[25]: (98805,)

In [26]: emp.shape
Out[26]: (98805, 10)

Non-equi joins might not be as widely used as equi-joins, but they still have their place. Implementing this within pandas might not be too hard; we just need to figure out a proper way to integrate it with the current merge API.

@LakshmanKishore
Contributor

I fully support the proposal to introduce conditional joins in pandas, as it aligns with SQL's ability to join on arbitrary conditions. This enhancement would greatly benefit users dealing with complex join scenarios.

My suggestion for the syntax is as follows:

Example Usage:

result = pd.merge(
    dfa,
    dfb,
    how="inner",
    conditions=[
        [('cola1', '>', 'colb1'), ('cola2', '<', 'colb2')],
        ('cola3', '=', 'colb3')
    ]
)

Description:

Condition Structure:

Each condition is represented by a tuple (column, operator, column); top-level entries in the conditions list are combined with AND, while entries inside a nested list are combined with OR. For example:

[('cola1', '>', 'colb1'), ('cola2', '<', 'colb2')],

This indicates an OR condition where either cola1 in dfa should be greater than colb1 in dfb, or cola2 in dfa should be less than colb2 in dfb.
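
A naive reference implementation of these semantics might look like the sketch below (merge_conditional and _OPS are hypothetical names; it builds the full cross join and filters, so it illustrates the semantics of an inner join rather than an efficient implementation, and it assumes the two frames have no overlapping column names):

import operator

import pandas as pd

_OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge,
        "<=": operator.le, "=": operator.eq, "!=": operator.ne}

def merge_conditional(dfa, dfb, conditions):
    # Cartesian product of the two frames (inner-join semantics only)
    out = dfa.merge(dfb, how="cross")
    mask = pd.Series(True, index=out.index)
    for cond in conditions:
        if isinstance(cond, list):  # nested list: OR of its tuples
            sub = pd.Series(False, index=out.index)
            for left, op, right in cond:
                sub |= _OPS[op](out[left], out[right])
            mask &= sub
        else:  # bare tuple: combined with AND at the top level
            left, op, right = cond
            mask &= _OPS[op](out[left], out[right])
    return out[mask]

A real implementation would avoid materializing the full product (e.g. with sorting or interval techniques), but this pins down what the conditions argument would mean.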

@mroeschke mentioned this issue Sep 3, 2024