-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Allow join based on conditional statement #34543
Comments
this a duplicate of #7480 though i like your examples more PRs to implement features are always welcome |
@wpdonders pyjanitor has a conditional_join function that covers non equi joins. There is also a pending PR with significant performance improvement, which is implemented in numba. Some benchmark performance results are listed in the PR. |
I don't think these examples make sense, since in real world scenario's they would always be combined with an actual match on some key (id) and additionally with a condition. To state in SQL terms: select
A.id,
A.date_from
A.date_to
B.date_registry
B.value
from
A
join
B
on A.id = B.id
and B.date_registry >= A.date_from
and B.date_registry < A.date_to; Notice the join clause. The example in the OP would only work on small datasets, else it would quickly explode. So I think for this issue (and potentially) a PR, it still makes sense to have a |
@erfannariman not all joins have an equality condition; there are scenarios where you dont have, especially range joins. Admittedly, it would be faster if there was an equality condition. Infact, the OP provided a real world scenario of range joins where there is no equality clause. This article from duckdb goes a lot further to highlight the use cases, especially for timeseries joins. So, an implementation that is strictly non-equi is valuable as well. I implemented conditional_join function in pyjanitor that covers non equi joins, and is more efficient than a naive cartesian join. Depending on the data size, you could get more performance with Performance wise it doesnt do badly, compared with DuckDb's: In [1]: import pandas as pd; import numpy as np; import janitor as jn; import duckdb
# adapted from DuckDB's repo
In [2]: url = 'https://github.com/samukweku/data-wrangling-blog/raw/master/_notebooks/Data_files/results.csv'
...: events = pd.read_csv(url, parse_dates=['start', 'end']).iloc[:, 1:]
...: events.head()
Out[2]:
id name audience start sponsor end
0 1 Event 1 1178 2022-11-19 10:00:00 Sponsor 2 2022-11-19 10:15:00
1 2 Event 2 1446 2015-09-27 15:00:00 Sponsor 11 2015-09-27 15:11:00
2 3 Event 3 2261 2019-11-12 18:00:00 Sponsor 10 2019-11-12 18:53:00
3 4 Event 4 1471 2019-12-24 22:00:00 Sponsor 6 2019-12-24 22:11:00
4 5 Event 5 2605 2028-06-20 12:00:00 Sponsor 8 2028-06-20 12:31:00
In [3]: out = (events
...: .conditional_join(
...: events,
...: ('start', 'end', '<='),
...: ('end', 'start', '>='),
...: ('id', 'id', '!='),
...: use_numba = False,
...: df_columns = ['id', 'start', 'end'],
...: right_columns = ['id', 'start', 'end'])
...: )
In [4]: out.head()
Out[4]:
left right
id start end id start end
0 10 1993-11-27 12:00:00 1993-11-27 12:37:00 2345 1993-11-27 10:00:00 1993-11-27 12:00:00
1 15 1993-04-04 16:00:00 1993-04-04 18:00:00 11178 1993-04-04 17:00:00 1993-04-04 17:22:00
2 17 2030-10-25 07:00:00 2030-10-25 07:27:00 19605 2030-10-25 06:00:00 2030-10-25 08:00:00
3 26 2005-10-04 17:00:00 2005-10-04 17:18:00 8218 2005-10-04 17:00:00 2005-10-04 17:27:00
4 35 2024-05-02 15:00:00 2024-05-02 15:35:00 6916 2024-05-02 15:00:00 2024-05-02 15:36:00
In [5]: %%timeit
...: (events
...: .conditional_join(
...: events,
...: ('start', 'end', '<='),
...: ('end', 'start', '>='),
...: ('id', 'id', '!='),
...: use_numba = False,
...: df_columns = ['id', 'start', 'end'],
...: right_columns = ['id', 'start', 'end'])
...: )
...:
...:
843 ms ± 25.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: %%timeit
...: (events
...: .conditional_join(
...: events,
...: ('start', 'end', '<='),
...: ('end', 'start', '>='),
...: ('id', 'id', '!='),
...: use_numba = True,
...: df_columns = ['id', 'start', 'end'],
...: right_columns = ['id', 'start', 'end'])
...: )
...:
...:
16.5 ms ± 2.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]: con = duckdb.connect()
...:
In [9]: event_count = """SELECT COUNT(*) FROM (
...: ^ISELECT r.id, s.id
...: ^IFROM events r, events s
...: ^IWHERE r.start <= s.end AND r.end >= s.start
...: ^I AND r.id <> s.id
...: ) q2;"""
In [10]: %timeit con.execute(event_count)
...:
17.2 ms ± 531 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: con.execute(event_count)
Out[11]: <duckdb.DuckDBPyConnection at 0x7ffb884539b0>
In [12]: con.execute(event_count).fetchone()
Out[12]: (3702,)
In [13]: out.shape
Out[13]: (3702, 6) We can go larger on a 10 million row table, again from DuckDB: In [18]: ##### Tax audits
...: query = """CREATE TYPE surname_t AS ENUM (
...: 'Smith',
...: 'Johnson',
...: 'Williams',
...: 'Jones',
...: 'Brown',
...: 'Davis',
...: 'Miller',
...: 'Wilson',
...: 'Moore',
...: 'Taylor',
...: 'Anderson',
...: 'Thomas',
...: 'Jackson',
...: 'White',
...: 'Harris',
...: 'Martin',
...: 'Thompson',
...: 'Garcia',
...: 'Martinez',
...: 'Robinson'
...: );
...: SELECT SETSEED(0.8675309);
...: CREATE TABLE employees AS
...: SELECT
...: facts.id AS id,
...: surname AS "name",
...: dept,
...: salary,
...: (salary / 10 - CASE WHEN random() <= 0.01 THEN (10 + 1) ELSE 0 END)::INTEGER AS tax
...: FROM (
...: SELECT
...: id,
...: enum_range(NULL::surname_t)[(round(random() * 19))::INTEGER] AS surname,
...: round(random() * 5)::INTEGER AS dept,
...: 100 * id AS salary
...: FROM (SELECT UNNEST(range(1, 10000000))) tbl(id)
...: ) facts
...: ;
...: """
In [19]: con = duckdb.connect()
...: con.execute(query)
...: counts = """SELECT COUNT(*) FROM (
...: ^ISELECT r.id, s.id
...: ^IFROM employees r, employees s
...: ^IWHERE r.salary < s.salary AND r.tax > s.tax
...: ) q1;"""
In [20]: %timeit con.execute(counts)
6.02 s ± 716 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [21]: # creata Pandas dataframe
...: employees = con.execute("select * from employees").df()
In [22]: # long running process, using numba here
In [23]: emp = (employees
...: .conditional_join(
...: employees,
...: ('salary', 'salary', '<'),
...: ('tax', 'tax', '>'),
...: use_numba = True)
...: )
In [24]: %%timeit
...: (employees
...: .conditional_join(
...: employees,
...: ('salary', 'salary', '<'),
...: ('tax', 'tax', '>'),
...: use_numba = True)
...: )
...:
...:
3.08 s ± 31.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [25]: con.execute(counts).fetchone()
Out[25]: (98805,)
In [26]: emp.shape
Out[26]: (98805, 10) non-equi joins might not be as widely used as equi-joins, but they still have their place - implementing it within Pandas might not be too hard - just need to figure out a proper way to integrate it with the current |
I fully support the proposal for introducing conditional joins in pandas, as it aligns with SQL's capability to perform conditional joins. This enhancement would greatly benefit users dealing with complex join scenarios. My suggestion for the syntax is as follows: Example Usage:result = pd.merge(
dfa,
dfb,
how="inner",
conditions=[
[('cola1', '>', 'colb1'), ('cola2', '<', 'colb2')],
('cola3', '=', 'colb3')
]
) Description:Condition Structure:Each condition is represented by a tuple (column, operator, column). For example: [('cola1', '>', 'colb1'), ('cola2', '<', 'colb2')], This indicates an OR condition where either cola1 in dfa should be greater than colb1 in dfb, or cola2 in dfa should be less than colb2 in dfb. |
In pandas, it is only possible to perform joins based on conditional statements that consider whether they provided keys are equal. It would be great if pandas provided a way in which joins can be performed based on some other logical condition. This is possible in many SQL engines and allows for elegant solution of specific use cases.
Examples
Setup
Current behavior (equal joins)
Based on conditional
df1.col_a == df2.col_b
Desired behavior
Not equal to join
Based on conditional
df1.col_a != df2.col_a
Greater than join
Based on conditional
df1.col_a > df2.col_a
And any other conditional based on logic operators.
Some considerations
Currently, it it is possible with pandas to join on multiple keys. It would be nice if it is possible to provide multiple logical conditions to perform joining on multiple keys, and allow "mixing" of the logical conditions (i.e. join on
df1.col_a != df2.col_a & df1.col_b == df1.col_c
). Note also that the second conditional is applied to columns of different names (i.e.col_b
ofdf1
andcol_c
ofdf2
).Real-world scenario
Such joins can be very useful in dataframes that have datetime-like objects stored in them. Consider a dataframe
df_process
that containsprocess_id
process_start_date
andprocess_end_date
columns for some business process, and a second dataframedf_events
that contains aevent_id
,event_date
column for particular events. If you want to find all events that occur during some business process, you can easily obtain these by applying the conditional join statement:df_event.event_date >= df_process.process_start_date & df_event.event_date <= df_process.process_end_date
Implementation suggestion
I'm not familiar with how joins are performed in pandas at the moment, but it seems to me that some equality check must be performed to decide to join particular records from the dataframes involved in the join. That means (if this is correct) that there is already some conditional built-in that evaluates to
True
orFalse
based on the equality (==
) comparison. My request should allow passing a different conditional to this part of the pandas join machinery.Hopefully it's clear what I am requesting!
The text was updated successfully, but these errors were encountered: