ENH: Do not require to sort entire DF if by option used in merge_asof #49816

Open
1 of 3 tasks
filippzorin opened this issue Nov 21, 2022 · 3 comments
Labels
Enhancement Needs Discussion Requires discussion from core team before further action Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@filippzorin

filippzorin commented Nov 21, 2022

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

As a pandas user, I'd like the following behavior when using merge_asof.
Right now it requires the entire DataFrame to be sorted by the on key, but that looks unnecessary when the by option is used.
Let me explain with an example.
Suppose we have two DataFrames:

import pandas as pd

main_df = pd.DataFrame({'id': ['1', '1', '1', '2', '2'], 'tracking': [1, 4, 7, 1, 5]})

  id  tracking
0  1         1
1  1         4
2  1         7
3  2         1
4  2         5

measurements = pd.DataFrame({'id': ['1', '1'], 'position': [2, 5], 'value': [100, 150]})

  id  position  value
0  1         2    100
1  1         5    150

We want to join them with merge_asof:

pd.merge_asof(
    left=main_df, 
    right=measurements, 
    by='id', 
    left_on='tracking', 
    right_on='position', 
    direction='nearest')

Since the left DataFrame is not sorted by tracking as a whole, this raises an error:
ValueError: left keys must be sorted

So we need to sort the left DataFrame first:

pd.merge_asof(
    left=main_df.sort_values('tracking'),
    right=measurements,
    by='id',
    left_on='tracking',
    right_on='position',
    direction='nearest'
)

But the row order of the result is less intuitive than the original order, in which each segment with a given id was sorted independently:

  id  tracking  position  value
0  1         1       2.0  100.0
1  2         1       NaN    NaN
2  1         4       5.0  150.0
3  2         5       NaN    NaN
4  1         7       5.0  150.0
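
Until something like this is supported, one possible workaround is to remember the original row position, sort globally, merge, and then restore the original order. This is only a sketch: the helper column name `_row` is made up for illustration, and it relies on `merge_asof` carrying extra left columns through unchanged.

```python
import pandas as pd

main_df = pd.DataFrame({'id': ['1', '1', '1', '2', '2'], 'tracking': [1, 4, 7, 1, 5]})
measurements = pd.DataFrame({'id': ['1', '1'], 'position': [2, 5], 'value': [100, 150]})

# Record the original row position in a helper column, then sort by the "on" key.
left = main_df.assign(_row=range(len(main_df))).sort_values('tracking', kind='stable')

merged = pd.merge_asof(
    left=left,
    right=measurements.sort_values('position'),
    by='id',
    left_on='tracking',
    right_on='position',
    direction='nearest',
)

# Put the rows back in their original order and drop the helper column.
result = merged.sort_values('_row').drop(columns='_row').reset_index(drop=True)
print(result)
```

This still pays for a global sort of the left DataFrame, so it addresses the row order of the result but not the sorting requirement itself.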

Feature Description

It would be nice if pandas required sorting only within each segment defined by the by argument of merge_asof.
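
The requested behavior can be roughly emulated today by calling merge_asof once per by group. The function `merge_asof_by_segment` below is a hypothetical helper, not part of pandas, and it assumes each group's rows are contiguous in the left DataFrame. It loops in Python, so it is likely much slower than a native implementation.

```python
import pandas as pd


def merge_asof_by_segment(left, right, by, left_on, right_on, direction='nearest'):
    """Hypothetical helper: run merge_asof separately for each `by` group.

    Only requires `left_on` to be sorted within each group of `left`.
    """
    right_sorted = right.sort_values(right_on)
    pieces = []
    # sort=False keeps the groups in order of first appearance in `left`.
    for _, segment in left.groupby(by, sort=False):
        pieces.append(
            pd.merge_asof(
                left=segment,
                right=right_sorted,
                by=by,
                left_on=left_on,
                right_on=right_on,
                direction=direction,
            )
        )
    return pd.concat(pieces, ignore_index=True)


main_df = pd.DataFrame({'id': ['1', '1', '1', '2', '2'], 'tracking': [1, 4, 7, 1, 5]})
measurements = pd.DataFrame({'id': ['1', '1'], 'position': [2, 5], 'value': [100, 150]})

result = merge_asof_by_segment(
    main_df, measurements, by='id', left_on='tracking', right_on='position'
)
print(result)
```

With the example data, the left DataFrame is passed in unsorted as a whole, and the result keeps the per-id ordering of the input.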

Alternative Solutions

Haven't seen any alternatives.

Additional Context

No response

@filippzorin filippzorin added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 21, 2022
@samukweku
Contributor

@filippzorin if you look at the implementation of merge_asof, the iteration is done on the non-equi join key (on); when there is a match, it then checks whether the equi join key (by) matches, and then breaks. That is a crude interpretation of what happens. So the search doesn't happen on the equi key first, but on the non-equi key.

@filippzorin
Author

@samukweku, thanks for the comment. I think I got your point. I looked into the implementation of merge_asof and yes, the by condition is processed after the on condition. In my opinion it should be the other way around: process the by condition first, and only then process the on condition and check the constraints on sorted values and null values. What do you think?
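
To make the proposed ordering concrete, here is a rough pure-Python sketch of "by first, then on" for direction='nearest'. It is an illustration only, not the libjoin implementation; the function name and the tuple-based row format are assumptions, and ties between two equally near positions are resolved arbitrarily (the smaller one wins here).

```python
from bisect import bisect_left
from collections import defaultdict


def asof_nearest_by_first(left_rows, right_rows):
    """Illustrative only. Rows are (key, on_value) tuples.

    Returns, for each left row, the nearest right on_value with the same key,
    or None when the key has no rows on the right.
    """
    # Step 1: apply the "by" condition by bucketing right rows per key.
    buckets = defaultdict(list)
    for key, pos in right_rows:
        buckets[key].append(pos)
    for positions in buckets.values():
        positions.sort()  # sortedness is only needed inside each bucket

    # Step 2: apply the "on" condition inside the matching bucket.
    matches = []
    for key, value in left_rows:
        positions = buckets.get(key)
        if not positions:
            matches.append(None)
            continue
        i = bisect_left(positions, value)
        # The nearest value is either just before or at the insertion point.
        candidates = positions[max(i - 1, 0): i + 1]
        matches.append(min(candidates, key=lambda p: abs(p - value)))
    return matches


left_rows = [('1', 1), ('1', 4), ('1', 7), ('2', 1), ('2', 5)]
right_rows = [('1', 2), ('1', 5)]
matches = asof_nearest_by_first(left_rows, right_rows)
print(matches)  # [2, 5, 5, None, None]
```

In this sketch the left rows need no global ordering at all, because each lookup is an independent binary search within its own bucket.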

@samukweku
Contributor

samukweku commented Nov 22, 2022

@filippzorin that requires a change to the libjoin Cython code, which I do not think is trivial. However, it could provide a performance boost and possibly even make inequality joins easier in pandas. Hopefully one of the core devs can speak more on this.

@phofl phofl added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Dec 5, 2022
@mroeschke mroeschke added Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 16, 2024