Skip to content

Conversation

samukweku
Copy link
Collaborator

@samukweku samukweku commented Jul 7, 2022

Please describe the changes proposed in the pull request:

  • Incremental addition of numba options for conditional_join
  • current implementation is for first and last matches for a single conditional join, and only for less than/greater than, and either of the dataframes is unsorted

Speed tests (as always, take with a pinch of salt)

#https://stackoverflow.com/q/61948103/7175713 
df1 = pd.DataFrame({'id': [1,1,1,2,2,3], 
                    'value_1': [2,5,7,1,3,4]})

df2 = pd.DataFrame({'id': [1,1,1,1,2,2,2,3], 
                    'value_2A': [0,3,7,12,0,2,3,1], 
                    'value_2B': [1,5,9,15,1,4,6,3]})

df1 = pd.concat([df1]*10_000, ignore_index=True)
df2 = pd.concat([df2]*200, ignore_index=True)

%%timeit
df1.conditional_join(
        df2,
        ('value_1', 'value_2A', '>'),
        how='inner',
        use_numba=True,
        keep='first',
        sort_by_appearance = False
    )
37 ms ± 2.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df1.conditional_join(
        df2,
        ('value_1', 'value_2A', '>'),
        how='inner',
        keep='first',
        sort_by_appearance = False
    )
179 ms ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

numba_result = df1.conditional_join(
        df2,
        ('value_1', 'value_2A', '>'),
        how='inner',
        use_numba=True,
        keep='first',
        sort_by_appearance = False
    )

no_numba = df1.conditional_join(
        df2,
        ('value_1', 'value_2A', '>'),
        how='inner',
        keep='first',
        sort_by_appearance = False
    )

numba_result.equals(no_numba)
True

This PR relates #1102 .

PR Checklist

Please ensure that you have done the following:

  1. PR in from a fork off your branch. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.
  1. If you're not on the contributors list, add yourself to AUTHORS.md.
  1. Add a line to CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
    • Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Automatic checks

There will be automatic checks run on the PR. These include:

  • Building a preview of the docs on Netlify
  • Automatically linting the code
  • Making sure the code is documented
  • Making sure that all tests are passed
  • Making sure that code coverage doesn't go down.

Relevant Reviewers

Please tag maintainers to review.

@samukweku
Copy link
Collaborator Author

If anyone has some time, I need help on conditional import; my initial attempt with an if clause fails during testing with a name error message. Also, would love some assistance on testing with numba, if anyone has a testing framework for numba operations.

@ericmjl
Copy link
Member

ericmjl commented Jul 7, 2022

@thatlittleboy
Copy link
Contributor

I need help on conditional import; my initial attempt with an if clause fails during testing with a name error message

I suppose it's because you didn't add numba into the environment-dev.yml file? The environment used during the CI test build is set up using that file. Give that a try?

@Zeroto521
Copy link
Member

Is there a clear performance comparison?
Such as x1.x than without numba in xx data size

@samukweku
Copy link
Collaborator Author

samukweku commented Jul 8, 2022

@Zeroto521 I updated the PR to include timings - depending on the size, it is about 3x faster with numba;
@thatlittleboy thanks!

@codecov
Copy link

codecov bot commented Jul 8, 2022

Codecov Report

Merging #1131 (fef03d9) into dev (a25c821) will decrease coverage by 0.04%.
The diff coverage is 97.11%.

@@            Coverage Diff             @@
##              dev    #1131      +/-   ##
==========================================
- Coverage   97.37%   97.32%   -0.05%     
==========================================
  Files          77       77              
  Lines        3163     3217      +54     
==========================================
+ Hits         3080     3131      +51     
- Misses         83       86       +3     

Copy link
Member

@ericmjl ericmjl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@samukweku I wouldn't worry about testing the conditional import for Numba. However, I think it's important to test the line on any_null, as it seems to be one crucial code path in the functionality of the code. Hope you're okay with that, I'll mark as "Request changes" for now.

Copy link
Member

@ericmjl ericmjl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve! Thank you, @samukweku!

@ericmjl
Copy link
Member

ericmjl commented Jul 20, 2022

If we can, let's get in one more review before merging. Otherwise, I'll merge on the weekend.

@ericmjl
Copy link
Member

ericmjl commented Jul 30, 2022

Ok, let's merge. Otherwise this PR is gonna get stale 😄.

@ericmjl ericmjl merged commit 76e7228 into dev Jul 30, 2022
@Zeroto521 Zeroto521 deleted the samukweku/cond_join_1 branch November 2, 2022 12:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants