-
Notifications
You must be signed in to change notification settings - Fork 180
Samukweku/cond join refactor a #1548
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces a significant refactor to replace Numba with Rust implementations for conditional join operations, aiming to improve performance and enable gradual deprecation of Numba. The changes add a new Rust dependency (janitor-rs) and restructure internal conditional join logic with new helper modules.
Key Changes:
- Added
janitor-rs>=0.3.0,<0.4as a new dependency - Refactored conditional join internals to use Rust-based implementations
- Added new test cases for
keep="first"andkeep="last"behavior in range joins - Deprecated
use_numba,df_columns, andright_columnsparameters (in documentation only)
Reviewed changes
Copilot reviewed 12 out of 14 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| pyproject.toml | Added janitor-rs dependency version constraint |
| pixi.lock | Updated lock file with janitor-rs package and updated scipy/xarray/pyspark versions |
| tests/functions/test_conditional_join.py | Added tests for keep="first" and keep="last", added use_numba=True to existing test |
| janitor/functions/conditional_join.py | Refactored to use new Rust-based implementations, updated imports, documented deprecations |
| janitor/functions/_conditional_join/_range_indices.py | New file implementing range join index computation using Rust |
| janitor/functions/_conditional_join/_not_equal_indices.py | Updated to use np.intp consistently |
| janitor/functions/_conditional_join/_less_than_indices.py | Refactored to use janitor_rs for index repetition |
| janitor/functions/_conditional_join/_helpers.py | Added extensive helper functions for Rust integration |
| janitor/functions/_conditional_join/_greater_than_indices.py | Refactored to use janitor_rs for index repetition |
| janitor/functions/_conditional_join/_get_indices_single_join.py | New file for single condition joins |
| janitor/functions/_conditional_join/_get_indices_non_equi.py | New file for multiple non-equi join conditions |
| janitor/functions/_conditional_join/_dual_non_equi.py | New file for dual non-equi join implementation |
| .requirements/base.in | Updated to use janitor_rs and unpinned pandas_flavor version |
| .pre-commit-config.yaml | Added file size limit check (1MB) |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # print(df.to_dict()) | ||
| # print("\n\n\n") | ||
| # print(right.to_dict()) |
Copilot
AI
Dec 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Debug print statements should be removed before merging to production. These commented-out print statements appear to be leftover debugging code.
| # print(df.to_dict()) | |
| # print("\n\n\n") | |
| # print(right.to_dict()) |
| if op == _JoinOperator.STRICTLY_EQUAL.value: | ||
| eq_check = True | ||
| elif op in less_than_join_types.union(greater_than_join_types): | ||
| if op in less_than_join_types.union(greater_than_join_types): |
Copilot
AI
Dec 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition logic has changed from elif to if, which means both branches can now execute when op == "==". This appears to be a bug as eq_check and le_lt_check could both be set to True for equality operators, changing the original mutually exclusive behavior.
| if op in less_than_join_types.union(greater_than_join_types): | |
| elif op in less_than_join_types.union(greater_than_join_types): |
| return_matching_indices=False, | ||
| row_count=row_count, |
Copilot
AI
Dec 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parameter name has changed from return_ragged_arrays to return_matching_indices in the function call at line 699, but the function signature at line 682 still uses the old name return_ragged_arrays. This mismatch will cause a TypeError at runtime.
| !!!warning "Deprecated in 0.33.0" | ||
| right_columns: Columns to select from `right` in the final output dataframe. | ||
| Column selection is based on the | ||
| [`select`][janitor.functions.select.select] syntax. | ||
| !!!warning "Deprecated in 0.33.0" | ||
| use_numba: Use numba, if installed, to accelerate the computation. | ||
| !!!warning "Deprecated in 0.33.0" |
Copilot
AI
Dec 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The deprecation warnings in the docstring are using incorrect syntax. The !!!warning directive should be !!! warning (with a space after the exclamation marks) to properly render in MkDocs/Material for proper documentation display.
| !!!warning "Deprecated in 0.33.0" | |
| right_columns: Columns to select from `right` in the final output dataframe. | |
| Column selection is based on the | |
| [`select`][janitor.functions.select.select] syntax. | |
| !!!warning "Deprecated in 0.33.0" | |
| use_numba: Use numba, if installed, to accelerate the computation. | |
| !!!warning "Deprecated in 0.33.0" | |
| !!! warning "Deprecated in 0.33.0" | |
| right_columns: Columns to select from `right` in the final output dataframe. | |
| Column selection is based on the | |
| [`select`][janitor.functions.select.select] syntax. | |
| !!! warning "Deprecated in 0.33.0" | |
| use_numba: Use numba, if installed, to accelerate the computation. | |
| !!! warning "Deprecated in 0.33.0" |
| # deprecate in a future version | ||
| check("use_numba", use_numba, [bool]) | ||
|
|
||
| check("indicator", indicator, [bool, str]) |
Copilot
AI
Dec 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While the documentation states that use_numba, df_columns, and right_columns are deprecated in version 0.33.0, there are no actual deprecation warnings being raised in the code. Users should receive a DeprecationWarning or FutureWarning when using these parameters. Consider using the deprecated_kwargs decorator or adding explicit warning calls to inform users about these deprecations.
| left_index = left_index[booleans] | ||
| left_region = left_region[booleans] | ||
| positions = positions[booleans] | ||
| right_index = right_index[positions] |
Copilot
AI
Dec 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Variable right_index is not used.
| right_index = right_index[positions] |
|
|
||
| elif (keep == "first") and (starts is not None) and (ends is None): | ||
| total = np.count_nonzero(counts_array) | ||
| left_index = janitor_rs.trim_index( |
Copilot
AI
Dec 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Variable left_index is not used.
| ) | ||
| elif (keep == "first") and (starts is None) and (ends is not None): | ||
| total = np.count_nonzero(counts_array) | ||
| left_index = janitor_rs.trim_index( |
Copilot
AI
Dec 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Variable left_index is not used.
PR Description
This PR relates to #1497 .
Please tag maintainers to review.