Skip to content

Conversation

@samukweku
Copy link
Collaborator

PR Description

  • rewrite loop functions in rust for more perf
  • gradual deprecation of numba (rust replacemnt, can run tests confidently in rust,)

This PR relates to #1497 .

Please tag maintainers to review.

@samukweku samukweku self-assigned this Dec 23, 2025
@github-actions
Copy link

github-actions bot commented Dec 23, 2025

PR Preview Action v1.6.3

🚀 View preview at
https://pyjanitor-devs.github.io/pyjanitor/pr-preview/pr-1548/

Built to branch gh-pages at 2025-12-23 14:03 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a significant refactor to replace Numba with Rust implementations for conditional join operations, aiming to improve performance and enable gradual deprecation of Numba. The changes add a new Rust dependency (janitor-rs) and restructure internal conditional join logic with new helper modules.

Key Changes:

  • Added janitor-rs>=0.3.0,<0.4 as a new dependency
  • Refactored conditional join internals to use Rust-based implementations
  • Added new test cases for keep="first" and keep="last" behavior in range joins
  • Deprecated use_numba, df_columns, and right_columns parameters (in documentation only)

Reviewed changes

Copilot reviewed 12 out of 14 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
pyproject.toml Added janitor-rs dependency version constraint
pixi.lock Updated lock file with janitor-rs package and updated scipy/xarray/pyspark versions
tests/functions/test_conditional_join.py Added tests for keep="first" and keep="last", added use_numba=True to existing test
janitor/functions/conditional_join.py Refactored to use new Rust-based implementations, updated imports, documented deprecations
janitor/functions/_conditional_join/_range_indices.py New file implementing range join index computation using Rust
janitor/functions/_conditional_join/_not_equal_indices.py Updated to use np.intp consistently
janitor/functions/_conditional_join/_less_than_indices.py Refactored to use janitor_rs for index repetition
janitor/functions/_conditional_join/_helpers.py Added extensive helper functions for Rust integration
janitor/functions/_conditional_join/_greater_than_indices.py Refactored to use janitor_rs for index repetition
janitor/functions/_conditional_join/_get_indices_single_join.py New file for single condition joins
janitor/functions/_conditional_join/_get_indices_non_equi.py New file for multiple non-equi join conditions
janitor/functions/_conditional_join/_dual_non_equi.py New file for dual non-equi join implementation
.requirements/base.in Updated to use janitor_rs and unpinned pandas_flavor version
.pre-commit-config.yaml Added file size limit check (1MB)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +478 to +480
# print(df.to_dict())
# print("\n\n\n")
# print(right.to_dict())
Copy link

Copilot AI Dec 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug print statements should be removed before merging to production. These commented-out print statements appear to be leftover debugging code.

Suggested change
# print(df.to_dict())
# print("\n\n\n")
# print(right.to_dict())

Copilot uses AI. Check for mistakes.
if op == _JoinOperator.STRICTLY_EQUAL.value:
eq_check = True
elif op in less_than_join_types.union(greater_than_join_types):
if op in less_than_join_types.union(greater_than_join_types):
Copy link

Copilot AI Dec 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition logic has changed from elif to if, which means both branches can now execute when op == "==". This appears to be a bug as eq_check and le_lt_check could both be set to True for equality operators, changing the original mutually exclusive behavior.

Suggested change
if op in less_than_join_types.union(greater_than_join_types):
elif op in less_than_join_types.union(greater_than_join_types):

Copilot uses AI. Check for mistakes.
Comment on lines +699 to 700
return_matching_indices=False,
row_count=row_count,
Copy link

Copilot AI Dec 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter name has changed from return_ragged_arrays to return_matching_indices in the function call at line 699, but the function signature at line 682 still uses the old name return_ragged_arrays. This mismatch will cause a TypeError at runtime.

Copilot uses AI. Check for mistakes.
Comment on lines +269 to +275
!!!warning "Deprecated in 0.33.0"
right_columns: Columns to select from `right` in the final output dataframe.
Column selection is based on the
[`select`][janitor.functions.select.select] syntax.
!!!warning "Deprecated in 0.33.0"
use_numba: Use numba, if installed, to accelerate the computation.
!!!warning "Deprecated in 0.33.0"
Copy link

Copilot AI Dec 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deprecation warnings in the docstring are using incorrect syntax. The !!!warning directive should be !!! warning (with a space after the exclamation marks) to properly render in MkDocs/Material for proper documentation display.

Suggested change
!!!warning "Deprecated in 0.33.0"
right_columns: Columns to select from `right` in the final output dataframe.
Column selection is based on the
[`select`][janitor.functions.select.select] syntax.
!!!warning "Deprecated in 0.33.0"
use_numba: Use numba, if installed, to accelerate the computation.
!!!warning "Deprecated in 0.33.0"
!!! warning "Deprecated in 0.33.0"
right_columns: Columns to select from `right` in the final output dataframe.
Column selection is based on the
[`select`][janitor.functions.select.select] syntax.
!!! warning "Deprecated in 0.33.0"
use_numba: Use numba, if installed, to accelerate the computation.
!!! warning "Deprecated in 0.33.0"

Copilot uses AI. Check for mistakes.
Comment on lines +404 to 407
# deprecate in a future version
check("use_numba", use_numba, [bool])

check("indicator", indicator, [bool, str])
Copy link

Copilot AI Dec 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the documentation states that use_numba, df_columns, and right_columns are deprecated in version 0.33.0, there are no actual deprecation warnings being raised in the code. Users should receive a DeprecationWarning or FutureWarning when using these parameters. Consider using the deprecated_kwargs decorator or adding explicit warning calls to inform users about these deprecations.

Copilot uses AI. Check for mistakes.
left_index = left_index[booleans]
left_region = left_region[booleans]
positions = positions[booleans]
right_index = right_index[positions]
Copy link

Copilot AI Dec 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variable right_index is not used.

Suggested change
right_index = right_index[positions]

Copilot uses AI. Check for mistakes.

elif (keep == "first") and (starts is not None) and (ends is None):
total = np.count_nonzero(counts_array)
left_index = janitor_rs.trim_index(
Copy link

Copilot AI Dec 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variable left_index is not used.

Copilot uses AI. Check for mistakes.
)
elif (keep == "first") and (starts is None) and (ends is not None):
total = np.count_nonzero(counts_array)
left_index = janitor_rs.trim_index(
Copy link

Copilot AI Dec 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variable left_index is not used.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants