Refactor preprocess.py Module for Better Maintainability #76

@idanmoradarthas

Problem

The ds_utils/preprocess.py module has grown to ~650 lines and contains multiple distinct responsibilities, making it harder to navigate and maintain. While the code quality is good and functions are well-structured, the module would benefit from being split into more focused submodules.

Current Issues

  1. Mixed Responsibilities: The module handles both visualization and statistical computation functions
  2. Large Number of Helper Functions: 10+ private helper functions for plotting make the file harder to navigate
  3. Two Distinct Domains:
    • Visualization functions (visualize_feature, visualize_correlations, plot_*)
    • Statistical/data processing functions (extract_statistics_dataframe_per_label, compute_mutual_information, get_correlated_features)

Proposed Solution

Refactor preprocess.py into a package structure:

ds_utils/
├── preprocess/
│   ├── __init__.py           # Re-export all public functions for backward compatibility
│   ├── visualization.py      # All public visualization functions
│   ├── statistics.py         # All statistical computation functions
│   └── _plot_helpers.py      # Private plotting utility functions

Module Breakdown

visualization.py

  • visualize_feature()
  • visualize_correlations()
  • plot_correlation_dendrogram()
  • plot_features_interaction()

statistics.py

  • extract_statistics_dataframe_per_label()
  • compute_mutual_information()
  • get_correlated_features()

_plot_helpers.py

  • _plot_clean_violin_distribution()
  • _plot_datetime_heatmap()
  • _is_categorical_like()
  • _plot_categorical_feature1()
  • _plot_xy()
  • _plot_datetime_feature1()
  • _plot_numeric_features()
  • _plot_categorical_vs_categorical()
  • _plot_categorical_vs_datetime()
  • _plot_categorical_vs_numeric()
  • _copy_series_or_keep_top_10()
  • _convert_numbers_to_dates()
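
A breakdown like the one above can be cross-checked mechanically. The sketch below uses the standard-library ast module to list the functions defined at module level and split them by leading underscore. The function name split_top_level_functions and the stand-in source string are illustrative assumptions, not part of ds_utils; in practice the input would be the text of ds_utils/preprocess.py.

```python
import ast


def split_top_level_functions(source: str) -> tuple[list[str], list[str]]:
    """Return (public, private) names of functions defined at module level."""
    tree = ast.parse(source)
    public, private = [], []
    for node in tree.body:  # module-level statements only; nested defs are skipped
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            (private if node.name.startswith("_") else public).append(node.name)
    return public, private


# Stand-in source text (assumption); replace with the contents of preprocess.py.
example_source = '''
def visualize_feature(): ...
def _plot_xy(): ...
def get_correlated_features(): ...
'''

public_names, private_names = split_top_level_functions(example_source)
print(public_names)
print(private_names)
```

Running this against the real module before and after the split would show whether any function was dropped or duplicated during the move.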

__init__.py

"""Data preprocessing utilities."""

from ds_utils.preprocess.statistics import (
    compute_mutual_information,
    extract_statistics_dataframe_per_label,
    get_correlated_features,
)
from ds_utils.preprocess.visualization import (
    plot_correlation_dendrogram,
    plot_features_interaction,
    visualize_correlations,
    visualize_feature,
)

__all__ = [
    "compute_mutual_information",
    "extract_statistics_dataframe_per_label",
    "get_correlated_features",
    "plot_correlation_dendrogram",
    "plot_features_interaction",
    "visualize_correlations",
    "visualize_feature",
]
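
To illustrate why the re-exports keep the old import path working, here is a minimal, self-contained sketch. It builds a throwaway package on disk and imports a function both through the package and through its submodule. The names demo_pkg and get_answer are hypothetical and exist only for this demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical package and function names, used only for this demonstration.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_pkg" / "preprocess"
pkg.mkdir(parents=True)
(root / "demo_pkg" / "__init__.py").write_text("")
(pkg / "statistics.py").write_text("def get_answer():\n    return 42\n")
# The package __init__ re-exports the public name from its submodule.
(pkg / "__init__.py").write_text(
    "from demo_pkg.preprocess.statistics import get_answer\n"
    '__all__ = ["get_answer"]\n'
)

# Make the temporary directory importable.
sys.path.insert(0, str(root))
importlib.invalidate_caches()

# The old-style import path still resolves after the split.
from demo_pkg.preprocess import get_answer
from demo_pkg.preprocess.statistics import get_answer as direct

# Both paths should yield the very same function object.
same_object = get_answer is direct
print(same_object, get_answer())
```

Because the package-level name is bound to the same function object as the submodule-level name, callers should not be able to tell the two import paths apart.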

Benefits

  1. Better Organization: Clear separation between visualization and statistics
  2. Easier Navigation: Smaller files (~150-250 lines each) are easier to read
  3. Backward Compatibility: Re-exporting the public functions from __init__.py keeps existing imports from ds_utils.preprocess working
  4. Improved Maintainability: Changes to visualization logic are less likely to touch statistics code, and vice versa
  5. Better Testing: Test files can mirror the package structure for better organization (optional follow-up)

Implementation Checklist

  • Create ds_utils/preprocess/ directory
  • Create statistics.py with statistical functions
  • Create visualization.py with visualization functions
  • Create _plot_helpers.py with private helper functions
  • Create __init__.py with re-exports for backward compatibility
  • Update imports in statistics.py and visualization.py (e.g., from ds_utils.preprocess._plot_helpers import ...)
  • Delete original ds_utils/preprocess.py file
  • Run all tests to ensure no regressions (pytest tests/test_preprocess.py -v)
  • Update any internal imports if needed
  • (Optional) Update documentation/README if module structure is documented
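
One step in the checklist is easy to miss: if the original preprocess.py is left next to the new preprocess/ directory, the import system should resolve the package directory first and the stale file would be silently ignored, which is confusing to debug. The sketch below looks for such leftovers. The function name find_shadowed_modules and the temporary layout are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def find_shadowed_modules(package_dir: Path) -> list[Path]:
    """Return name.py files that sit next to a name/__init__.py package."""
    shadowed = []
    for module_file in sorted(package_dir.glob("*.py")):
        sibling_package = package_dir / module_file.stem
        if (sibling_package / "__init__.py").is_file():
            shadowed.append(module_file)
    return shadowed


# Throwaway layout that mimics a half-finished migration.
root = Path(tempfile.mkdtemp())
(root / "preprocess").mkdir()
(root / "preprocess" / "__init__.py").write_text("")
(root / "preprocess.py").write_text("# stale copy\n")
(root / "metrics.py").write_text("")

leftovers = [path.name for path in find_shadowed_modules(root)]
print(leftovers)
```

Pointed at the ds_utils/ directory, an empty result would indicate that the original file was removed as planned.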

Testing Strategy

All existing tests in tests/test_preprocess.py should pass without modification thanks to the re-exports in __init__.py, provided they rely only on the public functions. The imports:

from ds_utils.preprocess import (
    compute_mutual_information,
    extract_statistics_dataframe_per_label,
    get_correlated_features,
    plot_correlation_dendrogram,
    plot_features_interaction,
    visualize_correlations,
    visualize_feature,
)

will continue to work exactly as before.
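
As an extra guard, a small check along these lines could confirm that every expected public name is still exposed as a callable. This is only a sketch: missing_public_names is a hypothetical helper, and it is demonstrated on the standard-library json module so it runs without ds_utils installed. For this issue it would target "ds_utils.preprocess" and the seven names listed in __all__.

```python
import importlib


def missing_public_names(module_name: str, expected: list[str]) -> list[str]:
    """Return the expected names that the module does not expose as callables."""
    module = importlib.import_module(module_name)
    return [name for name in expected if not callable(getattr(module, name, None))]


# Demonstrated on a standard-library module so the snippet runs anywhere.
present = missing_public_names("json", ["dumps", "loads"])
absent = missing_public_names("json", ["dumps", "not_a_real_function"])
print(present, absent)
```

An empty list means all expected names resolved; any name in the result points at a re-export that was forgotten.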

Notes

  • This is a pure refactoring with no functional changes
  • All public APIs remain unchanged
  • Consider doing this refactoring before adding significant new functionality, so the module does not grow further

Metadata

Labels

enhancement: New feature or request
