Description
Problem
The ds_utils/preprocess.py module has grown to ~650 lines and contains multiple distinct responsibilities, making it harder to navigate and maintain. While the code quality is good and functions are well-structured, the module would benefit from being split into more focused submodules.
Current Issues
- Mixed Responsibilities: The module handles both visualization and statistical computation functions
- Large Number of Helper Functions: 10+ private helper functions for plotting make the file harder to navigate
- Two Distinct Domains:
  - Visualization functions (visualize_feature, visualize_correlations, plot_*)
  - Statistical/data processing functions (extract_statistics_dataframe_per_label, compute_mutual_information, get_correlated_features)
Proposed Solution
Refactor preprocess.py into a package structure:
ds_utils/
├── preprocess/
│   ├── __init__.py        # Re-export all public functions for backward compatibility
│   ├── visualization.py   # All public visualization functions
│   ├── statistics.py      # All statistical computation functions
│   └── _plot_helpers.py   # Private plotting utility functions
Module Breakdown
visualization.py
- visualize_feature()
- visualize_correlations()
- plot_correlation_dendrogram()
- plot_features_interaction()
statistics.py
- extract_statistics_dataframe_per_label()
- compute_mutual_information()
- get_correlated_features()
_plot_helpers.py
- _plot_clean_violin_distribution()
- _plot_datetime_heatmap()
- _is_categorical_like()
- _plot_categorical_feature1()
- _plot_xy()
- _plot_datetime_feature1()
- _plot_numeric_features()
- _plot_categorical_vs_categorical()
- _plot_categorical_vs_datetime()
- _plot_categorical_vs_numeric()
- _copy_series_or_keep_top_10()
- _convert_numbers_to_dates()
__init__.py
"""Data preprocessing utilities."""
from ds_utils.preprocess.statistics import (
    compute_mutual_information,
    extract_statistics_dataframe_per_label,
    get_correlated_features,
)
from ds_utils.preprocess.visualization import (
    plot_correlation_dendrogram,
    plot_features_interaction,
    visualize_correlations,
    visualize_feature,
)
__all__ = [
    "compute_mutual_information",
    "extract_statistics_dataframe_per_label",
    "get_correlated_features",
    "plot_correlation_dendrogram",
    "plot_features_interaction",
    "visualize_correlations",
    "visualize_feature",
]
Benefits
- Better Organization: Clear separation between visualization and statistics
- Easier Navigation: Smaller files (~150-250 lines each) are easier to read
- Backward Compatibility: Re-exporting from __init__.py ensures existing imports continue to work
- Improved Maintainability: Changes to visualization logic won't affect statistics code and vice versa
- Better Testing: Test file can mirror the structure for better organization (optional follow-up)
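The backward-compatibility point can be checked in isolation with a toy package. The sketch below is only an illustration: toy_ds_utils and its stub functions are hypothetical stand-ins written to a temporary directory, not the real ds_utils code. It shows that a name re-exported from a package's __init__.py is still importable from the package path, while the function itself lives in a submodule.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_toy_package(root: Path) -> None:
    """Write a tiny package that mimics the proposed layout (stubs only)."""
    top = root / "toy_ds_utils"  # hypothetical name, not the real ds_utils
    pkg = top / "preprocess"
    pkg.mkdir(parents=True)
    (top / "__init__.py").write_text("")
    (pkg / "statistics.py").write_text(
        "def get_correlated_features():\n    return 'stats stub'\n"
    )
    (pkg / "visualization.py").write_text(
        "def visualize_feature():\n    return 'viz stub'\n"
    )
    init_lines = [
        "from toy_ds_utils.preprocess.statistics import get_correlated_features",
        "from toy_ds_utils.preprocess.visualization import visualize_feature",
        '__all__ = ["get_correlated_features", "visualize_feature"]',
    ]
    (pkg / "__init__.py").write_text("\n".join(init_lines) + "\n")


def load_toy_preprocess():
    """Build the toy package in a temp dir and import toy_ds_utils.preprocess."""
    root = Path(tempfile.mkdtemp())  # left on disk; fine for a throwaway demo
    build_toy_package(root)
    sys.path.insert(0, str(root))
    importlib.invalidate_caches()
    return importlib.import_module("toy_ds_utils.preprocess")


preprocess = load_toy_preprocess()

# The package-level import path resolves even though the function is
# defined in the visualization submodule.
from toy_ds_utils.preprocess import visualize_feature

print(visualize_feature.__module__)
```

One consequence worth keeping in mind: the function's __module__ now names the submodule, so anything that relies on the old module path as a string (rather than on the import) would see a different value.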
Implementation Checklist
- Create ds_utils/preprocess/ directory
- Create statistics.py with statistical functions
- Create visualization.py with visualization functions
- Create _plot_helpers.py with private helper functions
- Create __init__.py with re-exports for backward compatibility
- Update imports in statistics.py and visualization.py (e.g., from ds_utils.preprocess._plot_helpers import ...)
- Delete original ds_utils/preprocess.py file
- Run all tests to ensure no regressions (pytest tests/test_preprocess.py -v)
- Update any internal imports if needed
- (Optional) Update documentation/README if module structure is documented
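When sorting functions into the new files, it may help to list the module's top-level functions and separate public from private names mechanically. The following is a minimal sketch using the standard-library ast module; split_top_level_functions and the SAMPLE source are hypothetical, and the leading-underscore rule is an assumption taken from the naming used in the module breakdown.

```python
import ast


def split_top_level_functions(source: str):
    """Return (public, private) top-level function names found in source.

    A name counts as private when it starts with an underscore.
    """
    tree = ast.parse(source)
    public, private = [], []
    for node in tree.body:  # top-level statements only, in source order
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            target = private if node.name.startswith("_") else public
            target.append(node.name)
    return public, private


# Small stand-in for the real module source (illustrative only).
SAMPLE = '''
def visualize_feature(series):
    return _plot_xy(series)

def _plot_xy(series):
    return series

def get_correlated_features(frame):
    return frame
'''

public, private = split_top_level_functions(SAMPLE)
print("public:", public)
print("private:", private)
```

To run it against the actual file, the source string could be read with pathlib.Path("ds_utils/preprocess.py").read_text() before the file is deleted.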
Testing Strategy
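All existing tests in tests/test_preprocess.py should pass without modification thanks to the re-exports in __init__.py, provided they use only the public names and do not patch attributes on the ds_utils.preprocess module itself (for example via unittest.mock.patch). The imports: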
from ds_utils.preprocess import (
    compute_mutual_information,
    extract_statistics_dataframe_per_label,
    get_correlated_features,
    plot_correlation_dendrogram,
    plot_features_interaction,
    visualize_correlations,
    visualize_feature,
)
will continue to work exactly as before.
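A small smoke test can guard the public surface during the refactor. The sketch below is an assumption about how such a check might look, not part of the existing test suite: missing_public_names is a hypothetical helper, and a synthetic module built with types.ModuleType stands in for ds_utils.preprocess so the example runs on its own.

```python
import types


def missing_public_names(module, expected_names):
    """Return expected names that the module lacks or omits from __all__."""
    exported = set(getattr(module, "__all__", ()))
    return sorted(
        name
        for name in expected_names
        if not hasattr(module, name) or name not in exported
    )


EXPECTED = [
    "compute_mutual_information",
    "extract_statistics_dataframe_per_label",
    "get_correlated_features",
    "plot_correlation_dendrogram",
    "plot_features_interaction",
    "visualize_correlations",
    "visualize_feature",
]

# Stand-in module that deliberately forgets the last expected name.
fake = types.ModuleType("fake_preprocess")
for name in EXPECTED[:-1]:
    setattr(fake, name, lambda: None)
fake.__all__ = EXPECTED[:-1]

print(missing_public_names(fake, EXPECTED))  # ['visualize_feature']
```

In the real suite, the same helper could be called with the imported ds_utils.preprocess module and asserted to return an empty list.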
Notes
- This is a pure refactoring with no functional changes
- All public APIs remain unchanged
- Consider doing this refactoring before adding significant new functionality, so that the module does not grow further first