Skip to content

Comments

76 refactor preprocesspy module for better maintainability#106

Open
idanmoradarthas wants to merge 31 commits intomasterfrom
76-refactor-preprocesspy-module-for-better-maintainability
Open

76 refactor preprocesspy module for better maintainability#106
idanmoradarthas wants to merge 31 commits intomasterfrom
76-refactor-preprocesspy-module-for-better-maintainability

Conversation

@idanmoradarthas
Copy link
Owner

Refactor preprocess module for better maintainability

Summary

Refactors the ds_utils.preprocess module from a monolithic structure into a modular package with clear separation of concerns. This improves maintainability, testability, and makes the codebase easier to navigate and extend.

Motivation

  • Maintainability: A single large module is harder to understand, modify, and debug
  • Separation of concerns: Visualization logic, statistics, and internal plotting helpers have distinct responsibilities
  • Discoverability: Developers can quickly find the right module for their task
  • Extensibility: New plotting types or statistical utilities can be added without cluttering unrelated code

Changes

New Package Structure

The preprocess functionality is now organized as a package with the following modules:

Module Purpose
visualization.py Public API – High-level plotting functions: visualize_feature, visualize_correlations, plot_correlation_dendrogram, plot_features_interaction
statistics.py Public API – Statistical utilities: get_correlated_features, extract_statistics_dataframe_per_label, compute_mutual_information
_plot_categorical.py Internal – Plotting logic for categorical features (count bars, violin plots, histograms)
_plot_datetime.py Internal – Plotting logic for datetime features (heatmaps, time series, datetime vs numeric/categorical)
_plot_numeric.py Internal – Plotting logic for numeric features (violin distributions, scatter plots)
_plot_utils.py Internal – Shared utilities (_copy_series_or_keep_top_10, _is_categorical_like)
_plot_formatters.py Internal – Formatting helpers (e.g., _convert_numbers_to_dates for Matplotlib)

Design Decisions

  • Public vs internal: Modules prefixed with _ are implementation details and not part of the public API
  • Domain-based split: Plotting logic is grouped by feature type (categorical, datetime, numeric) rather than by function
  • Shared utilities: Common logic is extracted into _plot_utils and _plot_formatters to avoid duplication

Backward Compatibility

Documentation, tests, and README examples have been updated to reflect the new structure where applicable.

Testing

  • Existing tests in tests/test_preprocess/ pass without modification
  • Test coverage is preserved for all public functions

Checklist

  • Code refactored into logical modules
  • Internal modules properly namespaced with _ prefix
  • Documentation references updated
  • Tests pass

…ther modules for improved organization and maintainability.
- Introduced _plot_categorical.py for visualizing categorical data with functions for bar charts and categorical comparisons.
- Added _plot_datetime.py for datetime visualizations, including heatmaps and comparisons with numeric features.
- Created _plot_formatters.py for date formatting in plots.
- Implemented _plot_numeric.py for numeric feature visualizations, including violin plots.
- Added utility functions in _plot_utils.py for handling categorical-like data and series copying.

These additions enhance the visualization capabilities of the preprocessing module, improving data analysis workflows.
…ies. This update enhances the testing framework by introducing new test files for statistics and visualization functionalities.
- Removed the old test_preprocess.py file and replaced it with a new conftest.py for shared fixtures and a dedicated test_statistics.py for statistical functions.
- Introduced test_visualization.py to cover visualization functionalities, enhancing the overall testing framework for the preprocessing module.
- Introduced multiple new baseline images for testing various visualization scenarios in the preprocessing module.
- Updated the test_visualization.py to reflect the new directory structure for baseline images, ensuring proper organization and accessibility for future tests.
- Enhanced documentation by adding descriptive docstrings to various modules within the preprocess package, including utilities for plotting categorical, datetime, numeric features, and statistical functions.
- Updated the test suite with docstrings for clarity on the purpose of each test file, improving maintainability and understanding of the codebase.
- Added descriptions for the new submodules: visualization and statistics.
- Updated import statements in usage examples to align with the new structure.
- Enhanced clarity on the purpose of the preprocess module in the documentation.
- Adjusted sys.path in conf.py to ensure correct package imports.
- Updated index.rst to point to the new preprocess submodule index.
- Changed autofunction references in math_utils.rst, metrics.rst, strings.rst, unsupervised.rst, and xai.rst to include the ds_utils namespace.
- Removed the obsolete preprocess.rst file, streamlining the documentation structure.
…nto dedicated files

- Removed the monolithic metrics.py file and distributed its content across several new modules: confusion_matrix.py, curves.py, learning_curves.py, and probability_analysis.py.
- Introduced an __init__.py file to facilitate module imports.
- This restructuring enhances code maintainability and clarity, allowing for more focused development and testing of individual metric functionalities.
- Introduced descriptive docstrings in the metrics module files: __init__.py, confusion_matrix.py, curves.py, learning_curves.py, and probability_analysis.py.
- Enhanced clarity on the purpose and functionality of each module, aiding in maintainability and understanding of the codebase.
- Updated import statements in confusion_matrix.py, curves.py, learning_curves.py, and probability_analysis.py for improved organization.
- Introduced new test files for confusion matrix, curves, learning curves, and probability analysis, enhancing test coverage and maintainability.
- Added baseline images for various visualization tests to ensure consistent results.
- Implemented fixtures and setup logic in conftest.py for streamlined testing processes.
- Adjusted import paths in test_curves.py to align with the refactored metrics module, specifically changing the import statements for roc_curve and roc_auc_score to the new curves submodule.
- Ensured that all mock patches in the tests are consistent with the updated module organization, enhancing maintainability and clarity of the test suite.
- Updated the resource path for loading class_with_probabilities.csv in test_probability_analysis.py to use a defined RESOURCES_DIR variable, improving code readability and maintainability.
…t_curves.py

- Added logic to create the result directory if it does not exist, improving test setup reliability.
- Added descriptions for the focused submodules within the metrics module: confusion_matrix, curves, learning_curves, and probability_analysis.
- Updated import statements in usage examples to align with the new submodule structure.
- Corrected image paths to match the new organization, enhancing clarity and usability of the documentation.
- Updated index.rst to reflect the new organization of the metrics module by linking to the metrics/index instead of the now-deleted metrics.rst file.
- Removed the obsolete metrics.rst file to streamline the documentation and improve clarity.
- Removed the existing test_xai.py file to streamline the test structure.
-Move test files for decision paths, drawing trees, and plotting feature importance, enhancing test coverage and organization.
- Move baseline images for visual tests to ensure consistent results.
- Move shared fixtures in conftest.py for improved test setup.
- Updated the __init__.py file to define the test suite for the xai module.
- Corrected image paths in README.md and xai.rst to reflect the new directory structure for baseline images, ensuring accurate references for visualizations.
- This change enhances the clarity and usability of the documentation.
- Move new baseline images for the plot_correlation_dendrogram, plot_features_interaction, and visualize_correlations functions to ensure consistent visual outputs.
- Split tests/test_preprocess/test_visualization.py to test_visualize_feature.py, test_visualize_correlations.py, test_plot_correlation_dendrogram.py, test_plot_relationship_between_features.py, test_plot_features_interaction.py
…istency

- Corrected image paths in README.md and visualization.rst to reflect the new directory structure for baseline images, ensuring accurate references for visualizations.
- This change enhances the clarity and usability of the documentation.
- Expanded the docstring in __init__.py to include detailed descriptions of the available functions for evaluating and visualizing model performance, improving clarity for users.
- Updated import statements in _plot_formatters.py to use FuncFormatter directly, streamlining the code.
@idanmoradarthas idanmoradarthas self-assigned this Feb 21, 2026
@idanmoradarthas idanmoradarthas linked an issue Feb 21, 2026 that may be closed by this pull request
15 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Refactor preprocess.py Module for Better Maintainability

1 participant