76 refactor preprocesspy module for better maintainability #106
Open
idanmoradarthas wants to merge 31 commits into master
…ther modules for improved organization and maintainability.
- Introduced _plot_categorical.py for visualizing categorical data with functions for bar charts and categorical comparisons.
- Added _plot_datetime.py for datetime visualizations, including heatmaps and comparisons with numeric features.
- Created _plot_formatters.py for date formatting in plots.
- Implemented _plot_numeric.py for numeric feature visualizations, including violin plots.
- Added utility functions in _plot_utils.py for handling categorical-like data and series copying.

These additions enhance the visualization capabilities of the preprocessing module, improving data analysis workflows.
…ies. This update enhances the testing framework by introducing new test files for statistics and visualization functionalities.
- Removed the old test_preprocess.py file and replaced it with a new conftest.py for shared fixtures and a dedicated test_statistics.py for statistical functions.
- Introduced test_visualization.py to cover visualization functionalities, enhancing the overall testing framework for the preprocessing module.

- Introduced multiple new baseline images for testing various visualization scenarios in the preprocessing module.
- Updated test_visualization.py to reflect the new directory structure for baseline images, ensuring proper organization and accessibility for future tests.

- Enhanced documentation by adding descriptive docstrings to various modules within the preprocess package, including utilities for plotting categorical, datetime, and numeric features, and statistical functions.
- Updated the test suite with docstrings for clarity on the purpose of each test file, improving maintainability and understanding of the codebase.
…label_basic_functionality
- Added descriptions for the new submodules: visualization and statistics.
- Updated import statements in usage examples to align with the new structure.
- Enhanced clarity on the purpose of the preprocess module in the documentation.

- Adjusted sys.path in conf.py to ensure correct package imports.
- Updated index.rst to point to the new preprocess submodule index.
- Changed autofunction references in math_utils.rst, metrics.rst, strings.rst, unsupervised.rst, and xai.rst to include the ds_utils namespace.
- Removed the obsolete preprocess.rst file, streamlining the documentation structure.
…nto dedicated files
- Removed the monolithic metrics.py file and distributed its content across several new modules: confusion_matrix.py, curves.py, learning_curves.py, and probability_analysis.py.
- Introduced an __init__.py file to facilitate module imports.
- This restructuring enhances code maintainability and clarity, allowing for more focused development and testing of individual metric functionalities.

- Introduced descriptive docstrings in the metrics module files: __init__.py, confusion_matrix.py, curves.py, learning_curves.py, and probability_analysis.py.
- Enhanced clarity on the purpose and functionality of each module, aiding in maintainability and understanding of the codebase.

- Updated import statements in confusion_matrix.py, curves.py, learning_curves.py, and probability_analysis.py for improved organization.
- Introduced new test files for confusion matrix, curves, learning curves, and probability analysis, enhancing test coverage and maintainability.
- Added baseline images for various visualization tests to ensure consistent results.
- Implemented fixtures and setup logic in conftest.py for streamlined testing processes.

- Adjusted import paths in test_curves.py to align with the refactored metrics module, specifically changing the import statements for roc_curve and roc_auc_score to the new curves submodule.
- Ensured that all mock patches in the tests are consistent with the updated module organization, enhancing maintainability and clarity of the test suite.

- Updated the resource path for loading class_with_probabilities.csv in test_probability_analysis.py to use a defined RESOURCES_DIR variable, improving code readability and maintainability.
…t_curves.py
- Added logic to create the result directory if it does not exist, improving test setup reliability.

- Added descriptions for the focused submodules within the metrics module: confusion_matrix, curves, learning_curves, and probability_analysis.
- Updated import statements in usage examples to align with the new submodule structure.
- Corrected image paths to match the new organization, enhancing clarity and usability of the documentation.

- Updated index.rst to reflect the new organization of the metrics module by linking to metrics/index instead of the now-deleted metrics.rst file.
- Removed the obsolete metrics.rst file to streamline the documentation and improve clarity.
- Removed the existing test_xai.py file to streamline the test structure.
- Moved test files for decision paths, drawing trees, and plotting feature importance, enhancing test coverage and organization.
- Moved baseline images for visual tests to ensure consistent results.
- Moved shared fixtures into conftest.py for improved test setup.
- Updated the __init__.py file to define the test suite for the xai module.

- Corrected image paths in README.md and xai.rst to reflect the new directory structure for baseline images, ensuring accurate references for visualizations.
- This change enhances the clarity and usability of the documentation.

- Moved the baseline images for the plot_correlation_dendrogram, plot_features_interaction, and visualize_correlations functions to ensure consistent visual outputs.
- Split tests/test_preprocess/test_visualization.py into test_visualize_feature.py, test_visualize_correlations.py, test_plot_correlation_dendrogram.py, test_plot_relationship_between_features.py, and test_plot_features_interaction.py.

…istency
- Corrected image paths in README.md and visualization.rst to reflect the new directory structure for baseline images, ensuring accurate references for visualizations.
- This change enhances the clarity and usability of the documentation.

- Expanded the docstring in __init__.py to include detailed descriptions of the available functions for evaluating and visualizing model performance, improving clarity for users.
- Updated import statements in _plot_formatters.py to use FuncFormatter directly, streamlining the code.
Refactor preprocess module for better maintainability
Summary
Refactors the ds_utils.preprocess module from a monolithic structure into a modular package with clear separation of concerns. This improves maintainability and testability, and makes the codebase easier to navigate and extend.
Motivation
Changes
New Package Structure
The preprocess functionality is now organized as a package with the following modules:
- visualization.py: visualize_feature, visualize_correlations, plot_correlation_dendrogram, plot_features_interaction
- statistics.py: get_correlated_features, extract_statistics_dataframe_per_label, compute_mutual_information
- _plot_categorical.py
- _plot_datetime.py
- _plot_numeric.py
- _plot_utils.py (_copy_series_or_keep_top_10, _is_categorical_like)
- _plot_formatters.py (_convert_numbers_to_dates for Matplotlib)
Design Decisions
- Modules prefixed with _ are implementation details and not part of the public API
- Shared helpers live in _plot_utils and _plot_formatters to avoid duplication
Backward Compatibility
Documentation, tests, and README examples have been updated to reflect the new structure where applicable.
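For reviewers who want to see the underscore convention from the design decisions in action, here is a minimal, self-contained sketch. It is not the ds_utils code: the package name demo_preprocess is hypothetical, and the function bodies are placeholders that only borrow the names visualize_feature and _is_categorical_like. It writes a tiny package to a temporary directory, imports it, and shows that the public name is reachable from the package root while the helper stays inside its private module.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical miniature package: "demo_preprocess" and the function bodies
# below are placeholders for illustration, not the actual ds_utils code.
SOURCES = {
    "__init__.py": (
        '"""Public entry point: re-export only the public names."""\n'
        "from .visualization import visualize_feature\n"
        "__all__ = ['visualize_feature']\n"
    ),
    "_plot_utils.py": (
        "def _is_categorical_like(values):\n"
        "    # Placeholder rule: treat all-string input as categorical.\n"
        "    return all(isinstance(v, str) for v in values)\n"
    ),
    "visualization.py": (
        "from ._plot_utils import _is_categorical_like\n"
        "\n"
        "def visualize_feature(values):\n"
        "    # Stand-in for plotting: report which branch would be taken.\n"
        "    return 'categorical' if _is_categorical_like(values) else 'numeric'\n"
    ),
}


def build_demo_package(root: str) -> None:
    """Write the miniature package into <root>/demo_preprocess."""
    package_dir = Path(root) / "demo_preprocess"
    package_dir.mkdir(parents=True, exist_ok=True)
    for filename, source in SOURCES.items():
        (package_dir / filename).write_text(source)


# Build the package in a temporary directory and make it importable.
tmp_root = tempfile.mkdtemp()
build_demo_package(tmp_root)
sys.path.insert(0, tmp_root)
importlib.invalidate_caches()
demo = importlib.import_module("demo_preprocess")

# The public function is available from the package root.
print(demo.visualize_feature(["a", "b"]))
print(demo.visualize_feature([1, 2]))

# The helper is not re-exported; it is only reachable via its private module.
print(hasattr(demo, "_is_categorical_like"))
```

The same idea applies to the layout in this PR: callers import from the public modules (visualization, statistics), and anything in a module starting with _ can change without notice.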
Testing
- Tests in tests/test_preprocess/ pass without modification
Checklist
- Private modules use the _ prefix