This CLI tool provides a streamlined way to preprocess structured data files (CSV only) by offering various data cleaning and transformation functionalities. Users can execute individual preprocessing steps or chain multiple steps in a single command.
- **Load Data**: Load a dataset from a specified CSV file.
- **Handle Missing Values** (`mv`)
  - **Remove Missing Values** (`rm`): Removes rows containing any missing values.
  - **Fill with Default** (`fl_<value>`): Fills missing values with a specified default value (e.g., `fl_0` fills missing values with 0).
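
  For example, either sub-service can be run on its own (the input and output paths below are only illustrative):

  ```bash
  # Drop every row that contains a missing value
  /usr/local/bin/python3 data_tools.py --pipe="mv,rm" ../../input.csv ../../output_directory
  # Or fill missing values with 0 instead
  /usr/local/bin/python3 data_tools.py --pipe="mv,fl_0" ../../input.csv ../../output_directory
  ```
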
- **Remove Duplicates** (`dp`): Removes duplicate rows from the dataset.
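
  For example (paths are illustrative; `dp` takes no sub-service or parameter in the documented examples):

  ```bash
  # Remove duplicate rows
  /usr/local/bin/python3 data_tools.py --pipe="dp" ../../input.csv ../../output_directory
  ```
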
- **Normalization & Standardization** (`fs`)
  - **Normalize** (`nm`)
  - **Standardize** (`sd`)
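
  For example (paths are illustrative):

  ```bash
  # Normalize all features
  /usr/local/bin/python3 data_tools.py --pipe="fs,nm" ../../input.csv ../../output_directory
  # Standardize only the Age and Glucose features
  /usr/local/bin/python3 data_tools.py --pipe="fs,sd_Age_Glucose" ../../input.csv ../../output_directory
  ```
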
- **Export Processed File**: Saves the processed dataset to a specified CSV file.
- **CLI Supports Chaining**: Multiple processing steps can be applied in a single command.
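
  For example (paths are illustrative):

  ```bash
  # Fill missing values with 0, then normalize, then remove duplicates, in one run
  /usr/local/bin/python3 data_tools.py --pipe="mv,fl_0-fs,nm-dp" ../../input.csv ../../output_directory
  ```
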
- **Handle Outliers by Z-score** (`ol`)
  - **Remove outliers** (`rm`)
  - **Replace outliers** (`rp`)

  You can choose which features to check for outliers (e.g., `ol,rm_Age_Glucose`). If no features are given, the tool checks every feature in the CSV.
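
  For example (paths are illustrative):

  ```bash
  # Remove Z-score outliers in the Age and Glucose features only
  /usr/local/bin/python3 data_tools.py --pipe="ol,rm_Age_Glucose" ../../input.csv ../../output_directory
  # Replace outliers across every feature
  /usr/local/bin/python3 data_tools.py --pipe="ol,rp" ../../input.csv ../../output_directory
  ```
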
- **Encode Categorical Data** (`ec`)
  - **One-Hot Encoding** (`oh`)
  - **Ordinal Encoding** (`od`)

  Note that you must provide a feature name as a parameter for the encoding step, or the tool will raise an error. Example:

  `/usr/local/bin/python3 data_tools.py --pipe="mv,fl_0-fs,nm-ec,oh" ../../input.csv ../../output_directory`

  will ask you to add a feature name as a parameter for `oh`, while

  `/usr/local/bin/python3 data_tools.py --pipe="mv,fl_0-fs,nm-ec,oh_Age_Glucose" ../../input.csv ../../output_directory`

  will work.

No further features are currently planned; everything planned so far has been finished. (More will be added on request.)

The general command format is:

`/usr/local/bin/python3 data_tools.py --pipe="<steps>" <inputFilePath> <outputPath>`

- `,` separates a main service from its sub-service.
- `-` separates different main service lines (e.g., handling missing values, feature scaling).
- `_` separates a service from its parameter (e.g., `fl_0` means fill missing values with 0).
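
As an illustration, the chained `--pipe` string from the encoding example above breaks down as follows (paths are illustrative):

```bash
# Three main services chained with "-":
#   mv,fl_0            -> handle missing values, filling them with 0
#   fs,nm              -> feature scaling, using normalization
#   ec,oh_Age_Glucose  -> encode categorical data, one-hot encoding the Age and Glucose features
/usr/local/bin/python3 data_tools.py --pipe="mv,fl_0-fs,nm-ec,oh_Age_Glucose" ../../input.csv ../../output_directory
```
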
Load the data -> handle missing data -> normalize all features -> handle duplication -> write the output to a path:

`/usr/local/bin/python3 data_tools.py --pipe="mv,fl_0-fs,nm-dp" ../../input.csv ../../output_directory`

Fill missing values with 100, then remove duplicates:

`/usr/local/bin/python3 data_tools.py --pipe="mv,fl_100-dp" ../../input.csv ../../output_directory`

Standardize only the Age and Glucose features:

`/usr/local/bin/python3 data_tools.py --pipe="fs,sd_Age_Glucose" ../../input.csv ../../output_directory`

You only need Python installed; there are no other dependencies.

Contributions are welcome! Feel free to submit issues or pull requests.
MIT License