GitHub - r0f1/ml_checklist: Explorative Data Analysis Guidelines

Get data into a usable format!
Find out whether the subsequent predictive modeling phase is likely to succeed!

Combine everything into a single big table
- Convert files to .csv
- Merge files
- Fix encoding issues
- Clean column names (English, no whitespace, no special chars)
- Are there duplicate columns?
- Fix datatypes (datetime, int, float, string)
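
The steps above can be sketched with pandas, which this checklist does not prescribe and is assumed here. The two in-memory CSV "files", their column names, and the helper `clean_column_name` are hypothetical; real files would additionally need the right `encoding=` argument in `pd.read_csv`.

```python
import io
import re

import pandas as pd

# Two tiny in-memory CSV "files" stand in for real exports (hypothetical data).
orders_csv = io.StringIO(
    "Order ID;Customer No.;Order Date\n1;10;2021-01-05\n2;11;2021-02-17\n"
)
customers_csv = io.StringIO("Customer No.,Revenue ($)\n10,100.5\n11,250.0\n")

orders = pd.read_csv(orders_csv, sep=";")
customers = pd.read_csv(customers_csv)


def clean_column_name(name: str) -> str:
    """Replace runs of non-alphanumerics with '_', trim underscores, lowercase."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()


orders.columns = [clean_column_name(c) for c in orders.columns]
customers.columns = [clean_column_name(c) for c in customers.columns]

# Merge into a single table; validate= raises if the key is not unique
# on the right side, which would silently duplicate rows.
table = orders.merge(customers, on="customer_no", how="left", validate="many_to_one")

# Fix datatypes explicitly instead of trusting inference.
table["order_date"] = pd.to_datetime(table["order_date"], format="%Y-%m-%d")
table["order_id"] = table["order_id"].astype("int64")

# Columns with identical content are candidates for duplicate columns.
duplicate_columns = table.columns[table.T.duplicated()].tolist()
print(table.dtypes)
print("duplicate columns:", duplicate_columns)
```
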
Look at the raw data
- Sort data
- Filter data by various criteria
Investigation
- Nonsensical observations/artifacts?
- Coding of categorical features?
- Missing values?
- Outliers?
- Constant values (=Zero Importance)?
- Low importance features?
- Collinear, correlated or otherwise dependent features?
- Highly skewed features?
- Irrelevant features?
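
Several of these questions can be answered mechanically. The following is a minimal sketch, assuming pandas and NumPy; the toy data and the 0.95 correlation threshold are illustrative choices, not part of the checklist.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],        # exactly 2 * a
    "c": [7.0, 7.0, 7.0, 7.0, 7.0, 7.0],          # constant
    "d": [1.0, np.nan, 1.0, 2.0, np.nan, 100.0],  # missing values, long right tail
})

# Share of missing values per column.
missing_share = df.isna().mean()

# Columns with at most one distinct non-missing value carry no information.
constant_columns = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]

# Absolute pairwise correlations; keep only the upper triangle so that
# every pair is reported once.
corr = df.drop(columns=constant_columns).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
correlated_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > 0.95  # assumed threshold
]

# Sample skewness; large absolute values hint at highly skewed features.
skewness = df.skew()

print(missing_share, constant_columns, correlated_pairs, skewness, sep="\n")
```
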
Univariate Analysis
- Look at mean, median, min, max, std, iqr, quantiles (1%, 5%, 25%, 50%, 75%, 95%, 99%)
- Draw boxplots, histograms
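
The summary statistics listed above can be computed in one call. This sketch assumes pandas; the sample values are made up, and the 1.5 * IQR fences are the common boxplot convention, used here as one possible outlier rule.

```python
import pandas as pd

values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 50], name="x")

# count, mean, std, min, max plus the requested quantiles.
quantiles = [0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99]
summary = values.describe(percentiles=quantiles)

# Interquartile range and the fences a boxplot would draw.
iqr = summary["75%"] - summary["25%"]
lower_fence = summary["25%"] - 1.5 * iqr
upper_fence = summary["75%"] + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)].tolist()

print(summary)
print("median:", values.median(), "iqr:", iqr, "outliers:", outliers)
```

With matplotlib installed, `values.plot.box()` and `values.plot.hist()` draw the corresponding plots.
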
Multivariate Analysis
- Draw scatter plots
- Create correlation matrix
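
A correlation matrix only captures what the chosen coefficient measures. The sketch below, assuming pandas and NumPy and using synthetic data, compares Pearson correlation with a rank-based (Spearman) variant, computed here as Pearson correlation on ranks.

```python
import numpy as np
import pandas as pd

x = np.arange(1, 11, dtype=float)
df = pd.DataFrame({
    "x": x,
    "linear": 3 * x + 1,       # linear in x
    "exponential": np.exp(x),  # monotone but strongly non-linear in x
})

pearson = df.corr()          # measures linear association
spearman = df.rank().corr()  # Pearson on ranks = Spearman rank correlation

print(pearson.round(3))
print(spearman.round(3))
```

The exponential column shows a clearly lower Pearson coefficient than the linear one, while the rank-based matrix reports a perfect monotone relationship for both, which is why it can help to look at both matrices next to the scatter plots.
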
Time Series? -> Plot variables over time
Fixing issues
- Impute missing values (mode, median, mean)
- Remove variables that have too many missing values
- Remove observations that have too many missing values
- Select appropriate time slice
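
A minimal sketch of the removal and imputation steps, assuming pandas. The data and the 50% threshold are hypothetical; what counts as "too many" depends on the dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0, np.nan],
    "city": ["Graz", "Vienna", None, "Vienna", None],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0, np.nan],
})

max_missing_share = 0.5  # assumed threshold

# Remove variables (columns) with too many missing values.
df = df.loc[:, df.isna().mean() <= max_missing_share]

# Remove observations (rows) with too many missing values.
df = df.loc[df.isna().mean(axis=1) <= max_missing_share].copy()

# Impute what is left: median for the numeric column, mode for the categorical one.
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode().iloc[0])

print(df)
```

When the data is later split into training and test sets, the imputation values should be computed on the training portion only, so that no information from the test set leaks into training.
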
Preparation
- Clip values that are too small/too large
- Scale to [0,1] (min-max), standardize (mean=0, std=1), or apply Robust / Quantile Scaling
- One-hot encoding, Label Encoding (0,1,2,3)
- Create log-transformed versions for highly skewed variables
- Create binned versions for variables
- Combine categories for highly skewed categorical variables
- Create sum/difference/product/quotient of variables
- Create polynomial features
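
Most of these preparation steps are one-liners. The sketch below assumes pandas and NumPy; the data, the 5%/95% clipping quantiles, and the derived column names are illustrative choices.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [1_000.0, 2_000.0, 3_000.0, 4_000.0, 1_000_000.0],
    "color": ["red", "green", "red", "blue", "green"],
})

# Clip values that are too small/too large (here: outside the 5%..95% quantiles).
low, high = df["income"].quantile([0.05, 0.95])
df["income_clipped"] = df["income"].clip(lower=low, upper=high)

# Min-max scaling to [0, 1] and standardization to mean=0, std=1.
col = df["income_clipped"]
df["income_minmax"] = (col - col.min()) / (col.max() - col.min())
df["income_std"] = (col - col.mean()) / col.std(ddof=0)

# Log-transformed version of a highly skewed variable (log1p also handles 0).
df["income_log"] = np.log1p(df["income"])

# Binned version: two quantile-based bins.
df["income_bin"] = pd.qcut(df["income"], q=2, labels=["low", "high"])

# A polynomial feature (degree 2) of the log-transformed variable.
df["income_log_sq"] = df["income_log"] ** 2

# Label encoding (integer codes in alphabetical category order), then one-hot.
df["color_code"] = df["color"].astype("category").cat.codes
df = pd.get_dummies(df, columns=["color"], prefix="color")

print(df.head())
```

Label encoding imposes an arbitrary order on the categories, which tree-based models can usually tolerate, while linear models tend to need the one-hot version.
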

Storing training data (browsable training data catalog)
Storing ground-truth annotations (when, by whom, which annotation guidelines)
Pre-processing steps
Training: Architecture + Loss function + Learning algorithm
Storing models and possibly ancestors (in case of not training from scratch)
Storing training hyperparameters, training environment (installed packages, CUDA drivers, VM machine images)
Storing loss function history and final performance metrics on test set.
Tools: Weights and Biases, Neptune, Iterative
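
Even without one of these tools, the items above can be captured in a plain JSON record per training run. This is a minimal sketch using only the Python standard library; the function `build_run_record`, its field names, and all values are hypothetical placeholders, not real results or the format of any of the tools listed.

```python
import json
import platform
import sys
import tempfile
from datetime import datetime, timezone
from importlib import metadata
from pathlib import Path


def build_run_record(hyperparameters, loss_history, test_metrics, parent_model=None):
    """Collect what is needed to reproduce and compare a training run."""
    # Installed packages as "name==version" strings.
    packages = sorted(
        f"{dist.metadata['Name']}=={dist.version}" for dist in metadata.distributions()
    )
    return {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "hyperparameters": hyperparameters,
        "parent_model": parent_model,  # ancestor when not training from scratch
        "environment": {
            "python": sys.version,
            "platform": platform.platform(),
            "packages": packages,
        },
        "loss_history": loss_history,
        "test_metrics": test_metrics,
    }


# Placeholder values for illustration only.
record = build_run_record(
    hyperparameters={"learning_rate": 1e-3, "batch_size": 32},
    loss_history=[0.9, 0.5, 0.3],
    test_metrics={"accuracy": 0.91},
)

# Write the record to a run directory and read it back.
run_file = Path(tempfile.mkdtemp()) / "run.json"
run_file.write_text(json.dumps(record, indent=2))
restored = json.loads(run_file.read_text())
print(sorted(restored.keys()))
```

GPU driver versions and VM images are not visible from inside Python in a portable way and would have to be recorded separately.
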
