Imputing missing values using KNN-imputation approach and then fitting the data using RandomForestRegressor
The real world is full of missing values! Either we ignore the samples that have missing values or we impute them.
Objective: To work with a dataset that has missing values, impute them, and apply a regression algorithm to compare performance with and without imputation.
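As a rough illustration of this workflow, here is a minimal sketch that chains scikit-learn's KNNImputer and RandomForestRegressor in a pipeline. The data is synthetic (an assumption standing in for the survey data), and the parameter values (n_neighbors=5, n_estimators=50, about 10% missing values) are arbitrary choices for the example, not settings taken from this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the survey data (assumption, not the real dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

# Knock out roughly 10% of the feature values at random.
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X_missing, y, random_state=0
)

# Imputing inside the pipeline means the imputer is fit on the training
# split only, so no information leaks from the test split.
model = make_pipeline(
    KNNImputer(n_neighbors=5),
    RandomForestRegressor(n_estimators=50, random_state=0),
)
model.fit(X_train, y_train)

r2 = model.score(X_test, y_test)
print(f"R^2 with KNN imputation: {r2:.3f}")
```

Keeping the imputer and the regressor in one pipeline also makes it easy to swap either step later.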
Dataset:
The Public 2020 Stack Overflow Developer Survey Results
https://insights.stackoverflow.com/survey
- Provide proper documentation
- Dataset characteristics
- Try different imputation methods (for example, mean, median)
- Try other regression algorithms
- Provide the results of using different imputation methods and no imputation (comparison)
- Select the features based on feature importance (currently they are selected intuitively)
- Investigate whether it makes sense to encode the values of some features
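The comparison items above could be approached along these lines. This is only a sketch on synthetic data (an assumption, not the survey data): it scores a RandomForestRegressor with mean, median, and KNN imputation, and with no imputation by dropping incomplete rows. The dictionary keys and the make_model helper are names invented for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data with about 10% of the feature values missing (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=300)
X[rng.random(X.shape) < 0.1] = np.nan


def make_model():
    # Hypothetical helper: a fresh regressor with fixed settings per run.
    return RandomForestRegressor(n_estimators=50, random_state=0)


scores = {}

# No imputation: keep only the rows without any missing value.
# Note that this baseline is scored on a smaller, different set of rows.
complete = ~np.isnan(X).any(axis=1)
scores["drop rows"] = cross_val_score(
    make_model(), X[complete], y[complete], cv=5, scoring="r2"
).mean()

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
}
for name, imputer in imputers.items():
    pipe = make_pipeline(imputer, make_model())
    scores[name] = cross_val_score(pipe, X, y, cv=5, scoring="r2").mean()

for name, score in scores.items():
    print(f"{name:>10}: mean R^2 = {score:.3f}")
```

Other regressors can be tried by changing what make_model returns, which keeps the imputation comparison itself unchanged.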
References:
https://www.youtube.com/watch?v=xl0N7tHiwlw
https://stackoverflow.com/questions/54444260/labelencoder-that-keeps-missing-values-as-nan (for encoding the non-numerical feature values while keeping the missing values as missing)
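One possible way to encode a non-numerical column while keeping its missing values as NaN is sketched below. It is not necessarily the approach from the linked answer: it simply fits scikit-learn's LabelEncoder on the non-missing entries only. The function name encode_keep_nan and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_keep_nan(series):
    """Label-encode the non-missing values; leave missing values as NaN."""
    mask = series.notna()
    encoder = LabelEncoder()
    # Start from an all-NaN float column, then fill in the encoded labels.
    encoded = pd.Series(np.nan, index=series.index, dtype="float64")
    encoded[mask] = encoder.fit_transform(series[mask])
    return encoded, encoder


# Made-up sample column with two missing entries.
languages = pd.Series(["Python", np.nan, "Go", "Python", None])
encoded, encoder = encode_keep_nan(languages)
print(encoded.tolist())
print(list(encoder.classes_))
```

Because the missing entries stay as NaN, the encoded column can be passed on to an imputer such as KNNImputer afterwards. Whether imputing integer labels is meaningful for a given feature is a separate question.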