In this case is possible to work through a study regression predictive modeling problem in Python including each step of the applied machine learning process. Some steps of the process:
- How to use data transforms to improve model performance.
- How to use algorithm tunning to improve model performance.
- How to use ensemble methods and tunning of ensemble methods to improve model performance.
For this project the Boston House Price dataset was investigated. Each record in the dataset describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Area (SMSA) in 1970.
The attributes are defined as follows (taken from the UCI Machine Learning Repository: http://lib.stat.cmu.edu/datasets/boston):
- CRIM: per capita crime rate by town
- ZN: proportion of residencial land zoned for lots over 25,000 sq.ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centers
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk − 0:63)2 where Bk is the proportion of blacks by town
- LSTAT: % lower status of the population
- MEDV: Median value of owner-occupied homes in $1000s
- Problem Definition (Boston house price data).
- Loading the Dataset.
- Analyze Data (some skewed distributions and correlated attributes).
- Evaluate Algorithms (Linear Regression looked good).
- Evaluate Algorithms with Standardization (KNN looked good).
- Algorithm Tuning (K=3 for KNN was best).
- Ensemble Methods (Bagging and Boosting, Gradient Boosting looked good).
- Tuning Ensemble Methods (getting the most from Gradient Boosting).
- Finalize Model (use all training data and confirm using validation dataset)