To run this project, some Python libraries are required. They are listed in requirements.txt.
The recommended way to install them is inside a virtual environment, using either conda or pip.
With Anaconda installed, execute the following in an Anaconda command prompt:
-
Change to this project's folder
$ cd "path/to/folder"
-
Create a virtual environment.
$ conda create -y --name env-name
-
Activate your new environment.
$ conda activate env-name
-
Install the required libraries.
$ conda install --force-reinstall -y --name env-name -c conda-forge --file requirements.txt
With Python 3+ and pip installed, execute the following in a Windows command prompt:
-
Change to this project's folder
$ cd "path/to/folder"
-
Create a virtual environment.
$ py -m venv env-name
-
Activate your new environment.
$ .\env-name\Scripts\activate
-
Install the required libraries.
$ pip install -r requirements.txt
Using conda is preferred: pip may fail to install libraries such as shap on Windows if some Microsoft Visual Studio resources are not already installed.
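After either installation route, it can help to confirm that everything listed in requirements.txt actually made it into the environment. The following is a minimal sketch, not part of this project: the helper name missing_requirements is hypothetical, and it assumes requirements.txt contains one plain package name per line, optionally with a version pin. It uses only the Python standard library (Python 3.8+).

```python
from importlib import metadata
from pathlib import Path
import re


def missing_requirements(requirements_path="requirements.txt"):
    """Return the package names from the requirements file that are not installed.

    Hypothetical helper for illustration; it is not part of this project.
    """
    missing = []
    for raw_line in Path(requirements_path).read_text().splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        # Skip blank lines and option lines such as "-r other.txt".
        if not line or line.startswith("-"):
            continue
        # Keep only the name, dropping version pins, extras and markers.
        name = re.split(r"[<>=!~;\[\s]", line, maxsplit=1)[0]
        if not name:
            continue
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

With the virtual environment activated and this project's folder as the working directory, calling missing_requirements() returns an empty list when every listed library is found. Some conda packages are published under a different name than their pip counterpart, so a name reported here is a hint to check by hand rather than proof that the install failed.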
Once the libraries are installed, you can edit the source code and interact with the notebooks yourself.
To open the notebooks, with the virtual environment activated, in this project's folder, execute:
$ jupyter notebook
This should open a browser tab in which you can select and run the notebooks.
The initial notebooks are for data exploration: they examine the data's quality and help to understand the features' overall distribution before moving on to predictions.
Here models are built, tested and interpreted.
It is possible to rerun the prediction process step by step.
To do so, through the notebook IDE in notebook #2:
-
Restart the kernel, keeping only markdowns and source code:
Go to Kernel > Restart & Clear Output
-
Start executing cells:
Click '>| Run' to execute the selected cell's source code.
The notebook is designed to make it easy to test different models.
With a few small code edits, it is possible to perform feature and model selection.
To test the impact of feature selection, edit Cell 3, taking features in and out of the corresponding list.
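As a rough illustration of this kind of edit, the sketch below scores a model on a chosen subset of columns. It does not reproduce Cell 3: the synthetic data, the column names feat_a to feat_d and the variable name features are all placeholders, and it assumes pandas and scikit-learn are installed in the environment.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the project's dataset (hypothetical column names).
X_raw, y = make_regression(n_samples=200, n_features=4, n_informative=2, random_state=0)
data = pd.DataFrame(X_raw, columns=["feat_a", "feat_b", "feat_c", "feat_d"])

# The list to edit: add or remove names to change which features the model sees.
features = ["feat_a", "feat_b", "feat_c"]

X = data[features]
clf = RandomForestRegressor(n_estimators=50, random_state=0)

# Mean cross-validated R^2 for this subset; rerun after editing the list to compare.
score = cross_val_score(clf, X, y, cv=5).mean()
print(f"{len(features)} features -> mean R^2 = {score:.3f}")
```

Rerunning the cell with a different list and comparing the printed scores gives a quick read on whether a feature helps or hurts.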
To use a different model, import it from the relevant scikit-learn module and assign it to clf, like so:
from sklearn.ensemble import RandomForestRegressor
clf = RandomForestRegressor()
When defining the model, it is also possible to set different hyperparameters through its optional arguments.
This way, even with little programming experience, one can build, test and validate one's own models and get a good grasp of the machine learning workflow.
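For example, hyperparameters can be passed as keyword arguments when clf is created. This is a minimal sketch on synthetic data, not the notebook's own code; the particular values are arbitrary and chosen only to show the syntax.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the project's dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hyperparameters are set through the constructor's optional arguments.
clf = RandomForestRegressor(
    n_estimators=100,    # number of trees in the forest
    max_depth=5,         # cap on the depth of each tree
    min_samples_leaf=2,  # minimum number of samples required in a leaf
    random_state=0,      # fixed seed so reruns give the same result
)
clf.fit(X_train, y_train)

# For regressors, score() returns R^2 on the data it is given.
print(f"Test R^2: {clf.score(X_test, y_test):.3f}")
```

Any argument left out keeps the library's default value, so it is fine to change one setting at a time and compare the results.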