Capstone project for RS School ML course
This project uses the Forest CoverType dataset.
This package allows you to train a model for forest cover type prediction.
1. Clone this repository to your machine.
2. Download the Forest CoverType dataset and extract it to data/raw/ in the repository's root.
3. Use Python 3.9 and Poetry 1.1.11.
4. Install the project dependencies:
    poetry install --no-dev
5. Run training with the following command:
    poetry run train -d <path to csv with data> -s <path to save trained model>
   You can pass many other options in the CLI (e.g. select the model and choose hyperparameters). To get the full list, run:
    poetry run train --help
6. Run the MLflow UI to see the tracked experiments (models, parameters and metrics):
    poetry run mlflow ui
   Here are the results of running 2 models with different parameters and two feature engineering techniques.
   (Because my machine was so slow, I had to swap logistic regression for KNN and use the simplest approaches to feature selection. Even then, as you can see from the MLflow screenshot, the evaluations were painfully long. I did run some experiments in Colab, e.g. I tried LogisticRegression L1-regularized feature elimination, but it didn't show good results on this dataset, so further research and experiments are TBD.)
7. You can set --find-best-params=True to automatically find the best model parameters (using randomized search).
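To give an idea of what a randomized search over hyperparameters looks like, here is a minimal sketch using scikit-learn's RandomizedSearchCV with a KNN classifier. It is not this package's actual code: the synthetic 7-class data stands in for the Forest CoverType CSV, and the search space, `n_iter`, `cv` and scoring metric are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the Forest CoverType data (7 classes).
# Assumption: used only so the sketch runs without the real CSV.
X, y = make_classification(
    n_samples=700,
    n_features=10,
    n_informative=6,
    n_classes=7,
    random_state=42,
)

# Hypothetical search space; these ranges are illustrative,
# not the ones used by the train command.
param_distributions = {
    "n_neighbors": list(range(1, 30)),
    "weights": ["uniform", "distance"],
}

search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions=param_distributions,
    n_iter=10,  # number of random parameter combinations to evaluate
    cv=3,  # 3-fold cross-validation for each combination
    scoring="accuracy",
    random_state=42,
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

Unlike a full grid search, only `n_iter` combinations are evaluated, which keeps the run time bounded on a slow machine at the cost of possibly missing the best combination.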

Install all requirements (including dev requirements) into the Poetry environment:
poetry install
Now you can use the developer tools.
I put pandas-profiling in the dev dependencies because it takes a long time to install.
You can run generate-eda-report.py to profile the dataset; the report will be stored in the report/ folder.
Format your code with the black formatter:
poetry run black src tests
Check if your code is PEP8 compliant with Flake8:
poetry run flake8 src/
