Decision-Tree

A generic decision tree classifier, which generates, prunes and visualises a decision tree based on an unseen dataset.

Introduction

This package allows the user to build a decision tree from a previously unseen dataset. Once the tree is built, the user can test its accuracy, predict the class label of an unclassified datapoint and create a labeled visualisation of the tree. An additional feature allows the user to split their dataset into training, test and validation sets before building the decision tree.

Getting started with the package

To get started with this package, clone this repo:

git clone https://github.com/mwolinska/Decision-Tree

Then enter the correct directory on your machine:

cd Decision-Tree

This package uses the Poetry dependency manager. To install all dependencies, run:

poetry install

Using the package

File structure

The file structure we propose is outlined below. This structure was used to generate the suggested commands.

.
├── Decision-Tree
│   └──Decision_Tree
└── Decision-Tree-Data
    └── Iris-Dataset
        └── iris.csv

Dataset format

Currently, this package only accepts datasets in csv format, where the following conditions need to be met:

  • data labels are the last column in string format,
  • feature labels are in the first row of the data,
  • feature data is numerical.

The dataset can be saved anywhere, as it is passed as an argument.
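
The conditions above can be checked before a dataset is passed to the package. The following is a minimal sketch using only the Python standard library; the function name check_dataset_format and the sample column names are hypothetical and are not part of this package.

```python
import csv
import io


def check_dataset_format(csv_text):
    """Return a list of problems found; an empty list means the format looks valid."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if len(rows) < 2:
        return ["expected a row of feature labels followed by at least one data row"]

    header, data = rows[0], rows[1:]
    problems = []
    for index, row in enumerate(data, start=1):
        if len(row) != len(header):
            problems.append(f"data row {index}: expected {len(header)} columns, found {len(row)}")
            continue
        # Every column except the last must parse as a number.
        for name, value in zip(header[:-1], row[:-1]):
            try:
                float(value)
            except ValueError:
                problems.append(f"data row {index}: feature '{name}' is not numerical: {value!r}")
        # The last column holds the class label as a non-empty string.
        if not row[-1].strip():
            problems.append(f"data row {index}: missing class label")
    return problems


# Illustrative data only; the second data row has a non-numerical feature.
sample = "sepal_length,sepal_width,label\n5.1,3.5,setosa\n4.9,abc,setosa\n"
print(check_dataset_format(sample))
```

An empty list means the file satisfies all three conditions; otherwise each entry names the offending data row.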

Available commands

The CLI is invoked with the decision-tree command, which launches the CLI script. It has 3 available commands: run, load and help.

An example run using the iris dataset is outlined below.

Run command

This command takes a full dataset (in csv format) and separates it into training, validation and test sets. It then generates a decision tree based on the training data. Its optional arguments, prune and draw-tree, are shown in the examples that follow.

To create a decision tree from a csv dataset (for example iris.csv) and save it in pickle format, the following command can be run:

decision-tree run <your-csv-file> <path-to-save-your-tree>/decision_tree.pickle
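
The split into training, validation and test sets that the run command performs can be pictured with the sketch below. It is an illustration only: the function name split_dataset, the 60/20/20 proportions and the fixed seed are assumptions, not the package's actual implementation or defaults.

```python
import random


def split_dataset(rows, test_fraction=0.2, validation_fraction=0.2, seed=0):
    """Shuffle rows and split them into training, validation and test lists."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_validation = int(len(shuffled) * validation_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    training = shuffled[n_test + n_validation:]
    return training, validation, test


# Ten illustrative rows: two numerical features followed by a string label.
rows = [[float(i), float(i) * 2, "a" if i % 2 else "b"] for i in range(10)]
training, validation, test = split_dataset(rows)
print(len(training), len(validation), len(test))
```

Only the training rows would be used to grow the tree; the validation and test rows stay unseen until pruning and accuracy testing.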

To set either the prune or draw-tree variables, use one of the following syntaxes:

decision-tree run <your-csv-file> -p False -d <path-to-save-your-visuals>/<desired-folder-name>/

Or:

decision-tree run <your-csv-file> --prune False --draw-tree <path-to-save-your-visuals>/<desired-folder-name>/

Once a run is completed, if the draw-tree argument was set to True, the decision tree will be saved under "tree_visual.pdf" in the project directory. If the feature and label names are included in the training dataset, they appear in the tree visualisation. The tree generated using the run above would look like this:

If the prune variable is set to True, the pruned tree visualisation will be saved under "pruned_tree.pdf" in the project directory. For this run it would look like this:

If the feature names are not included in the dataset, the tree will be labeled using column indices as feature numbers. This image was generated using a different run from those above.

Load command

The load command allows the user to load an existing decision tree (in pickle format) and generate predictions for a dataset. The required arguments are the pickle file, the csv file of samples and the path where the predictions are saved.

An example run would look like this:

decision-tree load <your-pickle-file> <your-samples-csv-file> <path-to-save-your-predictions>/predictions.csv
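
Conceptually, the load command restores a pickled tree, walks each sample down to a leaf and writes one label per sample. The sketch below illustrates that flow with a hypothetical nested-dict tree; the dict layout, the predict function and the sample values are assumptions for illustration and do not reflect how this package stores its trees.

```python
import csv
import io
import pickle

# Hypothetical stand-in for a trained tree: internal nodes test one feature
# against a threshold, leaves carry a class label.
toy_tree = {
    "feature": 0,
    "threshold": 2.5,
    "left": {"label": "setosa"},
    "right": {"label": "versicolor"},
}


def predict(node, features):
    """Walk from the root to a leaf and return that leaf's label."""
    while "label" not in node:
        branch = "left" if features[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node["label"]


# Round-trip through pickle, as saving and later loading a tree would.
loaded_tree = pickle.loads(pickle.dumps(toy_tree))

# Illustrative unlabeled samples with a first row of feature labels.
samples_csv = "petal_length,petal_width\n1.4,0.2\n4.7,1.4\n"
reader = csv.reader(io.StringIO(samples_csv))
next(reader)  # skip the row of feature labels
predictions = [predict(loaded_tree, [float(v) for v in row]) for row in reader]

# Write one predicted label per line, in csv format.
output = io.StringIO()
csv.writer(output).writerows([label] for label in predictions)
print(predictions)
```

In-memory buffers stand in for the pickle, samples and predictions files so that the sketch runs without touching disk. Note that pickle files should only be loaded from sources you trust.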

Help command

The default command, which lists the available commands.

Using the package with docker

A docker image of the package is available on DockerHub.

To download the docker image, run:

docker pull mwolinska/decision-tree:latest

To load and save data outside of the docker image, it is necessary to mount a directory from your machine into the docker container. The following command runs the decision-tree run command, saves the output and generates the visuals.

docker run \
  -v "$(pwd)/Decision-Tree-Data":/workdir/All-Data \
  -it mwolinska/decision-tree:latest \
  run /workdir/All-Data/Iris-Dataset/iris.csv /workdir/All-Data/Iris-Dataset/test.pickle \
  -d /workdir/All-Data/Iris-Dataset/visual/

This will result in the following files being generated:

Decision-Tree-Data
└── Iris-Dataset
    ├── iris.csv
    ├── test.pickle
    └── visual
        ├── pruned_tree
        ├── pruned_tree.pdf
        ├── unpruned_tree_visual
        └── unpruned_tree_visual.pdf
