A generic decision tree classifier, which generates, prunes and visualises a decision tree based on an unseen dataset.
This package allows the user to build a decision tree from a previously unseen dataset. Once the tree is built the user can test the accuracy of the tree, predict the class label of an unclassified datapoint and create a labeled visualisation of the decision tree. An additional feature allows the user to split their dataset into training, test and validation sets prior to building the decision tree.
To get started with this package clone this repo:
git clone https://github.com/mwolinska/Decision-Tree
Then enter the correct directory on your machine:
cd Decision-Tree
This package uses the Poetry dependency manager. To install all dependencies, run:
poetry install
The file structure we propose is outlined below. This structure was used to generate the suggested commands.
.
├── Decision-Tree
│   └── Decision_Tree
└── Decision-Tree-Data
    └── Iris-Dataset
        └── iris.csv
Currently, this package only accepts datasets in csv format, and the following conditions need to be met:
- data labels are in the last column, in string format,
- feature labels are in the first row of the data,
- feature data is numerical.
The dataset can be saved anywhere, as its path is passed as an argument.
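The conditions above can be checked before a dataset is passed to the package. The following is a minimal sketch using only the Python standard library; check_dataset and _is_number are hypothetical helper names and are not part of this package.

```python
import csv
import io


def _is_number(text):
    """Return True if the cell parses as a float."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def check_dataset(csv_file):
    """Return a list of problems found; an empty list means the file looks usable.

    Hypothetical helper, not part of the package: it only mirrors the three
    conditions listed in this README.
    """
    rows = [row for row in csv.reader(csv_file) if row]  # skip blank lines
    if len(rows) < 2:
        return ["expected a header row followed by at least one data row"]
    header, data = rows[0], rows[1:]
    problems = []
    # The first row should hold names, so none of its cells should parse as numbers.
    if any(_is_number(cell) for cell in header):
        problems.append("first row looks like data, not feature labels")
    for row_no, row in enumerate(data, start=1):
        if len(row) != len(header):
            problems.append(f"data row {row_no}: expected {len(header)} columns, got {len(row)}")
            continue
        *features, label = row
        if not all(_is_number(value) for value in features):
            problems.append(f"data row {row_no}: non-numerical feature value")
        if _is_number(label):
            problems.append(f"data row {row_no}: label is numerical, expected a string")
    return problems


example = "sepal_length,sepal_width,species\n5.1,3.5,setosa\n4.9,3.0,setosa\n"
print(check_dataset(io.StringIO(example)))
```

To check a file on disk, pass an open file handle instead of the StringIO object, for example open("iris.csv", newline="").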
The cli is triggered by the decision-tree command, which launches the cli script. The cli has 3 available commands: run, load, and a default command that lists the available commands.
An example run using the iris dataset is outlined below.
The run command takes a full dataset (in csv format) and separates it into training, validation and test sets. It then generates a decision tree based on the training data. Its optional arguments include prune (-p, --prune) and draw-tree (-d, --draw-tree).
To create a decision tree based on a dataset such as iris.csv and save it as a pickle file (here "decision_tree.pickle"), the following command can be run:
decision-tree run <your-csv-file> <path-to-save-your-tree>/decision_tree.pickle
To set either the prune or draw-tree variables, use one of the following syntaxes:
decision-tree run <your-csv-file> -p False -d <path-to-save-your-visuals>/<desired-folder-name>/
Or:
decision-tree run <your-csv-file> --prune False --draw-tree <path-to-save-your-visuals>/<desired-folder-name>/
Once a run is completed, if the draw-tree argument was set to True the decision tree will be saved under "tree_visual.pdf" in the project directory. If the feature and label names are added to the training dataset, those are included in the tree visualisation. The tree generated using the run above would look like this:
If the prune variable is set to True the pruned tree visualisation will be saved under "pruned_tree.pdf" in the project directory. For this run it would look like this:
If the feature names are not included in the dataset the tree will be labeled using column indices as feature numbers. This image is generated using a different run than those above.
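For intuition about the splitting step that the run command performs before building the tree, the sketch below shuffles a dataset and cuts it into training, validation and test sets. It is an illustration only: split_dataset, the 20% fractions and the seed are assumptions and do not describe the package's actual implementation or defaults.

```python
import random


def split_dataset(rows, test_fraction=0.2, validation_fraction=0.2, seed=0):
    """Shuffle rows and cut them into (training, validation, test) lists.

    The fractions and seed are illustrative assumptions, not the package's defaults.
    """
    shuffled = list(rows)                      # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)      # seeded for a reproducible split
    n_test = int(len(shuffled) * test_fraction)
    n_validation = int(len(shuffled) * validation_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    training = shuffled[n_test + n_validation:]
    return training, validation, test


# 150 stand-in rows, the size of the iris dataset.
rows = list(range(150))
training, validation, test = split_dataset(rows)
print(len(training), len(validation), len(test))
```

Keeping the three sets disjoint matters because the validation set is typically what pruning is evaluated against, while the test set is held back for the final accuracy estimate.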
The load command allows the user to load an existing decision tree (in pickle format) and generate predictions for a dataset. The required arguments are the pickle file, the csv file of samples and the path where the predictions should be saved.
An example run would look like this:
decision-tree load <your-pickle-file> <your-samples-csv-file> <path-to-save-your-predictions>/predictions.csv
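Conceptually, loading a pickled tree and predicting with it amounts to deserialising the object and walking it from the root to a leaf for each sample. The sketch below shows that idea with a stand-in tree stored as nested dictionaries; the dictionary layout, the predict function and the thresholds are all hypothetical and are not the format this package writes to its pickle files.

```python
import pickle

# Hypothetical stand-in tree with made-up thresholds: internal nodes are dicts,
# leaves are class labels. This is NOT the package's own pickle format.
tree = {
    "feature": 2, "threshold": 2.5,
    "below": "setosa",
    "above": {
        "feature": 3, "threshold": 1.75,
        "below": "versicolor",
        "above": "virginica",
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, following one threshold test per node."""
    while isinstance(node, dict):
        branch = "below" if sample[node["feature"]] < node["threshold"] else "above"
        node = node[branch]
    return node


saved = pickle.dumps(tree)     # serialise, as saving to a .pickle file would
loaded = pickle.loads(saved)   # deserialise, as loading the file back would

samples = [[5.1, 3.5, 1.4, 0.2], [6.7, 3.0, 5.2, 2.3]]
predictions = [predict(loaded, sample) for sample in samples]
print(predictions)
```

Because unpickling can execute arbitrary code, pickle files should only be loaded when they come from a trusted source.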
The default command lists the available commands.
A docker image of the package is available here.
To download the docker image run:
docker pull mwolinska/decision-tree:latest
To load and save data outside of the docker container, it is necessary to mount a directory from your machine into the container. The following command executes decision-tree run inside the container, saves the output and generates the visuals.
docker run \
-v $(pwd)/Decision-Tree-Data:/workdir/All-Data \
-it mwolinska/decision-tree:latest \
run /workdir/All-Data/Iris-Dataset/iris.csv /workdir/All-Data/Iris-Dataset/test.pickle \
-d /workdir/All-Data/Iris-Dataset/visual/
This will result in the following files being generated:
Decision-Tree-Data
└── Iris-Dataset
    ├── iris.csv
    ├── test.pickle
    └── visual
        ├── pruned_tree
        ├── pruned_tree.pdf
        ├── unpruned_tree_visual
        └── unpruned_tree_visual.pdf