Team: Qin Lyu, Bingfeng Xia, MingLong Wu, Hengte Lin
The goal of this project is to apply parallel computation techniques taught in the Harvard CS205 course to a data science or computational science problem. In general, massive parallel computation allows an algorithm to accommodate larger datasets or to perform large-scale computation efficiently.
Gradient boosting decision tree (GBDT) is a machine learning technique that can be used for regression and classification problems [1]. GBDT was proposed by Jerome H. Friedman [2,3]; it constructs an additive model using decision trees as weak learners. Advantages of GBDT include its ability to model feature interactions and to perform inherent feature selection.
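At a high level (this is the standard formulation from [2,3], not notation specific to this project), the model is built stage-wise: each new tree is fit to the pseudo-residuals, i.e. the negative gradient of the loss at the current prediction:

```latex
% Stage-wise additive model with learning rate (shrinkage) \nu.
F_m(x) = F_{m-1}(x) + \nu \, h_m(x),
\qquad
r_i^{(m)} = -\left[ \frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)} \right]_{F = F_{m-1}}
% The regression tree h_m is fit to the pseudo-residuals r_i^{(m)}.
```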
GBDT has been applied to a wide variety of applications, including physics [4], computer vision [5], and web-search ranking [6, 7]. To cope with the increasing scale of datasets, a distributed and parallel version of GBDT will be implemented in this project. The scaling properties of the implemented system will be tested on a real-world application.
A decision tree is a prediction model based on thresholds. According to Wikipedia, "A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm."
Here is an example:
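As a minimal illustration (this is a sketch, not code from this project), an internal node compares one feature against a threshold and a leaf stores the prediction:

```cpp
#include <memory>
#include <vector>

// Minimal illustrative decision-tree node (not DABTREE's actual data structure).
// Internal nodes compare one feature against a threshold; leaves store a prediction.
struct TreeNode {
    int feature = -1;              // feature index to split on (unused at a leaf)
    double threshold = 0.0;        // split threshold
    double value = 0.0;            // prediction stored at a leaf
    std::unique_ptr<TreeNode> left, right;
};

// Walk from the root to a leaf by repeated threshold comparisons.
double predict(const TreeNode& node, const std::vector<double>& x) {
    if (!node.left) return node.value;  // leaf reached
    const TreeNode& next = (x[node.feature] <= node.threshold) ? *node.left : *node.right;
    return predict(next, x);
}
```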
First, one needs to know what an ensemble is. An ensemble is a technique that combines several models and lets them vote for the final result. There are two main kinds of ensembles: bagging and boosting. In bagging, the models are independent of each other. In gradient boosting, each model iteratively learns what is left over from the previous models.
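The contrast can be written compactly (standard textbook formulation, not specific to DABTREE):

```latex
% Bagging: the M models are trained independently (e.g. on bootstrap samples) and averaged.
F_{\mathrm{bag}}(x) = \frac{1}{M} \sum_{m=1}^{M} h_m(x)
% Boosting: each new model is trained on what the current ensemble still gets wrong;
% for squared-error loss these are simply the residuals y_i - F_{m-1}(x_i).
F_m(x) = F_{m-1}(x) + h_m(x), \qquad h_m \ \text{fit to} \ \{\, y_i - F_{m-1}(x_i) \,\}_{i=1}^{n}
```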
A distributed asynchronous stochastic gradient boosting system will be implemented and tested on an HPC cluster. In addition, parallel computing will be used to accelerate training on each node.
The data are first divided into packs and saved as Google Protocol Buffers (protobuf) files. The protobuf files are then read by trainers in batches. A trainer iteratively grows nodes on a tree until a size limit is reached, then reads the next data batch and trains the next tree.
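A rough sketch of this per-worker loop is shown below; the type and function names (DataBatch, readNextBatch, growBestNode) are hypothetical stand-ins, not DABTREE's actual API:

```cpp
#include <utility>
#include <vector>

// Hypothetical stand-ins for the protobuf-backed data batch and the tree type.
struct DataBatch { std::vector<std::vector<double>> features; std::vector<double> labels; };
struct Tree { int numNodes = 0; };

// Stub: in the real system this would deserialize the next protobuf pack file.
DataBatch readNextBatch() { return DataBatch{}; }

// Stub: in the real system this would find and apply the best split, returning
// false once no further split improves the loss.
bool growBestNode(Tree& tree, const DataBatch& batch) { return false; }

// Per-worker loop described above: one data batch per tree, growing nodes
// greedily until a node-count limit is reached.
std::vector<Tree> trainWorker(int numTrees, int maxNodes) {
    std::vector<Tree> ensemble;
    for (int t = 0; t < numTrees; ++t) {
        DataBatch batch = readNextBatch();
        Tree tree;
        while (tree.numNodes < maxNodes && growBestNode(tree, batch)) {
            ++tree.numNodes;
        }
        ensemble.push_back(std::move(tree));
    }
    return ensemble;
}
```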
Both data preparation and tree node growing are parallelized using C++ threading.
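For example, the best-split search can be divided across feature columns with std::thread. The following is a simplified sketch under that assumption; the names and the placeholder scan are illustrative, not the project's exact code:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

struct SplitCandidate { int feature = -1; double threshold = 0.0; double gain = 0.0; };

// Scan a contiguous range of feature columns and return the best candidate split.
SplitCandidate bestSplitForFeatures(const std::vector<std::vector<double>>& columns,
                                    int begin, int end) {
    SplitCandidate best;
    for (int f = begin; f < end; ++f) {
        // Placeholder: a real implementation evaluates every threshold in
        // columns[f] and keeps the split with the largest loss reduction (gain).
        (void)columns[f];
    }
    return best;
}

// Each thread searches a disjoint block of features; results are reduced serially.
SplitCandidate parallelBestSplit(const std::vector<std::vector<double>>& columns,
                                 int numThreads) {
    std::vector<SplitCandidate> partial(numThreads);
    std::vector<std::thread> pool;
    const int numFeatures = static_cast<int>(columns.size());
    const int perThread = (numFeatures + numThreads - 1) / numThreads;
    for (int t = 0; t < numThreads; ++t) {
        const int begin = t * perThread;
        const int end = std::min(numFeatures, begin + perThread);
        pool.emplace_back([&columns, &partial, t, begin, end] {
            partial[t] = bestSplitForFeatures(columns, begin, end);
        });
    }
    for (auto& th : pool) th.join();
    SplitCandidate best;
    for (const auto& c : partial)
        if (c.gain > best.gain) best = c;  // keep the global best split
    return best;
}
```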
The algorithm can operate on an OpenMPI cluster. The cluster is designed to have one Master and several Workers. Workers are the nodes that actually read data and build trees, while the Master acts as a trading center to asynchronously exchange trees among Workers.
To reduce the communication overhead of trading trees with the Master, DABTREE supports the option to deliver trees in batches. Delivering trees in batches significantly reduces the cost of communication, at the cost of some accuracy.
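A highly simplified sketch of this Master/Worker exchange with MPI is shown below; the tag, the flat char-buffer serialization, and the function name are placeholders rather than DABTREE's actual wire format:

```cpp
#include <mpi.h>
#include <vector>

// Workers ship locally trained trees (serialized into a byte buffer) to rank 0;
// the Master receives them and, in the real system, forwards them asynchronously
// to the other Workers. Sending several trees per message amortizes the overhead.
const int TAG_TREES = 1;

void exchangeTrees(int rank, std::vector<char>& localTrees, std::vector<char>& remoteTrees) {
    if (rank == 0) {
        // Master: accept a batch of serialized trees from any Worker.
        MPI_Status status;
        MPI_Probe(MPI_ANY_SOURCE, TAG_TREES, MPI_COMM_WORLD, &status);
        int count = 0;
        MPI_Get_count(&status, MPI_CHAR, &count);
        remoteTrees.resize(count);
        MPI_Recv(remoteTrees.data(), count, MPI_CHAR, status.MPI_SOURCE, TAG_TREES,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {
        // Worker: send the locally grown trees to the Master in one message.
        MPI_Send(localTrees.data(), static_cast<int>(localTrees.size()), MPI_CHAR,
                 0, TAG_TREES, MPI_COMM_WORLD);
    }
}
```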
A gradient boosting decision tree algorithm was first implemented in C++ using a serial computation approach. A parallel version using MPI and C++ threading was also implemented. Initial tests compare training time against existing software packages, including the gradient boosting implementation in sklearn and XGBoost. Computation times are listed below. All algorithms achieve similar accuracies.
1. Pull the repo on the master branch.
2. cd into the "/build" folder.
3. Run "cmake.sh".
4. cd back to the root folder.
5. Run "mpirun -n #numNodes ./cluster --round=#numTreePerNode --threads=#threadPerWorker --batch=#samplingSize --bundle=#batchFunctionality", where #batchFunctionality is true or false and toggles batched tree delivery.

Note: if batched tree delivery is enabled, numTreePerNode must be a multiple of 10.
The scalability and efficiency of the algorithm were tested. Training speed and prediction accuracy are comparable with an existing gradient boosting tree library.
The algorithm is faster than sklearn even when using a single thread, and achieves a further multi-fold speedup with multi-threading. We plan to continue developing the repository and make it an open-source project, with the goal of matching or exceeding the speed and accuracy of existing packages.
1. Wikipedia, "Gradient boosting," https://en.wikipedia.org/wiki/Gradient_boosting
2. Friedman, Jerome H. "Greedy function approximation: a gradient boosting machine." Annals of Statistics (2001): 1189-1232.
3. Friedman, Jerome H. "Stochastic gradient boosting." Computational Statistics & Data Analysis 38.4 (2002): 367-378.
4. Roe, Byron P., et al. "Boosted decision trees as an alternative to artificial neural networks for particle identification." Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 543.2 (2005): 577-584.
5. Bissacco, Alessandro, Ming-Hsuan Yang, and Stefano Soatto. "Fast human pose estimation using appearance and motion via multi-dimensional boosting regression." Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on. IEEE, 2007.
6. Mohan, Ananth, Zheng Chen, and Kilian Q. Weinberger. "Web-search ranking with initialized gradient boosted regression trees." Yahoo! Learning to Rank Challenge. 2011.
7. Chapelle, Olivier, and Yi Chang. "Yahoo! learning to rank challenge overview." Yahoo! Learning to Rank Challenge. 2011.