[DIST] Add Distributed XGBoost on AWS Tutorial
tqchen committed Feb 26, 2016
1 parent 61d9edc commit a71ba04
Showing 11 changed files with 355 additions and 86 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -20,6 +20,7 @@ The same code runs on major distributed environment(Hadoop, SGE, MPI) and can so

What's New
----------
* [Distributed XGBoost on AWS with YARN](https://xgboost.readthedocs.org/en/latest/tutorial/aws_yarn.html)
* [XGBoost brick](NEWS.md) Release


18 changes: 9 additions & 9 deletions demo/binary_classification/mushroom.conf
@@ -6,24 +6,24 @@ objective = binary:logistic

# Tree Booster Parameters
# step size shrinkage
eta = 1.0
# minimum loss reduction required to make a further partition
gamma = 1.0
# minimum sum of instance weight (hessian) needed in a child
min_child_weight = 1
# maximum depth of a tree
max_depth = 3

# Task Parameters
# the number of rounds of boosting
num_round = 2
# 0 means do not save any model except the final round model
save_period = 0
# The path of training data
data = "agaricus.txt.train"
# The path of validation data, used to monitor training; here [test] sets the name of the validation set
eval[test] = "agaricus.txt.test"
# evaluate on training data as well each round
eval_train = 1
# The path of test data
test:data = "agaricus.txt.test"
45 changes: 8 additions & 37 deletions demo/distributed-training/README.md
@@ -10,43 +10,14 @@ Build XGBoost with Distributed Filesystem Support
To use distributed xgboost, you only need to turn on the options for building
with distributed filesystems (HDFS or S3) in ```xgboost/make/config.mk```.
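
A minimal sketch of the relevant flags (the names below are assumed from the config template of this era; check your copy of ```make/config.mk```):

```
# in xgboost/make/config.mk: enable the filesystem(s) you need
USE_HDFS = 1
USE_S3 = 1
```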

How to Use
----------
* Input data format: LIBSVM format. The example here uses generated data in the ../data folder.
* Put the data into some distributed filesystem (S3 or HDFS).
* Use the tracker script in dmlc-core/tracker to submit the jobs.
* Like all other DMLC tools, xgboost supports taking a path to a folder as an input argument
  - All the files in the folder will be used as input.
* Quick start in Hadoop YARN: run ```bash run_yarn.sh <n_hadoop_workers> <n_thread_per_worker> <path_in_HDFS>``` (an example invocation follows below).
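
A concrete sketch of the quick start (the worker count, thread count, and HDFS path are illustrative, not prescribed):

```
# 2 YARN workers with 4 threads each; data already uploaded to this HDFS folder
bash run_yarn.sh 2 4 hdfs:///user/me/xgb-demo/mushroom
```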

Example
-------
* [run_yarn.sh](run_yarn.sh) shows how to submit a job to Hadoop via YARN.
Step by Step Tutorial on AWS
----------------------------
Check out [this tutorial](https://xgboost.readthedocs.org/en/latest/tutorial/aws_yarn.html) for running distributed xgboost.

Single machine vs Distributed Version
-------------------------------------
If you have used xgboost (single machine version) before, this section shows how to run xgboost on Hadoop with slight modifications to the conf file. A submission sketch follows this list.
* IO: instead of reading and writing files locally, we now use HDFS; prepend the ```hdfs://``` prefix to the address of any file you want to access.
* File cache: ```dmlc_yarn.py``` also provides several ways to cache necessary files, including the binary (xgboost) and the conf file.
  - ```dmlc_yarn.py``` automatically caches the files appearing in the command line. For example, ```dmlc_yarn.py -n 3 $localPath/xgboost.dmlc mushroom.hadoop.conf``` will cache "xgboost.dmlc" and "mushroom.hadoop.conf".
  - You can also use "-f" to manually cache one or more files, like ```-f file1 -f file2```.
  - The local path of cached files inside the command is "./".
* For more details on submission, see the usage of ```dmlc_yarn.py```.
* The model saved by the Hadoop version is compatible with the single machine version.
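
A sketch of a full submission command combining these pieces (the worker count, HDFS paths, and the extra cached file are illustrative):

```
# -n sets the number of workers; -f caches an extra file in addition to
# the automatically cached binary and conf file
../../dmlc-core/tracker/dmlc_yarn.py -n 3 -f featmap.txt \
    $localPath/xgboost.dmlc mushroom.hadoop.conf \
    data=hdfs:///user/me/mushroom/train \
    model_out=hdfs:///user/me/mushroom/final.model
```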

Notes
-----
* The code is optimized with multi-threading, so you will want to run xgboost with multiple vcores for the best performance.
  - You will want to set <n_thread_per_worker> to the number of cores on each machine.


External Memory Version
-----------------------
XGBoost supports external memory, which makes each process cache data to local disk during computation instead of holding the entire dataset in memory.
See [external memory](https://github.com/dmlc/xgboost/tree/master/doc/external_memory.md) for the syntax of using external memory.

You only need to add a cache prefix to the input file to enable external memory mode. For example, set the training data as
```
data=hdfs:///path-to-my-data/#dtrain.cache
```
This makes xgboost more memory efficient and allows you to run xgboost on larger-scale datasets.
Model Analysis
--------------
XGBoost models are exchangeable across all bindings and platforms.
This means you can use Python or R to analyze the learned model and make predictions.
For example, you can use the [plot_model.ipynb](plot_model.ipynb) to visualize the learnt model.
27 changes: 27 additions & 0 deletions demo/distributed-training/mushroom.aws.conf
@@ -0,0 +1,27 @@
# General Parameters, see comment for each definition
# choose the booster, can be gbtree or gblinear
booster = gbtree
# choose logistic regression loss function for binary classification
objective = binary:logistic

# Tree Booster Parameters
# step size shrinkage
eta = 1.0
# minimum loss reduction required to make a further partition
gamma = 1.0
# minimum sum of instance weight (hessian) needed in a child
min_child_weight = 1
# maximum depth of a tree
max_depth = 3

# Task Parameters
# the number of rounds of boosting
num_round = 2
# 0 means do not save any model except the final round model
save_period = 0
# The path of training data
data = "s3://mybucket/xgb-demo/train"
# The path of validation data (eval[test] sets the name of the validation set) is passed on the command line in run_aws.sh
# evaluate on training data as well each round
eval_train = 1

107 changes: 107 additions & 0 deletions demo/distributed-training/plot_model.ipynb
@@ -0,0 +1,107 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# XGBoost Model Analysis\n",
"\n",
"This notebook can be used to load and anlysis model learnt from all xgboost bindings, including distributed training. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import sys\n",
"import os\n",
"%matplotlib inline "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Please change the ```pkg_path``` and ```model_file``` to be correct path"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"pkg_path = '../../python-package/'\n",
"model_file = 's3://my-bucket/xgb-demo/model/0002.model'\n",
"sys.path.insert(0, pkg_path)\n",
"import xgboost as xgb"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Plot the Feature Importance"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# plot the first two trees.\n",
"bst = xgb.Booster(model_file=model_file)\n",
"xgb.plot_importance(bst)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Plot the First Tree"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"tree_id = 0\n",
"xgb.to_graphviz(bst, tree_id)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
11 changes: 11 additions & 0 deletions demo/distributed-training/run_aws.sh
@@ -0,0 +1,11 @@
#!/bin/bash
# This is an example script to run distributed xgboost on AWS.
# Change the configuration below (bucket name, worker and thread counts) before running.

export BUCKET=mybucket

# submit the job to YARN
../../dmlc-core/tracker/dmlc-submit --cluster=yarn --num-workers=2 --worker-cores=2 \
    ../../xgboost mushroom.aws.conf nthread=2 \
    data=s3://${BUCKET}/xgb-demo/train \
    "eval[test]=s3://${BUCKET}/xgb-demo/test" \
    model_dir=s3://${BUCKET}/xgb-demo/model
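
The data must be in S3 before the job is submitted; a possible sequence (the bucket name, local data paths, and object keys are illustrative) is:

```
# upload the demo data, then submit the YARN job
aws s3 cp ../data/agaricus.txt.train s3://mybucket/xgb-demo/train/part-00000
aws s3 cp ../data/agaricus.txt.test  s3://mybucket/xgb-demo/test/part-00000
bash run_aws.sh
```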
33 changes: 0 additions & 33 deletions demo/distributed-training/run_yarn.sh

This file was deleted.

2 changes: 1 addition & 1 deletion doc/index.md
@@ -23,7 +23,7 @@ This section contains user guides that are general across languages.

* [Installation Guide](build.md)
* [Introduction to Boosted Trees](model.md)
* [Distributed Training](../demo/distributed-training)
* [Distributed Training Tutorial](tutorial/aws_yarn.md)
* [Frequently Asked Questions](faq.md)
* [External Memory Version](external_memory.md)
* [Learning to use XGBoost by Example](../demo)
