[DIST] Add Distributed XGBoost on AWS Tutorial
tqchen committed Feb 26, 2016
1 parent 61d9edc commit a71ba04
Showing 11 changed files with 355 additions and 86 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -20,6 +20,7 @@ The same code runs on major distributed environment(Hadoop, SGE, MPI) and can so

What's New
----------
* [Distributed XGBoost on AWS with YARN](https://xgboost.readthedocs.org/en/latest/tutorial/aws_yarn.html)
* [XGBoost brick](NEWS.md) Release


18 changes: 9 additions & 9 deletions demo/binary_classification/mushroom.conf
@@ -6,24 +6,24 @@ objective = binary:logistic

# Tree Booster Parameters
# step size shrinkage
eta = 1.0
# minimum loss reduction required to make a further partition
gamma = 1.0
# minimum sum of instance weight (hessian) needed in a child
min_child_weight = 1
# maximum depth of a tree
max_depth = 3

# Task Parameters
# the number of rounds of boosting
num_round = 2
# 0 means do not save any model except the final round model
save_period = 0
# The path of training data
data = "agaricus.txt.train"
# The path of validation data, used to monitor training; here [test] sets the name of the validation set
eval[test] = "agaricus.txt.test"
# evaluate on training data as well each round
eval_train = 1
# The path of test data
test:data = "agaricus.txt.test"
45 changes: 8 additions & 37 deletions demo/distributed-training/README.md
@@ -10,43 +10,14 @@ Build XGBoost with Distributed Filesystem Support
To use distributed xgboost, you only need to turn on the options for building
with distributed filesystems (HDFS or S3) in ```xgboost/make/config.mk```.
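
A minimal sketch of the relevant flags (the names below are assumed from the config template of this era; check your copy of ```make/config.mk```):

```
# in xgboost/make/config.mk: enable the filesystem(s) you need
USE_HDFS = 1
USE_S3 = 1
```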

How to Use
----------
* Input data format: LIBSVM format. The example here uses generated data in the ../data folder.
* Put the data into some distributed filesystem (S3 or HDFS).
* Use the tracker script in dmlc-core/tracker to submit the jobs.
* Like all other DMLC tools, xgboost supports taking a path to a folder as an input argument
  - All the files in the folder will be used as input.
* Quick start in Hadoop YARN: run ```bash run_yarn.sh <n_hadoop_workers> <n_thread_per_worker> <path_in_HDFS>``` (an example invocation follows below).
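
A concrete sketch of the quick start (the worker count, thread count, and HDFS path are illustrative, not prescribed):

```
# 2 YARN workers with 4 threads each; data already uploaded to this HDFS folder
bash run_yarn.sh 2 4 hdfs:///user/me/xgb-demo/mushroom
```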

Example
-------
* [run_yarn.sh](run_yarn.sh) shows how to submit a job to Hadoop via YARN.
Step by Step Tutorial on AWS
----------------------------
Check out [this tutorial](https://xgboost.readthedocs.org/en/latest/tutorial/aws_yarn.html) for running distributed xgboost.

Single machine vs Distributed Version
-------------------------------------
If you have used xgboost (single machine version) before, this section shows how to run xgboost on Hadoop with slight modifications to the conf file. A submission sketch follows this list.
* IO: instead of reading and writing files locally, we now use HDFS; prepend the ```hdfs://``` prefix to the address of any file you want to access.
* File cache: ```dmlc_yarn.py``` also provides several ways to cache necessary files, including the binary (xgboost) and the conf file.
  - ```dmlc_yarn.py``` automatically caches the files appearing in the command line. For example, ```dmlc_yarn.py -n 3 $localPath/xgboost.dmlc mushroom.hadoop.conf``` will cache "xgboost.dmlc" and "mushroom.hadoop.conf".
  - You can also use "-f" to manually cache one or more files, like ```-f file1 -f file2```.
  - The local path of cached files inside the command is "./".
* For more details on submission, see the usage of ```dmlc_yarn.py```.
* The model saved by the Hadoop version is compatible with the single machine version.
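
A sketch of a full submission command combining these pieces (the worker count, HDFS paths, and the extra cached file are illustrative):

```
# -n sets the number of workers; -f caches an extra file in addition to
# the automatically cached binary and conf file
../../dmlc-core/tracker/dmlc_yarn.py -n 3 -f featmap.txt \
    $localPath/xgboost.dmlc mushroom.hadoop.conf \
    data=hdfs:///user/me/mushroom/train \
    model_out=hdfs:///user/me/mushroom/final.model
```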

Notes
-----
* The code is optimized with multi-threading, so you will want to run xgboost with multiple vcores for the best performance.
  - You will want to set <n_thread_per_worker> to the number of cores on each machine.


External Memory Version
-----------------------
XGBoost supports external memory, which makes each process cache data to local disk during computation instead of holding the entire dataset in memory.
See [external memory](https://github.com/dmlc/xgboost/tree/master/doc/external_memory.md) for the syntax of using external memory.

You only need to add a cache prefix to the input file to enable external memory mode. For example, set the training data as
```
data=hdfs:///path-to-my-data/#dtrain.cache
```
This makes xgboost more memory efficient and allows you to run xgboost on larger-scale datasets.
Model Analysis
--------------
XGBoost models are exchangeable across all bindings and platforms.
This means you can use Python or R to analyze the learned model and make predictions.
For example, you can use the [plot_model.ipynb](plot_model.ipynb) to visualize the learnt model.
27 changes: 27 additions & 0 deletions demo/distributed-training/mushroom.aws.conf
@@ -0,0 +1,27 @@
# General Parameters, see comment for each definition
# choose the booster, can be gbtree or gblinear
booster = gbtree
# choose logistic regression loss function for binary classification
objective = binary:logistic

# Tree Booster Parameters
# step size shrinkage
eta = 1.0
# minimum loss reduction required to make a further partition
gamma = 1.0
# minimum sum of instance weight (hessian) needed in a child
min_child_weight = 1
# maximum depth of a tree
max_depth = 3

# Task Parameters
# the number of rounds of boosting
num_round = 2
# 0 means do not save any model except the final round model
save_period = 0
# The path of training data
data = "s3://mybucket/xgb-demo/train"
# The path of validation data (eval[test] sets the name of the validation set) is passed on the command line in run_aws.sh
# evaluate on training data as well each round
eval_train = 1

107 changes: 107 additions & 0 deletions demo/distributed-training/plot_model.ipynb
@@ -0,0 +1,107 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# XGBoost Model Analysis\n",
"\n",
"This notebook can be used to load and anlysis model learnt from all xgboost bindings, including distributed training. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import sys\n",
"import os\n",
"%matplotlib inline "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Please change the ```pkg_path``` and ```model_file``` to be correct path"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"pkg_path = '../../python-package/'\n",
"model_file = 's3://my-bucket/xgb-demo/model/0002.model'\n",
"sys.path.insert(0, pkg_path)\n",
"import xgboost as xgb"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Plot the Feature Importance"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# plot the first two trees.\n",
"bst = xgb.Booster(model_file=model_file)\n",
"xgb.plot_importance(bst)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Plot the First Tree"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"tree_id = 0\n",
"xgb.to_graphviz(bst, tree_id)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
11 changes: 11 additions & 0 deletions demo/distributed-training/run_aws.sh
@@ -0,0 +1,11 @@
#!/bin/bash
# This is an example script to run distributed xgboost on AWS.
# Change the configuration below (bucket name, worker and thread counts) before running.

export BUCKET=mybucket

# submit the job to YARN
../../dmlc-core/tracker/dmlc-submit --cluster=yarn --num-workers=2 --worker-cores=2 \
    ../../xgboost mushroom.aws.conf nthread=2 \
    data=s3://${BUCKET}/xgb-demo/train \
    "eval[test]=s3://${BUCKET}/xgb-demo/test" \
    model_dir=s3://${BUCKET}/xgb-demo/model
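
The data must be in S3 before the job is submitted; a possible sequence (the bucket name, local data paths, and object keys are illustrative) is:

```
# upload the demo data, then submit the YARN job
aws s3 cp ../data/agaricus.txt.train s3://mybucket/xgb-demo/train/part-00000
aws s3 cp ../data/agaricus.txt.test  s3://mybucket/xgb-demo/test/part-00000
bash run_aws.sh
```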
33 changes: 0 additions & 33 deletions demo/distributed-training/run_yarn.sh

This file was deleted.

2 changes: 1 addition & 1 deletion doc/index.md
@@ -23,7 +23,7 @@ This section contains user guides that are general across languages.

* [Installation Guide](build.md)
* [Introduction to Boosted Trees](model.md)
* [Distributed Training](../demo/distributed-training)
* [Distributed Training Tutorial](tutorial/aws_yarn.md)
* [Frequently Asked Questions](faq.md)
* [External Memory Version](external_memory.md)
* [Learning to use XGBoost by Example](../demo)
