diff --git a/README.md b/README.md index 524877d01e28..ce2e5ac110e8 100644 --- a/README.md +++ b/README.md @@ -20,6 +20,7 @@ The same code runs on major distributed environment(Hadoop, SGE, MPI) and can so What's New ---------- +* [Distributed XGBoost on AWS with YARN](https://xgboost.readthedocs.org/en/latest/tutorial/aws_yarn.html) * [XGBoost brick](NEWS.md) Release diff --git a/demo/binary_classification/mushroom.conf b/demo/binary_classification/mushroom.conf index d2566f132bf0..435c9bf8dda3 100644 --- a/demo/binary_classification/mushroom.conf +++ b/demo/binary_classification/mushroom.conf @@ -6,24 +6,24 @@ objective = binary:logistic # Tree Booster Parameters # step size shrinkage -eta = 1.0 +eta = 1.0 # minimum loss reduction required to make a further partition -gamma = 1.0 +gamma = 1.0 # minimum sum of instance weight(hessian) needed in a child -min_child_weight = 1 +min_child_weight = 1 # maximum depth of a tree -max_depth = 3 +max_depth = 3 # Task Parameters # the number of round to do boosting num_round = 2 # 0 means do not save any model except the final round model -save_period = 0 +save_period = 0 # The path of training data -data = "agaricus.txt.train" +data = "agaricus.txt.train" # The path of validation data, used to monitor training process, here [test] sets name of the validation set -eval[test] = "agaricus.txt.test" +eval[test] = "agaricus.txt.test" # evaluate on training data as well each round eval_train = 1 -# The path of test data -test:data = "agaricus.txt.test" +# The path of test data +test:data = "agaricus.txt.test" diff --git a/demo/distributed-training/README.md b/demo/distributed-training/README.md index 3926612cc2b3..57831736ca9e 100644 --- a/demo/distributed-training/README.md +++ b/demo/distributed-training/README.md @@ -10,43 +10,14 @@ Build XGBoost with Distributed Filesystem Support To use distributed xgboost, you only need to turn the options on to build with distributed filesystems(HDFS or S3) in ```xgboost/make/config.mk```. -How to Use ----------- -* Input data format: LIBSVM format. The example here uses generated data in ../data folder. -* Put the data into some distribute filesytem (S3 or HDFS) -* Use tracker script in dmlc-core/tracker to submit the jobs -* Like all other DMLC tools, xgboost support taking a path to a folder as input argument - - All the files in the folder will be used as input -* Quick start in Hadoop YARN: run ```bash run_yarn.sh ``` -Example -------- -* [run_yarn.sh](run_yarn.sh) shows how to submit job to Hadoop via YARN. +Step by Step Tutorial on AWS +---------------------------- +Checkout [this tutorial](https://xgboost.readthedocs.org/en/latest/tutorial/aws_yarn.html) for running distributed xgboost. -Single machine vs Distributed Version -------------------------------------- -If you have used xgboost (single machine version) before, this section will show you how to run xgboost on hadoop with a slight modification on conf file. -* IO: instead of reading and writing file locally, we now use HDFS, put ```hdfs://``` prefix to the address of file you like to access -* File cache: ```dmlc_yarn.py``` also provide several ways to cache necesary files, including binary file (xgboost), conf file - - ```dmlc_yarn.py``` will automatically cache files in the command line. For example, ```dmlc_yarn.py -n 3 $localPath/xgboost.dmlc mushroom.hadoop.conf``` will cache "xgboost.dmlc" and "mushroom.hadoop.conf". 
- - You could also use "-f" to manually cache one or more files, like ```-f file1 -f file2``` - - The local path of cached files in command is "./". -* More details of submission can be referred to the usage of ```dmlc_yarn.py```. -* The model saved by hadoop version is compatible with single machine version. -Notes ------ -* The code is optimized with multi-threading, so you will want to run xgboost with more vcores for best performance. - - You will want to set to be number of cores you have on each machine. - - -External Memory Version ------------------------ -XGBoost supports external memory, this will make each process cache data into local disk during computation, without taking up all the memory for storing the data. -See [external memory](https://github.com/dmlc/xgboost/tree/master/doc/external_memory.md) for syntax using external memory. - -You only need to add cacheprefix to the input file to enable external memory mode. For example set training data as -``` -data=hdfs:///path-to-my-data/#dtrain.cache -``` -This will make xgboost more memory efficient, allows you to run xgboost on larger-scale dataset. +Model Analysis +-------------- +XGBoost is exchangable across all bindings and platforms. +This means you can use python or R to analyze the learnt model and do prediction. +For example, you can use the [plot_model.ipynb](plot_model.ipynb) to visualize the learnt model. diff --git a/demo/distributed-training/mushroom.aws.conf b/demo/distributed-training/mushroom.aws.conf new file mode 100644 index 000000000000..04283768c33a --- /dev/null +++ b/demo/distributed-training/mushroom.aws.conf @@ -0,0 +1,27 @@ +# General Parameters, see comment for each definition +# choose the booster, can be gbtree or gblinear +booster = gbtree +# choose logistic regression loss function for binary classification +objective = binary:logistic + +# Tree Booster Parameters +# step size shrinkage +eta = 1.0 +# minimum loss reduction required to make a further partition +gamma = 1.0 +# minimum sum of instance weight(hessian) needed in a child +min_child_weight = 1 +# maximum depth of a tree +max_depth = 3 + +# Task Parameters +# the number of round to do boosting +num_round = 2 +# 0 means do not save any model except the final round model +save_period = 0 +# The path of training data +data = "s3://mybucket/xgb-demo/train" +# The path of validation data, used to monitor training process, here [test] sets name of the validation set +# evaluate on training data as well each round +eval_train = 1 + diff --git a/demo/distributed-training/plot_model.ipynb b/demo/distributed-training/plot_model.ipynb new file mode 100644 index 000000000000..5cfccce20a70 --- /dev/null +++ b/demo/distributed-training/plot_model.ipynb @@ -0,0 +1,107 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# XGBoost Model Analysis\n", + "\n", + "This notebook can be used to load and anlysis model learnt from all xgboost bindings, including distributed training. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import sys\n", + "import os\n", + "%matplotlib inline " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Please change the ```pkg_path``` and ```model_file``` to be correct path" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "pkg_path = '../../python-package/'\n", + "model_file = 's3://my-bucket/xgb-demo/model/0002.model'\n", + "sys.path.insert(0, pkg_path)\n", + "import xgboost as xgb" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Plot the Feature Importance" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# plot the first two trees.\n", + "bst = xgb.Booster(model_file=model_file)\n", + "xgb.plot_importance(bst)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Plot the First Tree" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "tree_id = 0\n", + "xgb.to_graphviz(bst, tree_id)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "language": "python", + "name": "python2" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.3" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/demo/distributed-training/run_aws.sh b/demo/distributed-training/run_aws.sh new file mode 100644 index 000000000000..0b7cb17d29c6 --- /dev/null +++ b/demo/distributed-training/run_aws.sh @@ -0,0 +1,11 @@ +# This is the example script to run distributed xgboost on AWS. 
+# Change the following line for configuration
+
+export BUCKET=mybucket
+
+# submit the job to YARN
+../../dmlc-core/tracker/dmlc-submit --cluster=yarn --num-workers=2 --worker-cores=2\
+    ../../xgboost mushroom.aws.conf nthread=2\
+    data=s3://${BUCKET}/xgb-demo/train\
+    eval[test]=s3://${BUCKET}/xgb-demo/test\
+    model_dir=s3://${BUCKET}/xgb-demo/model
diff --git a/demo/distributed-training/run_yarn.sh b/demo/distributed-training/run_yarn.sh
deleted file mode 100755
index 3d7c6bf05b87..000000000000
--- a/demo/distributed-training/run_yarn.sh
+++ /dev/null
@@ -1,33 +0,0 @@
-#!/bin/bash
-if [ "$#" -lt 3 ];
-then
-    echo "Usage: "
-    exit -1
-fi
-
-# put the local training file to HDFS
-hadoop fs -mkdir $3/data
-hadoop fs -put ../data/agaricus.txt.train $3/data
-hadoop fs -put ../data/agaricus.txt.test $3/data
-
-# running rabit, pass address in hdfs
-../../dmlc-core/tracker/dmlc_yarn.py -n $1 --vcores $2 ../../xgboost mushroom.hadoop.conf nthread=$2\
-    data=hdfs://$3/data/agaricus.txt.train\
-    eval[test]=hdfs://$3/data/agaricus.txt.test\
-    model_out=hdfs://$3/mushroom.final.model
-
-# get the final model file
-hadoop fs -get $3/mushroom.final.model final.model
-
-# use dmlc-core/yarn/run_hdfs_prog.py to setup approperiate env
-
-# output prediction task=pred
-#../../xgboost.dmlc mushroom.hadoop.conf task=pred model_in=final.model test:data=../data/agaricus.txt.test
-../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=pred model_in=final.model test:data=../data/agaricus.txt.test
-# print the boosters of final.model in dump.raw.txt
-#../../xgboost.dmlc mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt
-../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt
-# use the feature map in printing for better visualization
-#../../xgboost.dmlc mushroom.hadoop.conf task=dump model_in=final.model fmap=../data/featmap.txt name_dump=dump.nice.txt
-../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=dump model_in=final.model fmap=../data/featmap.txt name_dump=dump.nice.txt
-cat dump.nice.txt
diff --git a/dmlc-core b/dmlc-core
index 0f8fd38bf94e..38ee75d95ff2 160000
--- a/dmlc-core
+++ b/dmlc-core
@@ -1 +1 @@
-Subproject commit 0f8fd38bf94e6666aa367be80195b1f2da87428c
+Subproject commit 38ee75d95ff23e4e1febacc89e08975d9b6c6c3a
diff --git a/doc/index.md b/doc/index.md
index 58f3d7d5f1ce..ab8d50c49a10 100644
--- a/doc/index.md
+++ b/doc/index.md
@@ -23,7 +23,7 @@ This section contains users guides that are general across languages.
 * [Installation Guide](build.md)
 * [Introduction to Boosted Trees](model.md)
-* [Distributed Training](../demo/distributed-training)
+* [Distributed Training Tutorial](tutorial/aws_yarn.md)
 * [Frequently Asked Questions](faq.md)
 * [External Memory Version](external_memory.md)
 * [Learning to use XGBoost by Example](../demo)
diff --git a/doc/tutorial/aws_yarn.md b/doc/tutorial/aws_yarn.md
new file mode 100644
index 000000000000..fb1dcd8faef4
--- /dev/null
+++ b/doc/tutorial/aws_yarn.md
@@ -0,0 +1,187 @@
+Distributed XGBoost YARN on AWS
+===============================
+This is a step-by-step tutorial on how to set up and run distributed [XGBoost](https://github.com/dmlc/xgboost)
+on an AWS EC2 cluster. Distributed XGBoost runs on various platforms such as MPI, SGE and Hadoop YARN.
+In this tutorial, we use YARN as an example since it is a widely used solution for distributed computing.
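+
+For orientation, the sketch below condenses the workflow of this tutorial into a single sequence.
+All commands are taken from the sections that follow; ```mykey```, ```mypem.pem``` and ```mybucket``` are the
+placeholder names used throughout, so substitute your own key pair and bucket.
+
+```bash
+# 1. get the cluster launcher and start a YARN cluster on EC2 (see "Setup a Hadoop YARN Cluster")
+git clone https://github.com/tqchen/yarn-ec2
+./yarn-ec2 -k mykey -i mypem.pem launch xgboost
+
+# 2. log into the master and build xgboost with USE_S3=1 (see "Build XGBoost with S3")
+./yarn-ec2 -k mykey -i mypem.pem login xgboost
+
+# 3. copy the demo data to s3://mybucket/xgb-demo/ with s3cmd (see "Host the Data on S3")
+# 4. submit the training job with dmlc-core/tracker/dmlc-submit (see "Submit the Jobs")
+```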
+
+Prerequisite
+------------
+We need to get an [AWS key-pair](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html)
+to access the AWS services. Let us assume that we are using a key ```mykey``` and the corresponding permission file ```mypem.pem```.
+
+We also need [AWS credentials](http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSGettingStartedGuide/AWSCredentials.html),
+which include an `ACCESS_KEY_ID` and a `SECRET_ACCESS_KEY`.
+
+Finally, we will need an S3 bucket to host the data and the model, ```s3://mybucket/```.
+
+Setup a Hadoop YARN Cluster
+---------------------------
+This section shows how to start a Hadoop YARN cluster from scratch.
+You can skip this step if you already have one.
+We will be using [yarn-ec2](https://github.com/tqchen/yarn-ec2) to start the cluster.
+
+First, clone the yarn-ec2 script with the following command.
+```bash
+git clone https://github.com/tqchen/yarn-ec2
+```
+
+To use the script, we must set the environment variables `AWS_ACCESS_KEY_ID` and
+`AWS_SECRET_ACCESS_KEY` properly. This can be done by adding the following two lines to
+`~/.bashrc` (replacing the strings with your own keys).
+
+```bash
+export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
+export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
+```
+
+Now we can launch the master machine of the cluster on EC2.
+```bash
+./yarn-ec2 -k mykey -i mypem.pem launch xgboost
+```
+Wait a few minutes until the master machine is up.
+
+Once the master machine is up, we can query its public DNS with the following command.
+```bash
+./yarn-ec2 -k mykey -i mypem.pem get-master xgboost
+```
+It will print the public DNS of the master machine, e.g. ```ec2-xx-xx-xx.us-west-2.compute.amazonaws.com```.
+Now we can open a browser and type (replacing the DNS with the master DNS)
+```
+ec2-xx-xx-xx.us-west-2.compute.amazonaws.com:8088
+```
+This will show the job tracker of the YARN cluster. Note that it may take a few minutes before the master finishes bootstrapping and starts the
+job tracker.
+
+After the master machine is up, we can freely add more slave machines to the cluster.
+The following command adds two m3.xlarge instances to the cluster.
+```bash
+./yarn-ec2 -k mykey -i mypem.pem -t m3.xlarge -s 2 addslave xgboost
+```
+We can also choose to add two spot instances instead.
+```bash
+./yarn-ec2 -k mykey -i mypem.pem -t m3.xlarge -s 2 addspot xgboost
+```
+The slave machines will start up, bootstrap and report to the master.
+You can check whether the slave machines are connected by clicking the Nodes link on the job tracker,
+or by simply opening the following URL (replacing the DNS with the master DNS)
+```
+ec2-xx-xx-xx.us-west-2.compute.amazonaws.com:8088/cluster/nodes
+```
+
+Note that not all the links in the job tracker work.
+This is because many of them use the private IP addresses of AWS, which can only be reached from within EC2.
+We can use an SSH proxy to access these pages.
+Now that we have set up a cluster with one master and two slaves, we are ready to run the experiment.
+
+
+Build XGBoost with S3
+---------------------
+We can log into the master machine with the following command.
+```bash
+./yarn-ec2 -k mykey -i mypem.pem login xgboost
+```
+
+We will be using S3 to host the data and the resulting model, so the data won't get lost after the cluster shuts down.
+To do so, we need to build xgboost with S3 support. The only thing we need to do is to set the ```USE_S3```
+variable to true. This can be done with the following commands.
+
+```bash
+git clone --recursive https://github.com/dmlc/xgboost
+cd xgboost
+cp make/config.mk config.mk
+echo "USE_S3=1" >> config.mk
+make -j4
+```
+Now we have built XGBoost with S3 support. You can also enable HDFS support if you plan to store data on HDFS, by turning on the ```USE_HDFS``` option.
+
+XGBoost also relies on environment variables to access S3, so you will need to add the following lines to `~/.bashrc` (replacing the strings with the correct ones)
+on the master machine as well.
+
+```bash
+export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
+export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
+export BUCKET=mybucket
+```
+
+Host the Data on S3
+-------------------
+In this example, we will copy the example dataset in xgboost to the S3 bucket as input.
+In normal use cases, the dataset is usually created by an existing distributed processing pipeline.
+We can use [s3cmd](http://s3tools.org/s3cmd) to copy the data into the bucket (replace ${BUCKET} with the real bucket name).
+
+```bash
+cd xgboost
+s3cmd put demo/data/agaricus.txt.train s3://${BUCKET}/xgb-demo/train/
+s3cmd put demo/data/agaricus.txt.test s3://${BUCKET}/xgb-demo/test/
+```
+
+Submit the Jobs
+---------------
+Now that everything is ready, we can submit the distributed xgboost job to the YARN cluster.
+We will use the [dmlc-submit](https://github.com/dmlc/dmlc-core/tree/master/tracker) script to submit the job.
+
+Now we can run the following script in the distributed-training folder (replace ${BUCKET} with the real bucket name).
+```bash
+cd xgboost/demo/distributed-training
+# Use dmlc-submit to submit the job.
+../../dmlc-core/tracker/dmlc-submit --cluster=yarn --num-workers=2 --worker-cores=2\
+    ../../xgboost mushroom.aws.conf nthread=2\
+    data=s3://${BUCKET}/xgb-demo/train\
+    eval[test]=s3://${BUCKET}/xgb-demo/test\
+    model_dir=s3://${BUCKET}/xgb-demo/model
+```
+All the configurations such as ```data``` and ```model_dir``` can also be written directly into the configuration file.
+Note that we only specify the folder path, not the individual file names.
+XGBoost will read in all the files in that folder as the training and evaluation data.
+
+In this command, we use two workers, and each worker uses two threads.
+XGBoost can benefit from using multiple cores in each worker.
+A common choice of worker cores ranges from 4 to 8.
+The trained model will be saved into the specified model folder. You can browse the model folder with:
+```
+s3cmd ls s3://${BUCKET}/xgb-demo/model/
+```
+
+The following is example output from distributed training.
+```
+16/02/26 05:41:59 INFO dmlc.Client: jobname=DMLC[nworker=2]:xgboost,username=ubuntu
+16/02/26 05:41:59 INFO dmlc.Client: Submitting application application_1456461717456_0015
+16/02/26 05:41:59 INFO impl.YarnClientImpl: Submitted application application_1456461717456_0015
+2016-02-26 05:42:05,230 INFO @tracker All of 2 nodes getting started
+2016-02-26 05:42:14,027 INFO [05:42:14] [0] test-error:0.016139 train-error:0.014433
+2016-02-26 05:42:14,186 INFO [05:42:14] [1] test-error:0.000000 train-error:0.001228
+2016-02-26 05:42:14,947 INFO @tracker All nodes finishes job
+2016-02-26 05:42:14,948 INFO @tracker 9.71754479408 secs between node start and job finish
+Application application_1456461717456_0015 finished with state FINISHED at 1456465335961
+```
+
+Analyze the Model
+-----------------
+After the model is trained, we can analyse it and use it for future prediction tasks.
+XGBoost is a portable framework: the models are ***exchangeable*** across all platforms.
+This means we can load the trained model in Python/R/Julia and take advantage of the data science pipelines
+in these languages to do model analysis and prediction.
+
+For example, you can use [this ipython notebook](https://github.com/dmlc/xgboost/tree/master/demo/distributed-training/plot_model.ipynb)
+to plot feature importance and visualize the learnt model.
+
+Troubleshooting
+---------------
+
+When you encounter a problem, the best first step is to use the following command
+to get the stdout and stderr logs of the containers, and check what caused the problem.
+```
+yarn logs -applicationId yourAppId
+```
+
+Future Directions
+-----------------
+You have learnt to use distributed XGBoost on YARN in this tutorial.
+XGBoost is a portable and scalable framework for gradient boosting.
+You can check out more examples and resources on the [resources page](https://github.com/dmlc/xgboost/blob/master/demo/README.md).
+
+The project goal is to make the best scalable machine learning solution available to all platforms.
+The API is designed to be portable, and the same code can also run on other platforms such as MPI and SGE.
+XGBoost is actively evolving and we are working on even more exciting features,
+such as a distributed xgboost python/R package. Check out the [RoadMap](https://github.com/dmlc/xgboost/issues/873) for
+more details, and you are more than welcome to contribute to the project.
diff --git a/python-package/xgboost/core.py b/python-package/xgboost/core.py
index 62f32e03c71d..c2840bff5624 100644
--- a/python-package/xgboost/core.py
+++ b/python-package/xgboost/core.py
@@ -880,11 +880,9 @@ def load_model(self, fname):
         fname : string or a memory buffer
             Input file name or memory buffer(see also save_raw)
         """
-        if isinstance(fname, STRING_TYPES):  # assume file name
-            if os.path.exists(fname):
-                _LIB.XGBoosterLoadModel(self.handle, c_str(fname))
-            else:
-                raise ValueError("No such file: {0}".format(fname))
+        if isinstance(fname, STRING_TYPES):
+            # assume file name; cannot use os.path.exists to check, since the file can be from a URL.
+            _check_call(_LIB.XGBoosterLoadModel(self.handle, c_str(fname)))
         else:
             buf = fname
             length = ctypes.c_ulong(len(buf))
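
The `load_model` change above removes the local-file existence check, so a path string is handed directly to the C++ loader; this is what lets the new plot_model.ipynb point `model_file` at an S3 location. For quick reference, below is a minimal script-form sketch of the same model analysis. The model path is a placeholder for the output written by `model_dir` in the tutorial, and reading an `s3://` path directly assumes an xgboost build with `USE_S3=1`; otherwise fetch a local copy first (for example with `s3cmd get`).

```python
import matplotlib.pyplot as plt
import xgboost as xgb

# Placeholder: a local copy of the trained model, e.g. fetched with
#   s3cmd get s3://mybucket/xgb-demo/model/0002.model
# With a USE_S3=1 build, the s3:// path itself can be passed here instead.
model_file = '0002.model'

bst = xgb.Booster(model_file=model_file)  # load the trained booster
xgb.plot_importance(bst)                  # bar chart of feature importance (needs matplotlib)
plt.show()

print(bst.get_dump()[0])                  # text dump of the first tree
```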