-
Notifications
You must be signed in to change notification settings - Fork 648
/
sparkml.json
1 lines (1 loc) · 29.7 KB
/
sparkml.json
1
{"paragraphs":[{"text":"%md\n\n# Binary Classification Pipeline in Spark\n\nThis documentation walks through an example of:\n\n- Using Spark MLlib using the Spark ml's pipeline.\n- Interacting with a [local pseudo-distributed hadoop cluster](https://github.com/ethen8181/machine-learning/blob/master/big_data/localhadoop.md).\n- Includes some zeppelin tricks that I've learned along the way.","user":"anonymous","dateUpdated":"2018-03-11T22:25:55-0700","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"<div class=\"markdown-body\">\n<h1>Binary Classification Pipeline in Spark</h1>\n<p>This documentation walks through an example of:</p>\n<ul>\n <li>Using Spark MLlib using the Spark ml’s pipeline.</li>\n <li>Interacting with a <a href=\"https://github.com/ethen8181/machine-learning/blob/master/big_data/localhadoop.md\">local pseudo-distributed hadoop cluster</a>.</li>\n <li>Includes some zeppelin tricks that I’ve learned along the way.</li>\n</ul>\n</div>"}]},"apps":[],"jobName":"paragraph_1520140303995_851123783","id":"20180303-211143_757226532","dateCreated":"2018-03-03T21:11:43-0800","dateStarted":"2018-03-11T22:25:55-0700","dateFinished":"2018-03-11T22:25:55-0700","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:3368"},{"text":"// 1. personal preference, change zeppelin.spark interpreter so only explicit print statements are output\n// https://stackoverflow.com/questions/31841828/how-to-suppress-printing-of-variable-values-in-zeppelin\n\n// 2. avoid nullpointer exception by disabling zeppelin.spark.hive\n// https://stackoverflow.com/questions/43289067/getting-nullpointerexception-when-running-spark-code-in-zeppelin-0-7-1\n\nimport org.apache.spark.sql.SparkSession\nimport org.apache.spark.ml.{Pipeline, PipelineModel}\nimport org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}\nimport org.apache.spark.ml.evaluation.BinaryClassificationEvaluator\nimport org.apache.spark.ml.feature.{StandardScaler, StringIndexer, VectorAssembler}\nimport org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit, TrainValidationSplitModel}\nimport org.apache.spark.sql.{DataFrame, functions => F}\n\nval spark = SparkSession.\n builder.\n master(\"local[*]\").\n appName(\"SparkMLExample\").\n getOrCreate()\nimport spark.implicits._\n","user":"anonymous","dateUpdated":"2018-03-11T22:25:55-0700","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"import org.apache.spark.sql.SparkSession\nimport org.apache.spark.ml.{Pipeline, PipelineModel}\nimport org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}\nimport org.apache.spark.ml.evaluation.BinaryClassificationEvaluator\nimport org.apache.spark.ml.feature.{StandardScaler, StringIndexer, VectorAssembler}\nimport org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit, TrainValidationSplitModel}\nimport org.apache.spark.sql.{DataFrame, functions=>F}\nspark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@7ee1441f\nimport spark.implicits._\n"}]},"apps":[],"jobName":"paragraph_1520083397893_-1694265805","id":"20180303-052317_1258900836","dateCreated":"2018-03-03T05:23:17-0800","dateStarted":"2018-03-11T22:25:55-0700","dateFinished":"2018-03-11T22:25:57-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3369"},{"text":"%md\n\nBefore moving on to using Spark MLlib, we will need to download some data to experiement with. The bash script below downloads the data from the internet and uploads the data to a HDFS path. This is just to gain some experience in using HDFS, feel free to change it to a local path if you wish to avoid the hassle.\n\n```bash\n# you can change the destination to be wherever you like\ndestination=~/machine-learning/big_data/sparkml\ncd destination\n\n# a little python script to download the adult.csv data we'll be using\n# https://github.com/ethen8181/machine-learning/blob/master/big_data/sparkml/get_data.py\npython get_data.py\n\n# upload the data to HDFS, the link below contains instruction on how to set up\n# a pseudo-distributed hadoop cluster on your local machine if interested, or simply\n# skip it and change the path in later code chunk to point to a local path\n# https://github.com/ethen8181/machine-learning/blob/master/big_data/localhadoop.md\nhdfs dfs -copyFromLocal $destination/adult.csv /user/ethen/adult.csv\n```","user":"anonymous","dateUpdated":"2018-03-11T22:25:55-0700","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"<div class=\"markdown-body\">\n<p>Before moving on to using Spark MLlib, we will need to download some data to experiement with. The bash script below downloads the data from the internet and uploads the data to a HDFS path. This is just to gain some experience in using HDFS, feel free to change it to a local path if you wish to avoid the hassle.</p>\n<pre><code class=\"bash\"># you can change the destination to be wherever you like\ndestination=~/machine-learning/big_data/sparkml\ncd destination\n\n# a little python script to download the adult.csv data we'll be using\n# https://github.com/ethen8181/machine-learning/blob/master/big_data/sparkml/get_data.py\npython get_data.py\n\n# upload the data to HDFS, the link below contains instruction on how to set up\n# a pseudo-distributed hadoop cluster on your local machine if interested, or simply\n# skip it and change the path in later code chunk to point to a local path\n# https://github.com/ethen8181/machine-learning/blob/master/big_data/localhadoop.md\nhdfs dfs -copyFromLocal $destination/adult.csv /user/ethen/adult.csv\n</code></pre>\n</div>"}]},"apps":[],"jobName":"paragraph_1520723942907_1058694735","id":"20180310-151902_807152308","dateCreated":"2018-03-10T15:19:02-0800","dateStarted":"2018-03-11T22:25:55-0700","dateFinished":"2018-03-11T22:25:55-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3370"},{"text":"// note that in the hdfs path there's only 1 \"/\" symbol\n// https://stackoverflow.com/questions/27478096/cannot-read-a-file-from-hdfs-using-spark\nval basePath = \"hdfs:/user/ethen/\"\nval data = spark.read.\n option(\"inferSchema\", true).\n option(\"header\", true).\n csv(basePath + \"adult.csv\")\n\nprintln(\"dimension: \" + data.count + \",\" + data.columns.length)\n\n// In Zeppelin we can use z.show(df) to show a prettier formatted table\n// compared to doing [spark dataframe].show\nz.show(data, 6)","user":"anonymous","dateUpdated":"2018-03-11T22:25:55-0700","config":{"colWidth":12,"enabled":true,"results":{"1":{"graph":{"mode":"table","height":300,"optionOpen":true,"setting":{"multiBarChart":{},"stackedAreaChart":{}},"commonSetting":{},"keys":[],"groups":[],"values":[]},"helium":{}}},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"basePath: String = hdfs:/user/ethen/\ndata: org.apache.spark.sql.DataFrame = [age: int, workclass: string ... 13 more fields]\ndimension: 32561,15\n"},{"type":"TABLE","data":"age\tworkclass\tfnlwgt\teducation\teducation_num\tmarital_status\toccupation\trelationship\trace\tsex\tcapital_gain\tcapital_loss\thours_per_week\tnative_country\tincome\n39\tState-gov\t77516\tBachelors\t13\tNever-married\tAdm-clerical\tNot-in-family\tWhite\tMale\t2174\t0\t40\tUnited-States\t<=50K\n50\tSelf-emp-not-inc\t83311\tBachelors\t13\tMarried-civ-spouse\tExec-managerial\tHusband\tWhite\tMale\t0\t0\t13\tUnited-States\t<=50K\n38\tPrivate\t215646\tHS-grad\t9\tDivorced\tHandlers-cleaners\tNot-in-family\tWhite\tMale\t0\t0\t40\tUnited-States\t<=50K\n53\tPrivate\t234721\t11th\t7\tMarried-civ-spouse\tHandlers-cleaners\tHusband\tBlack\tMale\t0\t0\t40\tUnited-States\t<=50K\n28\tPrivate\t338409\tBachelors\t13\tMarried-civ-spouse\tProf-specialty\tWife\tBlack\tFemale\t0\t0\t40\tCuba\t<=50K\n37\tPrivate\t284582\tMasters\t14\tMarried-civ-spouse\tExec-managerial\tWife\tWhite\tFemale\t0\t0\t40\tUnited-States\t<=50K\n<!--TABLE_COMMENT-->\n<font color=red>Results are limited by 6.</font>"}]},"apps":[],"jobName":"paragraph_1520083424752_1577836411","id":"20180303-052344_1101434739","dateCreated":"2018-03-03T05:23:44-0800","dateStarted":"2018-03-11T22:25:55-0700","dateFinished":"2018-03-11T22:26:02-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3371"},{"text":"// in this code chunk, we specify all the columns/features that we'll be using\n// and since spark requires the labels to be of type double, we transform the\n// string column using StringIndexer\nval rawLabelCol = \"income\"\nval catCols = Array(\n \"workclass\", \"education\", \"marital_status\",\n \"occupation\", \"relationship\", \"race\",\n \"sex\", \"native_country\")\nval numCols = Array(\n \"age\", \"fnlwgt\", \"education_num\",\n \"capital_gain\", \"capital_loss\",\n \"hours_per_week\")\n \nval labelCol = \"label\"\nval labelIndexer = new StringIndexer().\n setInputCol(rawLabelCol).\n setOutputCol(labelCol)\nval dataLabelIndexed = labelIndexer.fit(data).transform(data)\n\nz.show(dataLabelIndexed, 6)","user":"anonymous","dateUpdated":"2018-03-11T22:25:55-0700","config":{"colWidth":12,"enabled":true,"results":{"1":{"graph":{"mode":"table","height":300,"optionOpen":false},"helium":{}}},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala","tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"rawLabelCol: String = income\ncatCols: Array[String] = Array(workclass, education, marital_status, occupation, relationship, race, sex, native_country)\nnumCols: Array[String] = Array(age, fnlwgt, education_num, capital_gain, capital_loss, hours_per_week)\nlabelCol: String = label\nlabelIndexer: org.apache.spark.ml.feature.StringIndexer = strIdx_e034071d52b7\ndataLabelIndexed: org.apache.spark.sql.DataFrame = [age: int, workclass: string ... 14 more fields]\n"},{"type":"TABLE","data":"age\tworkclass\tfnlwgt\teducation\teducation_num\tmarital_status\toccupation\trelationship\trace\tsex\tcapital_gain\tcapital_loss\thours_per_week\tnative_country\tincome\tlabel\n39\tState-gov\t77516\tBachelors\t13\tNever-married\tAdm-clerical\tNot-in-family\tWhite\tMale\t2174\t0\t40\tUnited-States\t<=50K\t0.0\n50\tSelf-emp-not-inc\t83311\tBachelors\t13\tMarried-civ-spouse\tExec-managerial\tHusband\tWhite\tMale\t0\t0\t13\tUnited-States\t<=50K\t0.0\n38\tPrivate\t215646\tHS-grad\t9\tDivorced\tHandlers-cleaners\tNot-in-family\tWhite\tMale\t0\t0\t40\tUnited-States\t<=50K\t0.0\n53\tPrivate\t234721\t11th\t7\tMarried-civ-spouse\tHandlers-cleaners\tHusband\tBlack\tMale\t0\t0\t40\tUnited-States\t<=50K\t0.0\n28\tPrivate\t338409\tBachelors\t13\tMarried-civ-spouse\tProf-specialty\tWife\tBlack\tFemale\t0\t0\t40\tCuba\t<=50K\t0.0\n37\tPrivate\t284582\tMasters\t14\tMarried-civ-spouse\tExec-managerial\tWife\tWhite\tFemale\t0\t0\t40\tUnited-States\t<=50K\t0.0\n<!--TABLE_COMMENT-->\n<font color=red>Results are limited by 6.</font>"}]},"apps":[],"jobName":"paragraph_1520081993838_-235957668","id":"20180303-045953_778901477","dateCreated":"2018-03-03T04:59:53-0800","dateStarted":"2018-03-11T22:25:58-0700","dateFinished":"2018-03-11T22:26:04-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3372"},{"text":"// visual check to ensure the cardinality of categorical variables\n// is within a reasonable size\nz.show(data.select(catCols.map(c => countDistinct(c).alias(c)): _*))","user":"anonymous","dateUpdated":"2018-03-11T22:25:55-0700","config":{"colWidth":12,"enabled":true,"results":{"0":{"graph":{"mode":"multiBarChart","height":253.011,"optionOpen":false,"setting":{"multiBarChart":{}},"commonSetting":{},"keys":[],"groups":[],"values":[{"name":"education","index":1,"aggr":"sum"},{"name":"workclass","index":0,"aggr":"sum"},{"name":"marital_status","index":2,"aggr":"sum"},{"name":"occupation","index":3,"aggr":"sum"},{"name":"relationship","index":4,"aggr":"sum"},{"name":"race","index":5,"aggr":"sum"},{"name":"sex","index":6,"aggr":"sum"},{"name":"native_country","index":7,"aggr":"sum"}]},"helium":{}}},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TABLE","data":"workclass\teducation\tmarital_status\toccupation\trelationship\trace\tsex\tnative_country\n9\t16\t7\t15\t6\t5\t2\t42\n"}]},"apps":[],"jobName":"paragraph_1520082011376_1715519401","id":"20180303-050011_1564310069","dateCreated":"2018-03-03T05:00:11-0800","dateStarted":"2018-03-11T22:26:02-0700","dateFinished":"2018-03-11T22:26:06-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3373"},{"text":"val testSize = 0.1\nval splitRandomSeed = 4321L\nval Array(dfTrain, dfTest) = dataLabelIndexed.randomSplit(Array(1 - testSize, testSize), splitRandomSeed)\n\n// cache the dataset in-memory as we'll be reusing them\n// throughout the modeling pipeline\ndfTrain.cache\ndfTest.cache\nprintln(\"training data size: \" + dfTrain.count)\nprintln(\"testing data size: \" + dfTest.count)","user":"anonymous","dateUpdated":"2018-03-11T22:25:55-0700","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"testSize: Double = 0.1\nsplitRandomSeed: Long = 4321\ndfTrain: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [age: int, workclass: string ... 14 more fields]\ndfTest: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [age: int, workclass: string ... 14 more fields]\nres68: dfTrain.type = [age: int, workclass: string ... 14 more fields]\nres69: dfTest.type = [age: int, workclass: string ... 14 more fields]\ntraining data size: 29262\ntesting data size: 3299\n"}]},"apps":[],"jobName":"paragraph_1520107844062_-1847624536","id":"20180303-121044_407600344","dateCreated":"2018-03-03T12:10:44-0800","dateStarted":"2018-03-11T22:26:04-0700","dateFinished":"2018-03-11T22:26:09-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3374"},{"text":"%md\n\nJust like in scikit-learn, spark's machine learning API also introduces the concept of Pipeline. The official documentation listed below has a pretty clear example of how to get a minimal example working.\n\n- [Spark Documentation: ML Pipelines](https://spark.apache.org/docs/latest/ml-pipeline.html)\n\nThe next code chunk aims to implement a scikit-learn like interface (i.e. we can call `.fit` on the data to train the model and `.score` to evaluate it, etc.) that wraps a Spark MLlib pipeline underneath the hood.","user":"anonymous","dateUpdated":"2018-03-11T22:25:55-0700","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"<div class=\"markdown-body\">\n<p>Just like in scikit-learn, spark’s machine learning API also introduces the concept of Pipeline. The official documentation listed below has a pretty clear example of how to get a minimal example working.</p>\n<ul>\n <li><a href=\"https://spark.apache.org/docs/latest/ml-pipeline.html\">Spark Documentation: ML Pipelines</a></li>\n</ul>\n<p>The next code chunk aims to implement a scikit-learn like interface (i.e. we can call <code>.fit</code> on the data to train the model and <code>.score</code> to evaluate it, etc.) that wraps a Spark MLlib pipeline underneath the hood.</p>\n</div>"}]},"apps":[],"jobName":"paragraph_1520117223455_-1863910837","id":"20180303-144703_259109443","dateCreated":"2018-03-03T14:47:03-0800","dateStarted":"2018-03-11T22:25:55-0700","dateFinished":"2018-03-11T22:25:55-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3375"},{"text":"\nobject MLUtils {\n val AUCEvaluator = new BinaryClassificationEvaluator().\n setMetricName(\"areaUnderROC\")\n\n\n def checkIsFitted(modelFitted: Option[TrainValidationSplitModel]): TrainValidationSplitModel = {\n modelFitted match {\n case Some(model) => model\n case None => throw new Exception(\"please call .fit(sparkDataFrame) to fit the model first\")\n }\n }\n\n\n class TunedModelPipeline(val numCols: Array[String], val catCols: Array[String], val labelCol: String) {\n private val featureCol = \"features\"\n private val predictionCol = \"prediction\"\n private val probabilityCol = \"probability\"\n\n var tunedModelFitted: Option[TrainValidationSplitModel] = None\n\n def fit(data: DataFrame): this.type = {\n val tunedModel = buildTunedModelPipeline()\n tunedModelFitted = Option(tunedModel.fit(data))\n this\n }\n\n def predict(data: DataFrame): DataFrame = {\n transform(data).select(F.col(predictionCol).cast(\"int\"))\n }\n\n def predictProba(data: DataFrame): DataFrame = {\n transform(data).select(probabilityCol)\n }\n\n def score(data: DataFrame): Double = {\n val dataFitted = transform(data).\n select(probabilityCol, labelCol)\n\n val evaluator = AUCEvaluator.\n setLabelCol(labelCol).\n setRawPredictionCol(probabilityCol)\n evaluator.evaluate(dataFitted)\n }\n\n def getFeatureImportance(): Array[(Double, String)] = {\n val bestModel = checkIsFitted(tunedModelFitted).bestModel.asInstanceOf[PipelineModel]\n val classifier = bestModel.stages.last.asInstanceOf[GBTClassificationModel]\n classifier.\n featureImportances.\n toArray.\n zip(numCols ++ catCols).\n sorted.\n reverse\n }\n\n private def transform(data: DataFrame): DataFrame = {\n checkIsFitted(tunedModelFitted).transform(data)\n }\n\n private def buildTunedModelPipeline(): TrainValidationSplit = {\n // numeric encode the categorical columns;\n // standardize the numeric ones;\n // and concatenate/assemble them together\n val (numColsPipeline, numColsScaled) = standardizeNumCols()\n val (catColsPipeline, catColsIndexed) = indexCatCols()\n\n val assembledCols = Array(numColsScaled, catColsIndexed)\n val assembler = new VectorAssembler().\n setInputCols(assembledCols).\n setOutputCol(featureCol)\n\n // we could increase the maxIter if time permitting\n val gbt = new GBTClassifier().\n setFeaturesCol(assembler.getOutputCol).\n setLabelCol(labelCol).\n setPredictionCol(predictionCol).\n setProbabilityCol(probabilityCol).\n setMaxBins(50).\n setStepSize(0.1).\n setMaxIter(30).\n setSubsamplingRate(1.0)\n\n val estimator = new Pipeline().setStages(\n Array(numColsPipeline, catColsPipeline, assembler, gbt))\n\n val evaluator = AUCEvaluator.\n setLabelCol(labelCol).\n setRawPredictionCol(probabilityCol)\n\n // we could increase size of grid if time permitting\n val paramGrid = new ParamGridBuilder().\n addGrid(gbt.maxDepth, Array(5, 10)).\n build()\n\n // we could change to CrossValidator if time permitting\n val tunedModel = new TrainValidationSplit().\n setEstimator(estimator).\n setEvaluator(evaluator).\n setEstimatorParamMaps(paramGrid).\n setTrainRatio(0.8)\n tunedModel\n }\n\n private def standardizeNumCols(): (Pipeline, String) = {\n val assembler = new VectorAssembler().\n setInputCols(numCols).\n setOutputCol(\"numColsAssemblded\")\n val scaler = new StandardScaler().\n setInputCol(assembler.getOutputCol).\n setOutputCol(\"numColsScaled\").\n setWithStd(true).\n setWithMean(true)\n val numColsPipeline = new Pipeline().setStages(Array(assembler, scaler))\n\n (numColsPipeline, scaler.getOutputCol)\n }\n\n private def indexCatCols(): (Pipeline, String) = {\n // to numeric encode categorical columns, one\n // has to create a StringIndexer for each column,\n // support for multiple columns is a work in progress\n // https://issues.apache.org/jira/browse/SPARK-11215\n val catColsIndexer = catCols.map { col =>\n val indexer = new StringIndexer().\n setInputCol(col).\n setOutputCol(col + \"Indexed\").\n setHandleInvalid(\"error\")\n indexer\n }\n\n val inputCols = catColsIndexer.map(indexer => indexer.getOutputCol)\n val assembler = new VectorAssembler().\n setInputCols(inputCols).\n setOutputCol(\"catColsAssembled\")\n val catColsPipeline = new Pipeline().setStages(catColsIndexer ++ Array(assembler))\n\n (catColsPipeline, assembler.getOutputCol)\n }\n }\n}","user":"anonymous","dateUpdated":"2018-03-11T22:25:55-0700","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"defined object MLUtils\n"}]},"apps":[],"jobName":"paragraph_1520100746594_-1671521668","id":"20180303-101226_1748393236","dateCreated":"2018-03-03T10:12:26-0800","dateStarted":"2018-03-11T22:26:06-0700","dateFinished":"2018-03-11T22:26:09-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3376"},{"text":"val model = new MLUtils.TunedModelPipeline(numCols, catCols, labelCol)\nmodel.fit(dfTrain)","user":"anonymous","dateUpdated":"2018-03-11T22:25:56-0700","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"model: MLUtils.TunedModelPipeline = MLUtils$TunedModelPipeline@655726a7\nres73: model.type = MLUtils$TunedModelPipeline@655726a7\n"}]},"apps":[],"jobName":"paragraph_1520107722223_-1261587375","id":"20180303-120842_627692451","dateCreated":"2018-03-03T12:08:42-0800","dateStarted":"2018-03-11T22:26:09-0700","dateFinished":"2018-03-11T22:27:22-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3377"},{"text":"// every crude model sanity check, yes, i know it's better to use logging ...\nprintln(\"Model Feature Importance:\")\nmodel.getFeatureImportance.foreach(println)","user":"anonymous","dateUpdated":"2018-03-11T22:25:56-0700","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"Model Feature Importance:\n(0.13623476335588416,age)\n(0.13016727589742866,capital_gain)\n(0.12682246562237498,marital_status)\n(0.10820988630277954,occupation)\n(0.10495253060784886,capital_loss)\n(0.07531670717567246,hours_per_week)\n(0.07185923604647733,education)\n(0.06483668588069878,native_country)\n(0.05309737578727951,education_num)\n(0.05219337219645827,relationship)\n(0.04249486070534674,workclass)\n(0.021579430216461234,fnlwgt)\n(0.011454254960404133,sex)\n(7.811552448853516E-4,race)\n"}]},"apps":[],"jobName":"paragraph_1520136986935_-441538334","id":"20180303-201626_266343429","dateCreated":"2018-03-03T20:16:26-0800","dateStarted":"2018-03-11T22:26:09-0700","dateFinished":"2018-03-11T22:27:23-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3378"},{"text":"println(\"Test AUC Score:\")\nprintln(model.score(dfTest))","user":"anonymous","dateUpdated":"2018-03-11T22:25:56-0700","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"Test AUC Score:\n0.9147940158924898\n"}]},"apps":[],"jobName":"paragraph_1520112540794_1128506656","id":"20180303-132900_155756670","dateCreated":"2018-03-03T13:29:00-0800","dateStarted":"2018-03-11T22:27:23-0700","dateFinished":"2018-03-11T22:27:23-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3379"},{"text":"// saving the prediction to a HDFS path, note that when\n// we save to a HDFS path, it is saved as a directory, and if\n// the process of writing the output is successful then there\n// will be a _SUCCESS file under the directory\nmodel.predict(dfTest).write.save(\"hdfs:/user/ethen/prediction\")","user":"anonymous","dateUpdated":"2018-03-11T22:25:56-0700","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[]},"apps":[],"jobName":"paragraph_1520724877748_-885706378","id":"20180310-153437_428457381","dateCreated":"2018-03-10T15:34:37-0800","dateStarted":"2018-03-11T22:27:23-0700","dateFinished":"2018-03-11T22:27:25-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3380"},{"text":"%md\n\n```bash\n# we can list what's under the directory\noutputdir=/user/ethen/prediction\nhdfs dfs -ls $outputdir\n\n# merge the result into a single parquet file \nhdfs dfs -getmerge $outputdir prediction.parquet\n\n# delete the directory if we're done with it\nhdfs dfs -rm -r $outputdir\n```\n\nWe can use the parquet file in a downstream python script\n\n```python\nimport pandas as pd\n\ndata = pd.read_parquet('prediction.parquet')\n```","user":"anonymous","dateUpdated":"2018-03-11T22:25:56-0700","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"<div class=\"markdown-body\">\n<pre><code class=\"bash\"># we can list what's under the directory\noutputdir=/user/ethen/prediction\nhdfs dfs -ls $outputdir\n\n# merge the result into a single parquet file \nhdfs dfs -getmerge $outputdir prediction.parquet\n\n# delete the directory if we're done with it\nhdfs dfs -rm -r $outputdir\n</code></pre>\n<p>We can use the parquet file in a downstream python script</p>\n<pre><code class=\"python\">import pandas as pd\n\ndata = pd.read_parquet('prediction.parquet')\n</code></pre>\n</div>"}]},"apps":[],"jobName":"paragraph_1520725195935_1483621676","id":"20180310-153955_1725721458","dateCreated":"2018-03-10T15:39:55-0800","dateStarted":"2018-03-11T22:25:56-0700","dateFinished":"2018-03-11T22:25:56-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3381"},{"text":"%md\n\n## Reference\n\n- [Spark ScalaDoc](http://spark.apache.org/docs/latest/api/scala/index.html#package)\n- [GitBook: Mastering Apache Spark 2](https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details)\n- [Spark Documentation: ML Pipelines](https://spark.apache.org/docs/latest/ml-pipeline.html)\n- [DataBricks Documentation: Binary Classification Example](https://docs.databricks.com/spark/latest/mllib/binary-classification-mllib-pipelines.html)","user":"anonymous","dateUpdated":"2018-03-11T22:25:56-0700","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"<div class=\"markdown-body\">\n<h2>Reference</h2>\n<ul>\n <li><a href=\"http://spark.apache.org/docs/latest/api/scala/index.html#package\">Spark ScalaDoc</a></li>\n <li><a href=\"https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details\">GitBook: Mastering Apache Spark 2</a></li>\n <li><a href=\"https://spark.apache.org/docs/latest/ml-pipeline.html\">Spark Documentation: ML Pipelines</a></li>\n <li><a href=\"https://docs.databricks.com/spark/latest/mllib/binary-classification-mllib-pipelines.html\">DataBricks Documentation: Binary Classification Example</a></li>\n</ul>\n</div>"}]},"apps":[],"jobName":"paragraph_1520115457212_-1563416121","id":"20180303-141737_1376676292","dateCreated":"2018-03-03T14:17:37-0800","dateStarted":"2018-03-11T22:25:56-0700","dateFinished":"2018-03-11T22:25:56-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:3382"}],"name":"sparkml","id":"2D94T8JSY","angularObjects":{"2D6F3C8XC:shared_process":[],"2D6U8BKVC:shared_process":[],"2D8X57GY1:shared_process":[],"2D7P7W9B8:shared_process":[],"2D63DK9YV:shared_process":[],"2D86PDBSB:shared_process":[],"2D7Z6Q7D5:shared_process":[],"2D5QJMW6N:shared_process":[],"2D5K9G6DH:shared_process":[],"2D6YWRYKW:shared_process":[],"2D61GKS3Q:shared_process":[],"2D6X26XAN:shared_process":[],"2D71XSS46:shared_process":[],"2D8W9XG6T:shared_process":[],"2D5UM6MWZ:shared_process":[],"2D5JYXUJP:shared_process":[],"2D83P7E7G:shared_process":[],"2D5T5NCSP:shared_process":[],"2D8UXQQ2S:shared_process":[]},"config":{"looknfeel":"default","personalizedMode":"false"},"info":{}}