[SPARK-13784][ML] Persistence for RandomForestClassifier, RandomForestRegressor #12118

jkbradley · 2016-04-01T22:13:22Z

What changes were proposed in this pull request?

Main change: Added save/load for RandomForestClassifier, RandomForestRegressor (implementation details below)

Modified numTrees method (deprecation)

Goal: Use default implementations of unit tests which assume Estimators and Models share the same set of Params.
What this PR does: Moves method numTrees outside of trait TreeEnsembleModel. Adds it to GBT and RF Models. Deprecates it in RF Models in favor of new method getNumTrees. In Spark 2.1, we can have RF Models include Param numTrees.
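
A minimal sketch of the deprecation pattern described above, using hypothetical stand-in types rather than the actual Spark classes. The names `TreeEnsembleModelSketch`, `RFModelSketch` and `GBTModelSketch`, the `String` tree stubs, and the `"2.0.0"` version string are all assumptions made for illustration.

```scala
// Hypothetical stand-ins; these are NOT the real Spark classes.
trait TreeEnsembleModelSketch {
  // Each tree is stubbed as a String. Note: no numTrees member in the shared trait.
  def trees: Array[String]
}

class RFModelSketch(val trees: Array[String]) extends TreeEnsembleModelSketch {
  /** New accessor for the number of trees in the forest. */
  def getNumTrees: Int = trees.length

  // The old accessor stays for compatibility but warns at compile time.
  // The version string is an assumption, not taken from the PR.
  @deprecated("Use getNumTrees instead.", "2.0.0")
  def numTrees: Int = getNumTrees
}

class GBTModelSketch(val trees: Array[String]) extends TreeEnsembleModelSketch {
  // GBT models keep numTrees as a plain, non-deprecated member.
  def numTrees: Int = trees.length
}

object NumTreesDemo {
  def main(args: Array[String]): Unit = {
    val rf = new RFModelSketch(Array("t0", "t1", "t2"))
    println(rf.getNumTrees)
  }
}
```

Keeping the accessor out of the shared trait lets each model family choose its own migration path, which matches the stated plan of turning `numTrees` into a Param on RF Models later.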

Minor items

Fixes bugs in GBTClassificationModel, GBTRegressionModel fromOld methods where they assign the wrong old UID.

Implementation details

Split DecisionTreeModelReadWrite.loadTreeNodes into 2 methods in order to reuse some code for ensembles.
Added EnsembleModelReadWrite object with save/load implementations usable for RFs and GBTs
- These store all trees' nodes in a single DataFrame, and all trees' metadata in a second DataFrame.
Split trait RandomForestParams into parts in order to add more Estimator Params to RF models
Split DefaultParamsWriter.saveMetadata into two methods to allow ensembles to store sub-models' metadata in a single DataFrame. Same for DefaultParamsReader.loadMetadata
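
The two-table layout for ensembles can be illustrated with plain Scala collections standing in for DataFrames. This is only a sketch under stated assumptions: `Node`, `Tree`, `NodeRow`, `MetadataRow` and `EnsembleLayoutSketch` are hypothetical, simplified types, and the real node schema is not shown in this thread. Only the `treeID` and `metadata` column names come from the code under review.

```scala
// Hypothetical, simplified types; the PR's real node schema is not shown here.
final case class Node(id: Int, prediction: Double)
final case class Tree(nodes: Seq[Node], metadata: String)

// One row per node across ALL trees, tagged with the owning tree's index.
final case class NodeRow(treeID: Int, node: Node)
// One row per tree holding that tree's metadata (for example a JSON string).
final case class MetadataRow(treeID: Int, metadata: String)

object EnsembleLayoutSketch {
  /** Flatten an ensemble into (metadata rows, node rows), both keyed by treeID. */
  def flatten(trees: Seq[Tree]): (Seq[MetadataRow], Seq[NodeRow]) = {
    val indexed = trees.zipWithIndex
    val meta = indexed.map { case (t, i) => MetadataRow(i, t.metadata) }
    val nodes = indexed.flatMap { case (t, i) => t.nodes.map(n => NodeRow(i, n)) }
    (meta, nodes)
  }

  /** Rebuild the trees by grouping node rows on treeID and attaching metadata. */
  def rebuild(meta: Seq[MetadataRow], nodes: Seq[NodeRow]): Seq[Tree] = {
    val byTree: Map[Int, Seq[Node]] =
      nodes.groupBy(_.treeID).map { case (id, rows) => id -> rows.map(_.node).sortBy(_.id) }
    meta.sortBy(_.treeID).map(m => Tree(byTree.getOrElse(m.treeID, Seq.empty), m.metadata))
  }
}
```

Because every row carries its `treeID`, the number of tables stays fixed at two regardless of how many trees the ensemble contains.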

How was this patch tested?

Adds standard unit tests for RF save/load

SparkQA · 2016-04-01T22:18:55Z

Test build #54727 has finished for PR 12118 at commit 607ed4e.

This patch fails Scala style tests.
This patch does not merge cleanly.
This patch adds no public classes.

jkbradley · 2016-04-01T22:23:19Z

@GayathriMurali @hhbyyh Could you please help review this PR? Thanks!

SparkQA · 2016-04-01T22:36:55Z

Test build #54730 has finished for PR 12118 at commit cd07aff.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-04-01T23:00:24Z

That seems like a spurious or unrelated failure. I'll retest

SparkQA · 2016-04-01T23:03:15Z

Test build #54728 has finished for PR 12118 at commit b13b11f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-01T23:41:29Z

Test build #2727 has finished for PR 12118 at commit cd07aff.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

GayathriMurali · 2016-04-02T02:49:44Z

mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala

+    sql.createDataFrame(treesMetadataJson).toDF("treeID", "metadata")
+      .write.parquet(treesMetadataPath)
+    val dataPath = new Path(path, "data").toString
+    val nodeDataRDD = sql.sparkContext.parallelize(instance.trees.zipWithIndex).flatMap {


Is it alright to use flatMap to combine RDDs? Can we use sparkContext.union instead?

This is a single RDD, so there is nothing to union. The flatMap maps each element of the original RDD to zero or more elements in a new RDD. It should be fine.
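
To make the distinction concrete, here is a small sketch that uses plain Scala collections as a stand-in for RDDs (an assumption made so the example runs without Spark). The object name `FlatMapVsUnion` and the string node labels are hypothetical.

```scala
object FlatMapVsUnion {
  // Each "tree" is stubbed as a list of node labels (a hypothetical simplification).
  val trees: Seq[Seq[String]] = Seq(Seq("a0", "a1"), Seq("b0"))

  // One pass over a single collection: each (tree, index) element expands to many rows.
  val viaFlatMap: Seq[(Int, String)] =
    trees.zipWithIndex.flatMap { case (nodes, treeID) => nodes.map(n => (treeID, n)) }

  // The alternative: build one collection per tree, then concatenate them all.
  val viaUnion: Seq[(Int, String)] =
    trees.zipWithIndex
      .map { case (nodes, treeID) => nodes.map(n => (treeID, n)) }
      .foldLeft(Seq.empty[(Int, String)])(_ ++ _)
}
```

Both forms yield the same rows. The flatMap form works on one collection from the start, so no separate per-tree collections have to be created and combined.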

GayathriMurali · 2016-04-02T03:05:11Z

@jkbradley Thanks for this. This looks great and clarifies a lot of things I was trying to do. I had one minor comment; other than that, it looks fine to me.

jkbradley · 2016-04-02T17:32:02Z

@GayathriMurali Thanks for taking a look! @hhbyyh It would be great if you could take a look too since you're familiar with some of this code.

hhbyyh · 2016-04-03T14:34:26Z

Sorry for the delay. I'll start now.

hhbyyh · 2016-04-03T16:15:15Z

mllib/src/main/scala/org/apache/spark/ml/regression/RandomForestRegressor.scala

 }

 /**
 * :: Experimental ::
 * [[http://en.wikipedia.org/wiki/Random_forest  Random Forest]] model for regression.
 * It supports both continuous and categorical features.
- * @param _trees  Decision trees in the ensemble.
+  *


Indent correctly?
The same applies to the next few lines.

hhbyyh · 2016-04-03T16:32:31Z

just some minor comments. LGTM.

SparkQA · 2016-04-04T06:21:48Z

Test build #54828 has finished for PR 12118 at commit afed67b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-04T07:35:30Z

Test build #54836 has finished for PR 12118 at commit e0306a9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-04-04T17:22:22Z

@GayathriMurali Thanks for the initial commits, and @hhbyyh thanks for the review! I'll go ahead and merge this with master since you gave an LGTM pending minor changes.

GayathriMurali and others added 7 commits April 1, 2016 15:15

SPARK-13784 Model export/import for Spark ml RandomForests

5505fe7

SPARK-13783 Model export/import for spark.ml:RandomForests

00963f6

Implemented read/write for RandomForestClassifier, Regressor.

187f32c

PR cleanup

366655d

Moved numTrees Param outside of RandomForest*Model Params, and deprecated current numTrees val

0f27c3f

PR cleanups

80fea0a

cleanup

b13b11f

jkbradley force-pushed the GayathriMurali-SPARK-13784 branch from 607ed4e to b13b11f April 1, 2016 22:18

scala style fixes

cd07aff

jkbradley mentioned this pull request Apr 1, 2016

[Spark-13784][ML][WIP] Model export/import for spark.ml: RandomForests #12023

Closed

GayathriMurali reviewed Apr 2, 2016

hhbyyh reviewed Apr 3, 2016

Updates per code review

afed67b

jkbradley added 2 commits April 3, 2016 23:24

Merge remote-tracking branch 'upstream/master' into GayathriMurali-SPARK-13784

e35aa2e

Fixed merge

e0306a9

asfgit closed this in 89f3bef Apr 4, 2016

jkbradley deleted the GayathriMurali-SPARK-13784 branch April 4, 2016 18:57
