Skip to content

[SPARK-13784][ML] Persistence for RandomForestClassifier, RandomForestRegressor #12118

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 11 commits into from

Conversation

jkbradley
Copy link
Member

What changes were proposed in this pull request?

Main change: Added save/load for RandomForestClassifier, RandomForestRegressor (implementation details below)

Modified numTrees method (deprecation)

  • Goal: Use default implementations of unit tests which assume Estimators and Models share the same set of Params.
  • What this PR does: Moves method numTrees outside of trait TreeEnsembleModel. Adds it to GBT and RF Models. Deprecates it in RF Models in favor of new method getNumTrees. In Spark 2.1, we can have RF Models include Param numTrees.

Minor items

  • Fixes bugs in GBTClassificationModel, GBTRegressionModel fromOld methods where they assign the wrong old UID.

Implementation details

  • Split DecisionTreeModelReadWrite.loadTreeNodes into 2 methods in order to reuse some code for ensembles.
  • Added EnsembleModelReadWrite object with save/load implementations usable for RFs and GBTs
    • These store all trees' nodes in a single DataFrame, and all trees' metadata in a second DataFrame.
  • Split trait RandomForestParams into parts in order to add more Estimator Params to RF models
  • Split DefaultParamsWriter.saveMetadata into two methods to allow ensembles to store sub-models' metadata in a single DataFrame. Same for DefaultParamsReader.loadMetadata

How was this patch tested?

Adds standard unit tests for RF save/load

@jkbradley jkbradley force-pushed the GayathriMurali-SPARK-13784 branch from 607ed4e to b13b11f Compare April 1, 2016 22:18
@SparkQA
Copy link

SparkQA commented Apr 1, 2016

Test build #54727 has finished for PR 12118 at commit 607ed4e.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member Author

@GayathriMurali @hhbyyh Could you please help review this PR? Thanks!

@SparkQA
Copy link

SparkQA commented Apr 1, 2016

Test build #54730 has finished for PR 12118 at commit cd07aff.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member Author

That seems like a spurious or unrelated failure. I'll retest

@SparkQA
Copy link

SparkQA commented Apr 1, 2016

Test build #54728 has finished for PR 12118 at commit b13b11f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 1, 2016

Test build #2727 has finished for PR 12118 at commit cd07aff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sql.createDataFrame(treesMetadataJson).toDF("treeID", "metadata")
.write.parquet(treesMetadataPath)
val dataPath = new Path(path, "data").toString
val nodeDataRDD = sql.sparkContext.parallelize(instance.trees.zipWithIndex).flatMap {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it alright to use flatMap to combine RDDs? Can we use sparkContext.union instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a single RDD. The flatMap maps every element of the original RDD to multiple elements in a new RDD. It should be fine.

@GayathriMurali
Copy link
Contributor

@jkbradley Thanks for this. This looks great and clarifies a lot of things I was trying to do. I had one minor comment, except that it looks fine to me.

@jkbradley
Copy link
Member Author

@GayathriMurali Thanks for taking a look! @hhbyyh It would be great if you could take a look too since you're familiar with some of this code.

@hhbyyh
Copy link
Contributor

hhbyyh commented Apr 3, 2016

Sorry for the delay. I'll start now.

}

/**
* :: Experimental ::
* [[http://en.wikipedia.org/wiki/Random_forest Random Forest]] model for regression.
* It supports both continuous and categorical features.
* @param _trees Decision trees in the ensemble.
*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indent correctly?
Similar for a few lines next.

@hhbyyh
Copy link
Contributor

hhbyyh commented Apr 3, 2016

just some minor comments. LGTM.

@SparkQA
Copy link

SparkQA commented Apr 4, 2016

Test build #54828 has finished for PR 12118 at commit afed67b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 4, 2016

Test build #54836 has finished for PR 12118 at commit e0306a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member Author

@GayathriMurali Thanks for the initial commits, and @hhbyyh thanks for the review! I'll go ahead and merge this with master since you gave a LGTM pending minor changes.

@asfgit asfgit closed this in 89f3bef Apr 4, 2016
@jkbradley jkbradley deleted the GayathriMurali-SPARK-13784 branch April 4, 2016 18:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants