[SPARK-12182][ML] Distributed binning for trees in spark.ml #10231
Conversation
@NathanHowell would you be able to review this? cc @jkbradley
Yeah I can take a look tonight or tomorrow
Test build #47451 has finished for PR 10231 at commit
.groupByKey(numPartitions)
.map { case (idx, samples) =>
  val thresholds = findSplitsForContinuousFeature(samples.toArray, metadata, idx)
  val splits: Array[Split] = thresholds.map(thresh => new ContinuousSplit(idx, thresh))
(as mentioned in jenkins): scala style long line
Fixed.
At first glance this seems to share a lot of code with the original implementation in MLlib (they both even work with RDDs of LabeledPoint) - maybe we could move much of this to a common util class or similar?
This JIRA was actually created as a blocker JIRA for SPARK-12183 which is for removing the MLlib code entirely and wrapping to spark.ml. So, the code duplication should be very short-lived.
Ah great - if we're killing the old code soon then no worries on the temporary duplication.
Test build #47453 has finished for PR 10231 at commit
// Unordered features
// 2^(maxFeatureValue - 1) - 1 combinations
val featureArity = metadata.featureArity(i)
val split: IndexedSeq[Split] = Range(0, metadata.numSplits(i)).map { splitIndex =>
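As an aside on the `2^(maxFeatureValue - 1) - 1` comment in the snippet above, the count can be sketched with a small hypothetical helper (not part of the PR):

```scala
// For an unordered categorical feature of arity M, a candidate split sends a
// nonempty proper subset of the M categories to the left child. There are
// 2^M - 2 such subsets; halving for left/right complement symmetry gives
// (2^M - 2) / 2 = 2^(M - 1) - 1 distinct splits.
def numUnorderedSplits(arity: Int): Int = (1 << (arity - 1)) - 1
```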
You could use an `Array.tabulate` here. Something like

Array.tabulate[Split](numSplits(i)) { splitIndex =>
  ...
}

This avoids allocating two collections, one for the splits range and the other for `splits.toArray`.

Also note that the type parameter `[Split]` is required here. This is because the compiler would otherwise infer `Array[CategoricalSplit]` as the return type which, because arrays are not covariant, is not a subtype of `Array[Split]` and would thus not compile.
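A self-contained sketch of the invariance point, using simplified stand-in types rather than the real spark.ml classes:

```scala
// Simplified stand-ins for the spark.ml Split hierarchy.
trait Split { def featureIndex: Int }
final class CategoricalSplit(val featureIndex: Int, val categories: Set[Double]) extends Split

// Without the explicit [Split] type parameter, the compiler infers
// Array[CategoricalSplit]; since Scala arrays are invariant, that is not an
// Array[Split], and the annotated val below would then fail to compile.
val numSplits = 3
val splits: Array[Split] = Array.tabulate[Split](numSplits) { splitIndex =>
  new CategoricalSplit(0, Set(splitIndex.toDouble))
}
```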
Done. Thanks for the suggestion!
Test build #47583 has finished for PR 10231 at commit
@NathanHowell do you think you'll have any time to take a look at this?
@sethah looks good to me. 👍
@NathanHowell Thank you for reviewing!
Would you have time to test this on a small dataset? The original PR confirmed it's faster for a larger dataset, but I'm curious if it affects timing (adversely) on small data.
I can set something up. Do you have a specific dataset size in mind or even a specific dataset?
Test build #2645 has finished for PR 10231 at commit
No specific dataset size. I was thinking of something in this ballpark:
Thanks!
    metadata: DecisionTreeMetadata,
    continuousFeatures: IndexedSeq[Int]): Array[Array[Split]] = {

  val continuousSplits = {
Put type here for code clarity
Could you also make this change: [https://github.com//pull/8246/files#diff-8ad842a043888473bb2b527e818de04bR645]

Done with my pass. I added a few minor comments which weren't in the spark.mllib PR.
@jkbradley I ran some local timings comparing before/after this change. I just ran five trials each, but I can set up something more robust if needed.
That does not seem that bad. I'd say we should go ahead with your PR. If we want to optimize for small data, we can add a local implementation at some point. (But that's far-future.)
Test build #53552 has finished for PR 10231 at commit
Test build #53553 has finished for PR 10231 at commit
Test build #53555 has finished for PR 10231 at commit
Test build #53591 has finished for PR 10231 at commit
@@ -956,7 +956,7 @@ private[ml] object RandomForest extends Logging {
       valueCounts.map(_._1)
     } else {
       // stride between splits
-      val stride: Double = featureSamples.length.toDouble / (numSplits + 1)
+      val stride: Double = featureSamples.size.toDouble / (numSplits + 1)
This will do a second pass over the `Iterable`. Would it be preferable to combine this into the `foldLeft` above so it only does a single pass?
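A minimal sketch of the single-pass idea, using plain collections and hypothetical names (the actual fix lives in `findSplitsForContinuousFeature`):

```scala
// Accumulate the per-value counts and the total sample count together in one
// foldLeft, instead of traversing the Iterable a second time for .size.
val featureSamples: Iterable[Double] = Seq(1.0, 1.0, 2.0, 3.0, 3.0, 3.0)

val (valueCounts, numSamples) =
  featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
    case ((counts, total), value) =>
      (counts + (value -> (counts.getOrElse(value, 0) + 1)), total + 1)
  }

val numSplits = 2
// stride between splits, computed without a second pass over the samples
val stride: Double = numSamples.toDouble / (numSplits + 1)
```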
Thanks for the suggestion! The latest commit should take care of it.
Test build #53595 has finished for PR 10231 at commit
LGTM
This PR changes the `findSplits` method in spark.ml to perform split calculations on the workers. This PR is meant to copy [PR-8246](apache#8246) which added the same feature for MLlib.

Author: sethah <seth.hendrickson16@gmail.com>

Closes apache#10231 from sethah/SPARK-12182.
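The pattern the PR introduces can be sketched with plain collections standing in for the RDD operations (the real code calls `groupByKey` on an RDD of sampled `(featureIndex, value)` pairs and computes thresholds on the workers; everything below is a simplified, hypothetical stand-in):

```scala
// Sampled (featureIndex, value) pairs, as produced by an upstream sampling step.
val sampled: Seq[(Int, Double)] =
  Seq((0, 1.0), (0, 2.0), (0, 3.0), (1, 10.0), (1, 20.0))

// Stand-in for RDD.groupByKey: each feature's samples end up together, so the
// split thresholds for that feature can be computed in one place (on a worker
// rather than the driver, in the Spark version).
val splitsByFeature: Map[Int, Array[Double]] =
  sampled
    .groupBy(_._1)
    .map { case (idx, pairs) =>
      val sorted = pairs.map(_._2).sorted.toVector
      // Stand-in for findSplitsForContinuousFeature: midpoints between
      // consecutive sorted sample values.
      val thresholds = sorted.zip(sorted.drop(1)).map { case (a, b) => (a + b) / 2 }
      idx -> thresholds.toArray
    }
```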