[SPARK-12182][ML] Distributed binning for trees in spark.ml #10231


Closed
sethah wants to merge 10 commits into master from sethah/SPARK-12182

Conversation

@sethah (Contributor) commented Dec 9, 2015

This PR changes the findSplits method in spark.ml to perform split calculations on the workers. It mirrors PR-8246, which added the same feature to MLlib.
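
As a rough illustration of the approach, here is a simplified, self-contained sketch (not the actual Spark code; the toy data and the findSplitsForFeature helper are hypothetical). Sampled feature values are keyed by feature index and grouped across the cluster, so per-feature split thresholds are computed on the workers instead of being collected to the driver:

import org.apache.spark.{SparkConf, SparkContext}

object DistributedBinningSketch {
  // Hypothetical helper: pick numSplits thresholds from the sorted samples,
  // roughly equally spaced by rank.
  def findSplitsForFeature(samples: Array[Double], numSplits: Int): Array[Double] = {
    val sorted = samples.sorted
    val stride = sorted.length.toDouble / (numSplits + 1)
    (1 to numSplits)
      .map(i => sorted(math.min((i * stride).toInt, sorted.length - 1)))
      .distinct
      .toArray
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("binning-sketch"))

    // Toy data: each row is an array of continuous feature values.
    val rows = sc.parallelize(Seq(
      Array(1.0, 10.0), Array(2.0, 20.0), Array(3.0, 30.0), Array(4.0, 40.0)))

    // Key every sampled value by its feature index, group by feature, and
    // compute per-feature splits on the workers rather than collecting all
    // samples back to the driver.
    val splitsByFeature = rows
      .flatMap(_.zipWithIndex.map { case (value, idx) => (idx, value) })
      .groupByKey()
      .mapValues(samples => findSplitsForFeature(samples.toArray, numSplits = 2))
      .collectAsMap()

    splitsByFeature.toSeq.sortBy(_._1).foreach { case (idx, splits) =>
      println(s"feature $idx -> splits: ${splits.mkString(", ")}")
    }
    sc.stop()
  }
}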

@sethah (Contributor, Author) commented Dec 9, 2015

@NathanHowell would you be able to review this?

cc @jkbradley

@NathanHowell commented Dec 9, 2015

Yeah I can take a look tonight or tomorrow

@SparkQA commented Dec 9, 2015

Test build #47451 has finished for PR 10231 at commit 8f06b34.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.groupByKey(numPartitions)
.map { case (idx, samples) =>
  val thresholds = findSplitsForContinuousFeature(samples.toArray, metadata, idx)
  val splits: Array[Split] = thresholds.map(thresh => new ContinuousSplit(idx, thresh))
Review comment (Contributor):

(as mentioned in jenkins): scala style long line

Reply (Contributor, Author):

Fixed.

@holdenk (Contributor) commented Dec 9, 2015

At first glance this seems to share a lot of code with the original implementation in MLlib (they both even work with RDDs of LabeledPoints). Maybe we could move much of this to a common util class or similar?

@sethah (Contributor, Author) commented Dec 9, 2015

This JIRA was actually created as a blocker for SPARK-12183, which covers removing the MLlib implementation entirely and turning it into a wrapper around spark.ml. So the code duplication should be very short-lived.

@holdenk (Contributor) commented Dec 9, 2015

Ah, great. If we're killing the old code soon, then no worries about the temporary duplication.

@SparkQA commented Dec 9, 2015

Test build #47453 has finished for PR 10231 at commit 6c4ba6f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Unordered features
// 2^(maxFeatureValue - 1) - 1 combinations
val featureArity = metadata.featureArity(i)
val split: IndexedSeq[Split] = Range(0, metadata.numSplits(i)).map { splitIndex =>
Review comment (Member):

You could use an Array.tabulate here. Something like

Array.tabulate[Split](numSplits(i)) { splitIndex =>
  ...
}

This avoids allocating two collections: one for the splits Range and another for splits.toArray.
Also note that the type parameter [Split] is required here: the compiler would otherwise infer Array[CategoricalSplit] as the return type, which, because arrays are not covariant, is not a subtype of Array[Split] and thus would not compile.
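
A minimal standalone sketch of that point, with the Split hierarchy reduced to a toy version for illustration:

sealed trait Split
final case class CategoricalSplit(featureIndex: Int, category: Int) extends Split

object TabulateExample extends App {
  val numSplits = 3

  // With the explicit type parameter, the array is typed Array[Split] up front.
  // Without it, Scala would infer Array[CategoricalSplit], and since arrays are
  // invariant, an Array[CategoricalSplit] cannot be assigned to Array[Split].
  val splits: Array[Split] =
    Array.tabulate[Split](numSplits)(i => CategoricalSplit(featureIndex = 0, category = i))

  splits.foreach(println)
}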

Reply (Contributor, Author):

Done. Thanks for the suggestion!

@SparkQA commented Dec 11, 2015

Test build #47583 has finished for PR 10231 at commit c34075b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah (Contributor, Author) commented Jan 4, 2016

@NathanHowell do you think you'll have any time to take a look at this?

@NathanHowell commented:

@sethah looks good to me. 👍

@sethah (Contributor, Author) commented Jan 6, 2016

@NathanHowell Thank you for reviewing!

@jkbradley (Member) commented:

Would you have time to test this on a small dataset? The original PR confirmed it's faster for a larger dataset, but I'm curious if it affects timing (adversely) on small data.

@sethah (Contributor, Author) commented Mar 16, 2016

I can set something up. Do you have a specific dataset size in mind or even a specific dataset?

@SparkQA commented Mar 16, 2016

Test build #2645 has finished for PR 10231 at commit c34075b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley (Member) commented:

No specific dataset size. I was thinking of something in this ballpark:

  • 1K-10K rows
  • 10-100 columns
  • maxDepth 1 - 2 (shallow tree to avoid amortizing cost of choosing splits)

Thanks!

metadata: DecisionTreeMetadata,
continuousFeatures: IndexedSeq[Int]): Array[Array[Split]] = {

val continuousSplits = {
Review comment (Member):

Put type here for code clarity

@jkbradley (Member) commented:

Could you also make this change: https://github.com/apache/spark/pull/8246/files#diff-8ad842a043888473bb2b527e818de04bR645

Done with my pass. I added a few minor comments which weren't in the spark.mllib PR.

@sethah (Contributor, Author) commented Mar 18, 2016

@jkbradley I ran some local timings comparing before/after this change. I used RandomForestRegressor with all continuous features. It looks like there is a small performance impact on micro datasets, but no noticeable performance hit on larger in-memory datasets. What do you think?

I just ran five trials each, but I can set up something more robust if needed.

options = {'numRows': 10k, 'numCols': 100, 'maxDepth': 2}
   with_patch  without_patch
0    0.991490       0.778417
1    0.867575       0.862355
2    0.894913       0.987718
3    0.920691       0.790363
4    0.933628       0.951237
options = {'numRows': 1k, 'numCols': 10, 'maxDepth': 2}
   with_patch  without_patch
0    0.038660       0.015930
1    0.051568       0.015814
2    0.039481       0.018386
3    0.044415       0.016335
4    0.049889       0.017497
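
For reference, a rough sketch of the kind of timing loop this implies (the actual harness isn't shown in the thread; the random data generation and the Spark 2.x-style API used here are illustrative assumptions):

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.sql.SparkSession
import scala.util.Random

object BinningTimingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]").appName("binning-timing").getOrCreate()
    import spark.implicits._

    // Illustrative dataset in the suggested ballpark: 1k rows, 10 columns,
    // all continuous features.
    val rng = new Random(42)
    val df = Seq.fill(1000)(
        (rng.nextDouble(), Vectors.dense(Array.fill(10)(rng.nextDouble()))))
      .toDF("label", "features")
      .cache()
    df.count() // materialize the cache before timing

    val rf = new RandomForestRegressor().setMaxDepth(2)
    val timesSec = (1 to 5).map { _ =>
      val start = System.nanoTime()
      rf.fit(df)
      (System.nanoTime() - start) / 1e9
    }
    println(timesSec.mkString(", "))
    spark.stop()
  }
}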

@jkbradley (Member) commented:

That doesn't seem too bad. I'd say we should go ahead with your PR. If we want to optimize for small data, we can add a local implementation at some point. (But that's far-future.)

@SparkQA commented Mar 18, 2016

Test build #53552 has finished for PR 10231 at commit a847bc9.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 18, 2016

Test build #53553 has finished for PR 10231 at commit af3559a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 18, 2016

Test build #53555 has finished for PR 10231 at commit d8a4c77.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 19, 2016

Test build #53591 has finished for PR 10231 at commit c9bec20.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -956,7 +956,7 @@ private[ml] object RandomForest extends Logging {
       valueCounts.map(_._1)
     } else {
       // stride between splits
-      val stride: Double = featureSamples.length.toDouble / (numSplits + 1)
+      val stride: Double = featureSamples.size.toDouble / (numSplits + 1)

Review comment:

This will do a second pass over the Iterable. Would it be preferable to combine this into the foldLeft above so it only does a single pass?
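
A minimal sketch of the single-pass idea, standalone and with hypothetical names (the real method does more than counting):

object SinglePassCount {
  // Fold once over the samples, accumulating both the per-value counts and
  // the total size, instead of traversing the Iterable a second time for .size.
  def countsAndSize(featureSamples: Iterable[Double]): (Map[Double, Int], Int) =
    featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
      case ((counts, size), value) =>
        (counts.updated(value, counts.getOrElse(value, 0) + 1), size + 1)
    }

  def main(args: Array[String]): Unit = {
    val (counts, size) = countsAndSize(Seq(1.0, 2.0, 2.0, 3.0))
    println(s"counts = $counts, size = $size") // Map(1.0 -> 1, 2.0 -> 2, 3.0 -> 1), 4
  }
}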

Reply (Contributor, Author):

Thanks for the suggestion! The latest commit should take care of it.

@SparkQA commented Mar 19, 2016

Test build #53595 has finished for PR 10231 at commit 8f5077f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley (Member) commented:

LGTM
Thanks @sethah for the PR and @NathanHowell for reviewing!
Merging with master

@asfgit asfgit closed this in 811a524 Mar 20, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
This PR changes the `findSplits` method in spark.ml to perform split calculations on the workers. This PR is meant to copy [PR-8246](apache#8246) which added the same feature for MLlib.

Author: sethah <seth.hendrickson16@gmail.com>

Closes apache#10231 from sethah/SPARK-12182.