[SPARK-3161][MLLIB] Adding a node Id caching mechanism for training decision trees #2868

Closed
wants to merge 1 commit into from

Conversation

codedeft

[SPARK-3161][MLLIB] Adding a node Id caching mechanism for training decision trees. @jkbradley @mengxr @chouqin Please review this.

@AmplabJenkins

Can one of the admins verify this patch?

@dbtsai
Member

dbtsai commented Oct 21, 2014

Jenkins, please start the test!

@mengxr
Contributor

mengxr commented Oct 21, 2014

Jenkins, add to whitelist.

@mengxr
Contributor

mengxr commented Oct 21, 2014

test this please

@SparkQA

SparkQA commented Oct 21, 2014

QA tests have started for PR 2868 at commit 9ea76df.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 21, 2014

QA tests have finished for PR 2868 at commit 9ea76df.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21959/
Test FAILed.

@codedeft
Author

Seems like there are lots of "line too long" messages. Will address this.

@codedeft
Author

test this please

@SparkQA

SparkQA commented Oct 21, 2014

QA tests have started for PR 2868 at commit 6b05af0.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 21, 2014

QA tests have started for PR 2868 at commit 13585e8.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 21, 2014

QA tests have finished for PR 2868 at commit 6b05af0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21965/
Test PASSed.

@SparkQA

SparkQA commented Oct 21, 2014

QA tests have finished for PR 2868 at commit 13585e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class JavaFutureActionWrapper[S, T](futureAction: FutureAction[S], converter: S => T)
    • class SerializableMapWrapper[A, B](underlying: collection.Map[A, B])
    • case class ReconnectWorker(masterUrl: String) extends DeployMessage
    • class Predict(
    • case class EvaluatePython(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21966/
Test PASSed.

    bins: Array[Array[Bin]]): Unit = {
  val updatedRDD = data.zip(cur).map {
    dataPoint => {
      cfor(0)(_ < nodeIdUpdaters.length, _ + 1)(
Contributor

Can you use a while loop to do this?

Member

I second that, especially if it eliminates the dependence on spire (since spire is not used elsewhere in Spark).

Author

Will do. I used it because spire was somehow already included (maybe one of the dependent packages uses it).
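The suggested change is mechanical. A minimal sketch of the rewrite, using a hypothetical nodeIdUpdaters array (names are illustrative, not the PR's actual code):

```scala
// Hypothetical stand-in for the PR's per-tree updaters array.
val nodeIdUpdaters: Array[Int => Int] = Array(_ + 1, _ * 2, identity)

// Before (spire): cfor(0)(_ < nodeIdUpdaters.length, _ + 1)(i => ...)
// After: a plain while loop. No external dependency, and the JVM
// compiles it down to the same tight loop cfor would produce.
var i = 0
while (i < nodeIdUpdaters.length) {
  // ... per-tree node-id update logic would go here ...
  i += 1
}
```

This drops the spire import entirely, which is the main point of the review comment.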

@chouqin
Contributor

chouqin commented Oct 21, 2014

@codedeft Thanks for your nice work. I have added some comments inline. Here are some high level comments:

  1. Have you tested the performance after this change? As discussed in SPARK-3161, this will help little for shallow trees. How much performance gain will this change give for deep trees? If it gives a large gain, I think we should add more unit tests for this option and refactor the code to address code reuse (for example, there is some duplication between binSeqOp and binSeqOpWithNodeIdCache).
  2. Is checkpointing really necessary to avoid a long lineage? Maybe my understanding is not right, but to my knowledge, each time we do an aggregation, the nodeIdCache will be computed. If we persist it to disk (using persist(StorageLevel.MEMORY_AND_DISK)) and unpersist it when it is no longer needed, then it will be persisted to disk each time it gets computed. We could remove checkpointing and make the code cleaner and faster (checkpoint persists the RDD into a distributed file system).

@jkbradley
Member

@chouqin Checkpointing is helpful since it is more persistent than persist(). Checkpointing stores data to HDFS (with replication), so that the RDD is stored even if a worker dies. With persist(), part of the RDD will be lost when a worker dies. For my big decision tree tests, I do see EC2 workers die periodically (though not that often), and I am sure it is a bigger issue for corporate (big) clusters.

@jkbradley
Member

@codedeft Thanks for the PR! I'll make a pass now.

* @param data The RDD of training rows.
* @param cur The initial values in the cache
* (should be an Array of all 1's (meaning the root nodes)).
* @param checkpointDir The checkpoint directory where
Member

Currently, this skips checkpointing if checkpointDir == None. However, a user could set the SparkContext checkpointDir before calling DecisionTree. Can the behavior be changed as follows:

  • If a checkpointDir is given here, then it should overwrite any preset checkpointDir in SparkContext.
  • If no checkpointDir is given, then the code should check the SparkContext (via cur.sparkContext.getCheckpointDir) to see if one has already been set.
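The suggested precedence can be sketched with a hypothetical helper (resolveCheckpointDir is illustrative, not part of the PR):

```scala
// Precedence sketch: an explicitly passed checkpointDir overrides any
// directory already set on the SparkContext; otherwise fall back to the
// context's value (e.g. cur.sparkContext.getCheckpointDir), if any.
def resolveCheckpointDir(
    explicitDir: Option[String],
    contextDir: Option[String]): Option[String] =
  explicitDir.orElse(contextDir)
```

With this ordering, resolveCheckpointDir(Some("/a"), Some("/b")) yields Some("/a"), and checkpointing is skipped only when both are None.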

Author

Will do.

@codedeft
Author

I've addressed the comments. Please review at your convenience. I'll publish some big data results once they are actually done.

Thanks!

@jkbradley
Member

Your test analysis is pretty convincing! Keeping the PR sounds good.

* A given TreePoint would belong to a particular node per tree.
* This is used to keep track of which node for a particular tree that a TreePoint belongs to.
* A separate RDD of Array[Int] needs to be maintained and updated at each iteration.
* @param nodeIdsForInstances The initial values in the cache
Member

A bit unclear; perhaps update to: "For each TreePoint, an array over trees of the node index in each tree. (Initially, values should all be 1 for root node.)"

@jkbradley
Member

@codedeft I added 2 small comments. Other than that, it LGTM. Thanks for the PR!
CC: @mengxr

@SparkQA

SparkQA commented Oct 31, 2014

Test build #22639 has started for PR 2868 at commit a078fc8.

  • This patch merges cleanly.

@jkbradley
Member

LGTM Thanks!

@SparkQA

SparkQA commented Oct 31, 2014

Test build #22639 has finished for PR 2868 at commit a078fc8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22639/
Test FAILed.

@jkbradley
Member

Some sort of YARN failure.

@SparkQA

SparkQA commented Oct 31, 2014

Test build #498 has started for PR 2868 at commit a078fc8.

  • This patch merges cleanly.

@codedeft
Author

Yea, I'm also getting a YARN compilation failure on my machine after doing the latest pull. Is this happening to everyone?

@jkbradley
Member

Yep, apparently so, but someone's working on fixing it ASAP

@SparkQA

SparkQA commented Oct 31, 2014

Test build #498 has finished for PR 2868 at commit a078fc8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@codedeft
Author

codedeft commented Nov 1, 2014

The conflict is caused by the GBoosting check-in. I'm taking a look.

@SparkQA

SparkQA commented Nov 1, 2014

Test build #22684 has started for PR 2868 at commit 5f5a156.

  • This patch merges cleanly.

@codedeft
Author

codedeft commented Nov 1, 2014

@mengxr @jkbradley Can you merge this? This is the only way you can effectively train 10 large trees with the mnist8m dataset.

With node Id cache, it took a very long time, but we were able to finish training 10 trees on mnist8m in 15 hours with 20 executors. SF with local training can finish this in 20 minutes, so local training would be a must in the next release.

However, without node Id cache, it looks like it's not even possible. It's currently only 60% of the way there and it's already taken 13 hours and dozens of fetch failures. I feel that it might eventually just fail because the models are just too big to pass around.

@manishamde
Contributor

@CodeLeft I agree that local training should be a high priority. Just curious -- what's the depth of the tree in the failing case?

I vote for merging this PR since there is no loss in performance for shallow trees and gain in performance for deep trees.

@SparkQA

SparkQA commented Nov 1, 2014

Test build #22684 has finished for PR 2868 at commit 5f5a156.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22684/
Test PASSed.

@CodeLeft

CodeLeft commented Nov 1, 2014

@manishamde the person you want to respond to is @codedeft. I'm not involved with this project. Our names are close, but off by one letter. Sorry for the intrusion, I'll see myself out.

@manishamde
Contributor

@CodeLeft I am so sorry.

@codedeft
Author

codedeft commented Nov 1, 2014

It finally finished.

10 Trees, 30 depth limit. mnist8m, 20 executors:

15 hours with node Id cache.
21 hours without node Id cache.

@mengxr
Contributor

mengxr commented Nov 2, 2014

I've merged this into master. Thanks @codedeft for adding node id caching, and @chouqin @manishamde @jkbradley for reviewing the code!

codedeft pushed a commit that referenced this pull request Nov 2, 2014
[SPARK-3161][MLLIB] Adding a node Id caching mechanism for training decision trees. jkbradley mengxr chouqin Please review this.

Author: Sung Chung <schung@alpinenow.com>

Closes #2868 from codedeft/SPARK-3161 and squashes the following commits:

5f5a156 [Sung Chung] [SPARK-3161][MLLIB] Adding a node Id caching mechanism for training decision trees.
@mengxr
Contributor

mengxr commented Nov 2, 2014

@codedeft The merge script didn't close this PR automatically. Could you help close it? Thanks!

@codedeft codedeft closed this Nov 2, 2014
@dbtsai dbtsai deleted the SPARK-3161 branch January 14, 2015 22:11

9 participants