[SPARK-3161][MLLIB] Adding a node Id caching mechanism for training decision trees. #2868
Conversation
Can one of the admins verify this patch?
Jenkins, please start the test!
Jenkins, add to whitelist.
test this please
QA tests have started for PR 2868 at commit
QA tests have finished for PR 2868 at commit
Test FAILed.
Seems like there are lots of "line too long" messages. Will address this.
test this please
QA tests have started for PR 2868 at commit
QA tests have started for PR 2868 at commit
QA tests have finished for PR 2868 at commit
Test PASSed.
QA tests have finished for PR 2868 at commit
Test PASSed.
    bins: Array[Array[Bin]]): Unit = {
  val updatedRDD = data.zip(cur).map {
    dataPoint => {
      cfor(0)(_ < nodeIdUpdaters.length, _ + 1)(
Can you use a while loop to do this?
I second that, especially if it eliminates the dependence on spire (since spire is not used elsewhere in Spark).
Will do. I used it because spire was included somehow (maybe one of the dependent packages uses it).
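For illustration, the `cfor` call quoted above can be rewritten as a plain while loop, which removes the spire dependency. A minimal sketch, using a hypothetical stand-in for `nodeIdUpdaters` (only the loop shape matters here):

```scala
// Hypothetical stand-in for the per-tree updaters; illustrative only.
val nodeIdUpdaters = Array.fill(3)(scala.collection.mutable.Map.empty[Int, Int])

// Equivalent of cfor(0)(_ < nodeIdUpdaters.length, _ + 1)(...) without spire:
var treeId = 0
while (treeId < nodeIdUpdaters.length) {
  // ... the body that previously ran inside cfor, e.g. updating
  // the cached node id for tree `treeId` ...
  nodeIdUpdaters(treeId)(0) = treeId + 1 // placeholder work
  treeId += 1
}
```

A while loop compiles to the same tight bytecode as `cfor` here, so there is no performance cost to dropping the macro.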
@codedeft Thanks for your nice work. I have added some comments inline. Here are some high-level comments:
@chouqin Checkpointing is helpful since it is more persistent than persist(). Checkpointing stores data to HDFS (with replication), so that the RDD survives even if a worker dies. With persist(), part of the RDD will be lost when a worker dies. For my big decision tree tests, I do see EC2 workers die periodically (though not that often), and I am sure it is a bigger issue for corporate (big) clusters.
@codedeft Thanks for the PR! I'll make a pass now.
 * @param data The RDD of training rows.
 * @param cur The initial values in the cache
 *            (should be an Array of all 1's (meaning the root nodes)).
 * @param checkpointDir The checkpoint directory where
Currently, this skips checkpointing if checkpointDir == None. However, a user could set the SparkContext checkpointDir before calling DecisionTree. Can the behavior be changed as follows:
- If a checkpointDir is given here, then it should overwrite any preset checkpointDir in SparkContext.
- If no checkpointDir is given, then the code should check the SparkContext (via cur.sparkContext.getCheckpointDir) to see if one has already been set.
Will do.
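The precedence suggested above can be sketched as a pure function. Names here are hypothetical; in the actual code the fallback value would come from `cur.sparkContext.getCheckpointDir`:

```scala
// Sketch of the suggested behavior: an explicitly passed checkpointDir wins;
// otherwise fall back to whatever is already set on the SparkContext;
// checkpointing is skipped only when neither is set.
def resolveCheckpointDir(
    explicitDir: Option[String],
    contextDir: Option[String]): Option[String] =
  explicitDir.orElse(contextDir)
```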
I've addressed the comments. Please review at your convenience. I'll publish some big data results once they are actually done. Thanks!
Your test analysis is pretty convincing! Keeping the PR sounds good.
 * A given TreePoint would belong to a particular node per tree.
 * This is used to keep track of which node for a particular tree that a TreePoint belongs to.
 * A separate RDD of Array[Int] needs to be maintained and updated at each iteration.
 * @param nodeIdsForInstances The initial values in the cache
A bit unclear; perhaps update to: "For each TreePoint, an array over trees of the node index in each tree. (Initially, values should all be 1 for root node.)"
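The layout described in the suggested wording can be sketched as follows (sizes are illustrative, not from the PR):

```scala
// For each training instance, one node index per tree; every entry starts
// at 1, the id of the root node.
val numTrees = 3
val numInstances = 4
val nodeIdsForInstances: Array[Array[Int]] =
  Array.fill(numInstances)(Array.fill(numTrees)(1))
```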
Test build #22639 has started for PR 2868 at commit
LGTM Thanks!
Test build #22639 has finished for PR 2868 at commit
Test FAILed.
Some sort of YARN failure.
Test build #498 has started for PR 2868 at commit
Yeah, I'm also getting a YARN compilation failure on my machine after doing the latest pull. Is this happening to everyone?
Yep, apparently so, but someone's working on fixing it ASAP.
Test build #498 has finished for PR 2868 at commit
The conflict is caused by the GBoosting check-in. I'm taking a look.
Test build #22684 has started for PR 2868 at commit
@mengxr @jkbradley Can you merge this? This is the only way you can effectively train 10 large trees on the mnist8m dataset. With the node Id cache, it took a very long time, but we were able to finish training 10 trees on mnist8m in 15 hours with 20 executors. SF with local training can finish this in 20 minutes, so local training would be a must in the next release. However, without the node Id cache, it looks like it's not even possible: it's currently only 60% of the way there, and it's already taken 13 hours with dozens of fetch failures. I feel that it might eventually just fail because the models are just too big to pass around.
@CodeLeft I agree that local training should be a high priority. Just curious -- what's the depth of the tree in the failing case? I vote for merging this PR since there is no loss in performance for shallow trees and a gain in performance for deep trees.
Test build #22684 has finished for PR 2868 at commit
Test PASSed.
@manishamde The person you want to respond to is @codedeft. I'm not involved with this project. Our names are close, but off by one letter. Sorry for the intrusion; I'll see myself out.
@CodeLeft I am so sorry.
It finally finished: 10 trees, depth limit 30, on mnist8m with 20 executors, in 15 hours with the node Id cache.
I've merged this into master. Thanks @codedeft for adding node id caching, and @chouqin @manishamde @jkbradley for reviewing the code!
[SPARK-3161][MLLIB] Adding a node Id caching mechanism for training decision trees.

jkbradley mengxr chouqin Please review this.

Author: Sung Chung <schung@alpinenow.com>

Closes #2868 from codedeft/SPARK-3161 and squashes the following commits:

5f5a156 [Sung Chung] [SPARK-3161][MLLIB] Adding a node Id caching mechanism for training decision trees.
@codedeft The merge script didn't close this PR automatically. Could you help close it? Thanks!