[SPARK-3161][MLLIB] Adding a node Id caching mechanism for training decision trees. #2868
Conversation
Can one of the admins verify this patch?
Jenkins, please start the test!
Jenkins, add to whitelist.
test this please
QA tests have started for PR 2868 at commit
QA tests have finished for PR 2868 at commit
Test FAILed.
Seems like there are lots of "line too long" messages. Will address this.
test this please
QA tests have started for PR 2868 at commit
QA tests have started for PR 2868 at commit
QA tests have finished for PR 2868 at commit
Test PASSed.
QA tests have finished for PR 2868 at commit
Test PASSed.
    bins: Array[Array[Bin]]): Unit = {
  val updatedRDD = data.zip(cur).map {
    dataPoint => {
      cfor(0)(_ < nodeIdUpdaters.length, _ + 1)(
Can you use a while loop to do this?
I second that, especially if it eliminates the dependence on spire (since spire is not used elsewhere in Spark).
Will do. I used it because spire was included somehow (maybe one of the dependent packages uses it).
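For illustration, the `cfor` call quoted above can be rewritten as a plain while loop, which removes the spire dependency. A minimal sketch, using a hypothetical stand-in for `nodeIdUpdaters` (only the loop shape matters here):

```scala
// Hypothetical stand-in for the per-tree updaters; illustrative only.
val nodeIdUpdaters = Array.fill(3)(scala.collection.mutable.Map.empty[Int, Int])

// Equivalent of cfor(0)(_ < nodeIdUpdaters.length, _ + 1)(...) without spire:
var treeId = 0
while (treeId < nodeIdUpdaters.length) {
  // ... the body that previously ran inside cfor, e.g. updating
  // the cached node id for tree `treeId` ...
  nodeIdUpdaters(treeId)(0) = treeId + 1 // placeholder work
  treeId += 1
}
```

A while loop compiles to the same tight bytecode as `cfor` here, so there is no performance cost to dropping the macro.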
@codedeft Thanks for your nice work. I have added some comments inline. Here are some high-level comments:
@chouqin Checkpointing is helpful since it is more persistent than persist(). Checkpointing stores data to HDFS (with replication), so that the RDD survives even if a worker dies. With persist(), part of the RDD will be lost when a worker dies. For my big decision tree tests, I do see EC2 workers die periodically (though not that often), and I am sure it is a bigger issue for corporate (big) clusters.
@codedeft Thanks for the PR! I'll make a pass now.
 * @param data The RDD of training rows.
 * @param cur The initial values in the cache
 *            (should be an Array of all 1's (meaning the root nodes)).
 * @param checkpointDir The checkpoint directory where
Currently, this skips checkpointing if checkpointDir == None. However, a user could set the SparkContext checkpointDir before calling DecisionTree. Can the behavior be changed as follows:
- If a checkpointDir is given here, then it should overwrite any preset checkpointDir in SparkContext.
- If no checkpointDir is given, then the code should check the SparkContext (via cur.sparkContext.getCheckpointDir) to see if one has already been set.
Will do.
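The precedence suggested above can be sketched as a pure function. Names here are hypothetical; in the actual code the fallback value would come from `cur.sparkContext.getCheckpointDir`:

```scala
// Sketch of the suggested behavior: an explicitly passed checkpointDir wins;
// otherwise fall back to whatever is already set on the SparkContext;
// checkpointing is skipped only when neither is set.
def resolveCheckpointDir(
    explicitDir: Option[String],
    contextDir: Option[String]): Option[String] =
  explicitDir.orElse(contextDir)
```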
I've addressed the comments. Please review at your convenience. I'll publish some big data results once they are actually done. Thanks!
Your test analysis is pretty convincing! Keeping the PR sounds good.
 * A given TreePoint would belong to a particular node per tree.
 * This is used to keep track of which node for a particular tree that a TreePoint belongs to.
 * A separate RDD of Array[Int] needs to be maintained and updated at each iteration.
 * @param nodeIdsForInstances The initial values in the cache
A bit unclear; perhaps update to: "For each TreePoint, an array over trees of the node index in each tree. (Initially, values should all be 1 for root node.)"
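The layout described in the suggested wording can be sketched as follows (sizes are illustrative, not from the PR):

```scala
// For each training instance, one node index per tree; every entry starts
// at 1, the id of the root node.
val numTrees = 3
val numInstances = 4
val nodeIdsForInstances: Array[Array[Int]] =
  Array.fill(numInstances)(Array.fill(numTrees)(1))
```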
Test build #22639 has started for PR 2868 at commit
LGTM Thanks!
Test build #22639 has finished for PR 2868 at commit
Test FAILed.
Some sort of YARN failure.
Test build #498 has started for PR 2868 at commit
Yeah, I'm also getting a YARN compilation failure on my machine after doing the latest pull. Is this happening to everyone?
Yep, apparently so, but someone's working on fixing it ASAP.
Test build #498 has finished for PR 2868 at commit
The conflict is caused by the GBoosting check-in. I'm taking a look.
Test build #22684 has started for PR 2868 at commit
@mengxr @jkbradley Can you merge this? This is the only way you can effectively train 10 large trees on the mnist8m dataset. With the node Id cache, it took a very long time, but we were able to finish training 10 trees on mnist8m in 15 hours with 20 executors. SF with local training can finish this in 20 minutes, so local training would be a must in the next release. However, without the node Id cache, it looks like it's not even possible: it's currently only 60% of the way there, and it's already taken 13 hours with dozens of fetch failures. I feel that it might eventually just fail because the models are just too big to pass around.
@CodeLeft I agree that local training should be a high priority. Just curious -- what's the depth of the tree in the failing case? I vote for merging this PR since there is no loss in performance for shallow trees and a gain in performance for deep trees.
Test build #22684 has finished for PR 2868 at commit
Test PASSed.
@manishamde The person you want to respond to is @codedeft. I'm not involved with this project. Our names are close, but off by one letter. Sorry for the intrusion; I'll see myself out.
@CodeLeft I am so sorry.
It finally finished: 10 trees, depth limit 30, on mnist8m with 20 executors, in 15 hours with the node Id cache.
I've merged this into master. Thanks @codedeft for adding node id caching, and @chouqin @manishamde @jkbradley for reviewing the code!
[SPARK-3161][MLLIB] Adding a node Id caching mechanism for training decision trees.

jkbradley mengxr chouqin Please review this.

Author: Sung Chung <schung@alpinenow.com>

Closes #2868 from codedeft/SPARK-3161 and squashes the following commits:

5f5a156 [Sung Chung] [SPARK-3161][MLLIB] Adding a node Id caching mechanism for training decision trees.
@codedeft The merge script didn't close this PR automatically. Could you help close it? Thanks!