[MLLIB][tree] Verify size of input rdd > 0 when building meta data #5810
Conversation
Require a non-empty input RDD so that we can take the first LabeledPoint and get the feature size.
val numExamples = input.count()
require(numExamples > 0, s"DecisionTree requires size of input RDD > 0, " +
You should use isEmpty rather than count the whole data set. Does this help much? You get an exception either way, although this makes the message nicer, at the cost of non-trivial extra work.
At this stage, wouldn't the size already have to be positive? Have you encountered this in real life?
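To make the cost argument concrete, here is a minimal sketch that runs without Spark. `CountingSource` is a hypothetical stand-in for an RDD that records how many elements are read; it is not part of MLlib. It only illustrates why a check built on take(1) does far less work than a full count.

```scala
object EmptinessCheckSketch {
  // Hypothetical stand-in for a dataset: counts how many elements are actually read.
  final class CountingSource[A](data: Seq[A]) {
    var elementsRead: Long = 0L
    def iterator: Iterator[A] = data.iterator.map { x => elementsRead += 1; x }
  }

  // Emptiness via a full count: every element is read.
  def isEmptyByCount[A](src: CountingSource[A]): Boolean =
    src.iterator.size == 0

  // Emptiness via take(1): at most one element is read.
  def isEmptyByTake[A](src: CountingSource[A]): Boolean =
    src.iterator.take(1).toList.isEmpty
}
```

On a real RDD the gap is larger still, since count() runs a job over every partition while an emptiness check can stop as soon as it finds one element.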
Yes. After some filtering on the dataset, the resulting RDD turned out to be empty, which is unexpected but does happen. The exception stems from the nature of the dataset, but the Decision Tree algorithm does not catch it.
OK. I'm fine with this as long as it only uses isEmpty.
Changed it.
…ding meta data Use rdd api isEmpty
@@ -107,8 +107,11 @@ private[tree] object DecisionTreeMetadata extends Logging {
      numTrees: Int,
      featureSubsetStrategy: String): DecisionTreeMetadata = {

    val numFeatures = input.take(1)(0).features.size
    require(!input.isEmpty, s"DecisionTree requires size of input RDD > 0, " +
This will trigger take(1) twice. Should be
val numFeatures = input.map(_.features.size).take(1).headOption.getOrElse {
  throw new IllegalArgumentException("...")
}
Good point. I also think this line should be placed before "val numExamples = input.count()" so that it fails fast.
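The two points in this thread (a single take(1) that doubles as the emptiness check, evaluated before the count) can be sketched as follows. This is a Spark-free illustration under stated assumptions: `Point` is a hypothetical stand-in for LabeledPoint, a plain `Seq` stands in for the RDD, and the error message is a placeholder rather than the wording that was merged.

```scala
object NumFeaturesSketch {
  // Hypothetical stand-in for MLlib's LabeledPoint.
  final case class Point(label: Double, features: Vector[Double])

  // Returns (numFeatures, numExamples). The feature size is read from the
  // first element only, and an empty input fails before any counting happens.
  def summarize(input: Seq[Point]): (Int, Long) = {
    val numFeatures = input.iterator
      .map(_.features.size)
      .take(1)
      .toList
      .headOption
      .getOrElse {
        // Placeholder message, not the one used in the actual patch.
        throw new IllegalArgumentException("input must not be empty")
      }
    // Only reached for non-empty input; stands in for input.count().
    val numExamples = input.size.toLong
    (numFeatures, numExamples)
  }
}
```

Folding the check into the evaluation of numFeatures means the first element is fetched once, and an empty input raises IllegalArgumentException without paying for a full pass over the data.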
@AIHE Can you incorporate Xiangrui's last suggestion? Let's get this in then.
test this please
Test build #31797 has finished for PR 5810.
Jenkins, retest this please.
Test build #31872 has finished for PR 5810.
Jenkins, retest this please.
Test build #31879 has finished for PR 5810.
Going to merge with a small message change, "by empty" -> "an empty"
Require non empty input rdd such that we can take the first labeledpoint and get the feature size

Author: Alain <aihe@usc.edu>
Author: aihe@usc.edu <aihe@usc.edu>

Closes apache#5810 from AiHe/decisiontree-issue and squashes the following commits:

3b1d08a [aihe@usc.edu] [MLLIB][tree] merge the assertion into the evaluation of numFeatures
cf2e567 [Alain] [MLLIB][tree] Use a rdd api to verify size of input rdd > 0 when building meta data
b448f47 [Alain] [MLLIB][tree] Verify size of input rdd > 0 when building meta data