[MLLIB][tree] Verify size of input rdd > 0 when building meta data #5810

Closed
wants to merge 3 commits into from


@aihex aihex commented Apr 30, 2015

Require a non-empty input RDD so that we can take the first LabeledPoint and get the feature size.

val numExamples = input.count()
require(numExamples > 0, s"DecisionTree requires size of input RDD > 0, " +
Member

You should use isEmpty rather than counting the whole data set. Does this help much? You get an exception either way, although this makes the message nicer, at the cost of non-trivial extra work.

At this stage, wouldn't the size already have to be positive? Have you encountered this in real life?
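
To make the cost difference concrete, here is a minimal sketch, assuming Spark 1.3 or later (where RDD.isEmpty is available), spark-mllib on the classpath, and a local master. The object name EmptyCheckSketch and its helper names are hypothetical and not part of the patch.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper object for illustration; not part of the patch.
object EmptyCheckSketch {
  // Runs a job over every partition just to learn whether any element exists.
  def emptyViaCount(input: RDD[LabeledPoint]): Boolean = input.count() == 0

  // Built on take(1), so it can stop as soon as a single element is found.
  def emptyViaIsEmpty(input: RDD[LabeledPoint]): Boolean = input.isEmpty()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("empty-check-sketch")
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize(Seq(
        LabeledPoint(1.0, Vectors.dense(0.5, 1.5)),
        LabeledPoint(0.0, Vectors.dense(2.0, 3.0))))
      // A filter that happens to remove every row.
      val filtered = data.filter(_.label > 5.0)
      println(s"count-based check:   ${emptyViaCount(filtered)}")
      println(s"isEmpty-based check: ${emptyViaIsEmpty(filtered)}")
    } finally {
      sc.stop()
    }
  }
}
```

Both checks agree on the result. The difference is that on a non-empty RDD isEmpty can return after finding one element, whereas count always visits every partition.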

Author

Yes. After some filtering on the dataset, the resulting RDD turned out to be empty, which is unexpected but does happen. The exception comes from the nature of the dataset, but the Decision Tree algorithm does not catch it.

Member

OK. I'm fine with this as long as it only uses isEmpty.

Author

Changed it.

@@ -107,8 +107,11 @@ private[tree] object DecisionTreeMetadata extends Logging {
numTrees: Int,
featureSubsetStrategy: String): DecisionTreeMetadata = {

val numFeatures = input.take(1)(0).features.size
require(!input.isEmpty, s"DecisionTree requires size of input RDD > 0, " +
Contributor

This will trigger take(1) twice. It should be:

val numFeatures = input.map(_.features.size).take(1).headOption.getOrElse {
  throw new IllegalArgumentException("...")
}
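
A self-contained sketch of this single-pass check follows, assuming Spark 1.3 or later with spark-mllib on the classpath and a local master. The names NumFeaturesSketch and numFeaturesOrFail and the error message text are hypothetical; the wording that was actually merged is not shown in this thread.

```scala
import scala.util.Try

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper object for illustration; not part of the patch.
object NumFeaturesSketch {
  // One take(1) both detects an empty RDD and yields the feature count.
  def numFeaturesOrFail(input: RDD[LabeledPoint]): Int =
    input.map(_.features.size).take(1).headOption.getOrElse {
      // Hypothetical message text.
      throw new IllegalArgumentException(
        "input RDD is empty; cannot determine the number of features")
    }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("num-features-sketch")
    val sc = new SparkContext(conf)
    try {
      val nonEmpty = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(0.5, 1.5, 2.5))))
      println(s"numFeatures: ${numFeaturesOrFail(nonEmpty)}")

      // An empty RDD fails on the driver before any full pass over the data.
      val empty = sc.emptyRDD[LabeledPoint]
      println(s"empty input fails: ${Try(numFeaturesOrFail(empty)).isFailure}")
    } finally {
      sc.stop()
    }
  }
}
```

Because the map is lazy and take(1) only pulls one element, the feature size is computed for a single row rather than for the whole data set.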

Author

Good point. I also think this line should be placed before "val numExamples = input.count()" so that it fails fast.

Member

srowen commented May 3, 2015

@AIHE Can you incorporate Xiangrui's last suggestion? Let's get this in then.

@aihex aihex closed this May 4, 2015
@aihex aihex force-pushed the decisiontree-issue branch from cf2e567 to d188b8b Compare May 4, 2015 23:16
@aihex aihex reopened this May 4, 2015
Contributor

mengxr commented May 4, 2015

test this please

SparkQA commented May 5, 2015

Test build #31797 has finished for PR 5810 at commit 3b1d08a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

srowen commented May 5, 2015

Jenkins, retest this please.

SparkQA commented May 5, 2015

Test build #31872 has finished for PR 5810 at commit 3b1d08a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

srowen commented May 5, 2015

Jenkins, retest this please.

SparkQA commented May 5, 2015

Test build #31879 has finished for PR 5810 at commit 3b1d08a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

srowen commented May 5, 2015

Going to merge with a small message change, "by empty" -> "an empty"

@asfgit asfgit closed this in d4cb38a May 5, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
Require non empty input rdd such that we can take the first labeledpoint and get the feature size

Author: Alain <aihe@usc.edu>
Author: aihe@usc.edu <aihe@usc.edu>

Closes apache#5810 from AiHe/decisiontree-issue and squashes the following commits:

3b1d08a [aihe@usc.edu] [MLLIB][tree] merge the assertion into the evaluation of numFeatures
cf2e567 [Alain] [MLLIB][tree] Use a rdd api to verify size of input rdd > 0 when building meta data
b448f47 [Alain] [MLLIB][tree] Verify size of input rdd > 0 when building meta data
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015