[SPARK-18088][ML] Various ChiSqSelector cleanups #15647

jkbradley · 2016-10-26T17:45:45Z

What changes were proposed in this pull request?

Renamed kbest to numTopFeatures
Renamed alpha to fpr
Added missing Since annotations
Doc cleanups

How was this patch tested?

Added new standardized unit tests for spark.ml.
Improved existing unit test coverage a bit.


jkbradley · 2016-10-26T17:46:24Z

CC: @mpjlu @srowen @yanboliang

SparkQA · 2016-10-26T17:51:08Z

Test build #67590 has finished for PR 15647 at commit 10f8b26.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-10-26T19:26:48Z

Test build #67593 has finished for PR 15647 at commit 1ead234.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-10-26T20:28:28Z

Test build #67596 has finished for PR 15647 at commit 77f05ef.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-10-26T20:32:13Z

test this please

SparkQA · 2016-10-26T21:23:44Z

Test build #67599 has finished for PR 15647 at commit 77f05ef.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-10-26T21:25:29Z

Test build #67597 has finished for PR 15647 at commit 77f05ef.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-10-26T23:59:36Z

I'll fix these later tonight

SparkQA · 2016-10-27T06:42:24Z

Test build #67626 has finished for PR 15647 at commit 8b3ed65.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mpjlu · 2016-10-27T10:38:10Z

docs/ml-features.md

-* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
-* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number.
-* `FPR` chooses all features whose false positive rate meets some threshold.
+* `numTopFeatures` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.


Should we still use `k` here, since `KBest` is changed to `numTopFeatures`?
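For readers following the doc text under review, here is a minimal pure-Python sketch of the three selection methods it describes. It is not Spark's implementation: `select_features` and its arguments are hypothetical names, the `percentile` and `fpr` defaults are assumed for illustration, the strict `<` comparison for `fpr` is an assumption, and the p-values are made up.

```python
def select_features(p_values, selector_type="numTopFeatures",
                    num_top_features=50, percentile=0.1, fpr=0.05):
    """Return the sorted indices of the selected features.

    Hypothetical helper for illustration only, not Spark's code.
    """
    # Rank feature indices by p-value, smallest (most significant) first.
    ranked = sorted(range(len(p_values)), key=lambda i: p_values[i])
    if selector_type == "numTopFeatures":
        # Keep a fixed number of top-ranked features.
        chosen = ranked[:num_top_features]
    elif selector_type == "percentile":
        # Keep a fraction of all features instead of a fixed number.
        chosen = ranked[:int(len(p_values) * percentile)]
    elif selector_type == "fpr":
        # Keep every feature whose p-value is below the threshold (assumed strict).
        chosen = [i for i in ranked if p_values[i] < fpr]
    else:
        raise ValueError("unknown selector type: %s" % selector_type)
    return sorted(chosen)


# Made-up p-values for five features.
p_values = [0.20, 0.001, 0.04, 0.60, 0.03]
print(select_features(p_values, "numTopFeatures", num_top_features=2))
print(select_features(p_values, "percentile", percentile=0.5))
print(select_features(p_values, "fpr", fpr=0.05))
```

Only the `fpr` method can return a number of features that depends on the data; the other two fix the count up front.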

mpjlu · 2016-10-27T10:39:38Z

docs/ml-features.md

-By default, the selection method is `KBest`, the default number of top features is 50. User can use
-`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods.
+By default, the selection method is `numTopFeatures`, with the default number of top features set to 50. User can use
+`setNumTopFeatures`, `setPercentile` and `setFpr` to set different selection methods.


`setNumTopFeatures`, `setPercentile`, and `setFpr` only set parameters; `setSelectorType` is what sets the selection method.


mpjlu · 2016-10-27T10:40:06Z

docs/mllib-feature-extraction.md

-* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
-* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number.
-* `FPR` chooses all features whose false positive rate meets some threshold.
+* `numTopFeatures` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.


mpjlu · 2016-10-27T10:40:19Z

docs/mllib-feature-extraction.md

-By default, the selection method is `KBest`, the default number of top features is 50. User can use
-`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods.
+By default, the selection method is `numTopFeatures`, with the default number of top features set to 50. User can use
+`setNumTopFeatures`, `setPercentile`, `fpr` to set different selection methods.


mpjlu · 2016-10-27T10:45:46Z

mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala

   * The default value of numTopFeatures is 50.
   *
   * @group param
   */
+  @Since("1.6.0")
  final val numTopFeatures = new IntParam(this, "numTopFeatures",
    "Number of features that selector will select, ordered by statistics value descending. If the" +


This should say "ordered by pValue ascending". I missed this change in my PR.
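A small sketch of why the wording can matter, under the assumption that features may have different degrees of freedom: ranking by statistic descending and ranking by p-value ascending then need not agree. The helper `chi2_sf` is hypothetical and only covers the closed-form cases df = 1, 2, 4; the statistics are made up.

```python
import math


def chi2_sf(x, df):
    """Chi-squared survival function P(X > x), closed forms for df in {1, 2, 4}."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    if df == 4:
        return math.exp(-x / 2.0) * (1.0 + x / 2.0)
    raise ValueError("this sketch only supports df in {1, 2, 4}")


# (feature name, chi-squared statistic, degrees of freedom); made-up values.
features = [("a", 6.0, 1), ("b", 7.0, 4)]

# Ranking by statistic, largest first.
by_statistic = [f[0] for f in sorted(features, key=lambda f: -f[1])]
# Ranking by p-value, smallest first.
by_p_value = [f[0] for f in sorted(features, key=lambda f: chi2_sf(f[1], f[2]))]

print(by_statistic)
print(by_p_value)
```

Feature `b` has the larger statistic, but with four degrees of freedom its p-value is larger than that of `a`, so the two rankings disagree.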

mpjlu · 2016-10-27T10:46:12Z

mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala

   */
+  @Since("2.1.0")
  final val percentile = new DoubleParam(this, "percentile",
    "Percentile of features that selector will select, ordered by statistics value descending.",


ordered by pValue ascending.

mpjlu · 2016-10-27T10:47:22Z

mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala

- * `fpr` chooses all features whose false positive rate meets some threshold.
- * By default, the selection method is `kbest`, the default number of top features is 50.
+ * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`.
+ *  - `numTopFeatures` chooses the `k` top features according to a chi-squared test.


mpjlu · 2016-10-27T10:48:24Z

mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala

- * `fpr` chooses all features whose false positive rate meets some threshold.
- * By default, the selection method is `kbest`, the default number of top features is 50.
+ * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`.
+ *  - `numTopFeatures` chooses the `k` top features according to a chi-squared test.


mpjlu · 2016-10-27T10:53:43Z

mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala

  }

-  test("ChiSqSelector by FPR transform test (sparse & dense vector)") {
+  test("ChiSqSelector by percentile transform test (sparse & dense vector)") {


should be FPR

mpjlu · 2016-10-27T10:54:53Z

python/pyspark/ml/feature.py

@@ -2624,29 +2624,30 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, Ja
                       "will select, ordered by statistics value descending.",


ordered by pValue ascending

yanboliang · 2016-10-27T14:52:26Z

docs/ml-features.md

-By default, the selection method is `KBest`, the default number of top features is 50. User can use
-`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods.
+By default, the selection method is `numTopFeatures`, with the default number of top features set to 50. User can use
+`setNumTopFeatures`, `setPercentile` and `setFpr` to set different selection methods.


Yeah, we use `setSelectorType` to switch between the different methods. Actually, the document was already incorrect before this PR.
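To make the distinction concrete, here is a hypothetical pure-Python stand-in for the parameter handling discussed above. It is not Spark code: `SelectorSketch` and its snake_case methods are invented for illustration, and the `percentile` and `fpr` defaults are assumptions. The point is only that a value setter stores a value, while the selector-type setter decides which method is active.

```python
class SelectorSketch:
    """Hypothetical stand-in for a selector's parameters (not Spark code)."""

    METHODS = ("numTopFeatures", "percentile", "fpr")

    def __init__(self):
        self.selector_type = "numTopFeatures"  # default method per the doc text
        self.num_top_features = 50             # default per the doc text
        self.percentile = 0.1                  # assumed default
        self.fpr = 0.05                        # assumed default

    def set_percentile(self, value):
        # Stores the value only; the active method is unchanged.
        self.percentile = value
        return self

    def set_fpr(self, value):
        # Stores the value only; the active method is unchanged.
        self.fpr = value
        return self

    def set_selector_type(self, value):
        # This is the only call that switches the selection method.
        if value not in self.METHODS:
            raise ValueError("unknown selector type: %s" % value)
        self.selector_type = value
        return self


selector = SelectorSketch().set_percentile(0.25)
print(selector.selector_type)  # still "numTopFeatures"
selector.set_selector_type("percentile")
print(selector.selector_type)  # now "percentile"
```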

yanboliang · 2016-10-27T14:54:32Z

mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala

@@ -44,67 +44,78 @@ private[feature] trait ChiSqSelectorParams extends Params
  /**
   * Number of features that selector will select (ordered by statistic value descending). If the


ordered by pValue ascending

jkbradley · 2016-10-27T22:11:20Z

Updated---thanks!

SparkQA · 2016-10-27T23:14:22Z

Test build #67672 has finished for PR 15647 at commit 7d3c74c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-11-01T21:08:18Z

Any more comments, or shall I merge this?

jkbradley · 2016-11-01T21:08:26Z

test this please

SparkQA · 2016-11-01T22:15:46Z

Test build #67923 has finished for PR 15647 at commit 7d3c74c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-11-01T23:59:02Z

I'll go ahead and merge this, but please comment if it needs any follow-ups.

Merging with master

Thanks for the review!

## What changes were proposed in this pull request?

- Renamed kbest to numTopFeatures
- Renamed alpha to fpr
- Added missing Since annotations
- Doc cleanups

## How was this patch tested?

Added new standardized unit tests for spark.ml.
Improved existing unit test coverage a bit.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes apache#15647 from jkbradley/chisqselector-follow-ups.

Various ChiSqSelector cleanups:

10f8b26

* Renamed kbest to numTopFeatures
* Renamed alpha to fpr
* Added missing Since annotations
* Doc cleanups
* Added missing standard unit tests

python lint fix

1ead234

fix python tests

77f05ef

actually fixed python tests

8b3ed65

mpjlu reviewed Oct 27, 2016


yanboliang reviewed Oct 27, 2016


code review fixes

7d3c74c

asfgit closed this in 91c33a0 Nov 2, 2016

jkbradley deleted the chisqselector-follow-ups branch November 2, 2016 02:18
