Skip to content

[SPARK-18088][ML] Various ChiSqSelector cleanups #15647

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

jkbradley
Copy link
Member

What changes were proposed in this pull request?

  • Renamed kbest to numTopFeatures
  • Renamed alpha to fpr
  • Added missing Since annotations
  • Doc cleanups

How was this patch tested?

Added new standardized unit tests for spark.ml.
Improved existing unit test coverage a bit.

* Renamed kbest to numTopFeatures
* Renamed alpha to fpr
* Added missing Since annotations
* Doc cleanups
* Added missing standard unit tests
@jkbradley
Copy link
Member Author

CC: @mpjlu @srowen @yanboliang

@SparkQA
Copy link

SparkQA commented Oct 26, 2016

Test build #67590 has finished for PR 15647 at commit 10f8b26.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 26, 2016

Test build #67593 has finished for PR 15647 at commit 1ead234.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 26, 2016

Test build #67596 has finished for PR 15647 at commit 77f05ef.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member Author

test this please

@SparkQA
Copy link

SparkQA commented Oct 26, 2016

Test build #67599 has finished for PR 15647 at commit 77f05ef.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 26, 2016

Test build #67597 has finished for PR 15647 at commit 77f05ef.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member Author

I'll fix these later tonight

@SparkQA
Copy link

SparkQA commented Oct 27, 2016

Test build #67626 has finished for PR 15647 at commit 8b3ed65.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number.
* `FPR` chooses all features whose false positive rate meets some threshold.
* `numTopFeatures` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use k here since KBest is changed to numTopFeatures?

By default, the selection method is `KBest`, the default number of top features is 50. User can use
`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods.
By default, the selection method is `numTopFeatures`, with the default number of top features set to 50. User can use
`setNumTopFeatures`, `setPercentile` and `setFpr` to set different selection methods.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setNumTopFeatures, setPercentile, setFpr just set parameter, setSelectorType is used to set selection methods.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we use setSelectorType to switch between different method. Actually the document is incorrect before this PR.

* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number.
* `FPR` chooses all features whose false positive rate meets some threshold.
* `numTopFeatures` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

By default, the selection method is `KBest`, the default number of top features is 50. User can use
`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods.
By default, the selection method is `numTopFeatures`, with the default number of top features set to 50. User can use
`setNumTopFeatures`, `setPercentile`, `fpr` to set different selection methods.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

* The default value of numTopFeatures is 50.
*
* @group param
*/
@Since("1.6.0")
final val numTopFeatures = new IntParam(this, "numTopFeatures",
"Number of features that selector will select, ordered by statistics value descending. If the" +
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ordered by pValue ascending. I missed this change in my PR.

*/
@Since("2.1.0")
final val percentile = new DoubleParam(this, "percentile",
"Percentile of features that selector will select, ordered by statistics value descending.",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ordered by pValue ascending.

* `fpr` chooses all features whose false positive rate meets some threshold.
* By default, the selection method is `kbest`, the default number of top features is 50.
* The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`.
* - `numTopFeatures` chooses the `k` top features according to a chi-squared test.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

* `fpr` chooses all features whose false positive rate meets some threshold.
* By default, the selection method is `kbest`, the default number of top features is 50.
* The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`.
* - `numTopFeatures` chooses the `k` top features according to a chi-squared test.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

}

test("ChiSqSelector by FPR transform test (sparse & dense vector)") {
test("ChiSqSelector by percentile transform test (sparse & dense vector)") {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be FPR

@@ -2624,29 +2624,30 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, Ja
"will select, ordered by statistics value descending.",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ordered by pValue ascending

By default, the selection method is `KBest`, the default number of top features is 50. User can use
`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods.
By default, the selection method is `numTopFeatures`, with the default number of top features set to 50. User can use
`setNumTopFeatures`, `setPercentile` and `setFpr` to set different selection methods.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we use setSelectorType to switch between different method. Actually the document is incorrect before this PR.

@@ -44,67 +44,78 @@ private[feature] trait ChiSqSelectorParams extends Params
/**
* Number of features that selector will select (ordered by statistic value descending). If the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ordered by pValue ascending

@jkbradley
Copy link
Member Author

Updated---thanks!

@SparkQA
Copy link

SparkQA commented Oct 27, 2016

Test build #67672 has finished for PR 15647 at commit 7d3c74c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member Author

Any more comments, or shall I merge this?

@jkbradley
Copy link
Member Author

test this please

@SparkQA
Copy link

SparkQA commented Nov 1, 2016

Test build #67923 has finished for PR 15647 at commit 7d3c74c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member Author

I'll go ahead and merge this, but please comment if it needs any follow-ups.

Merging with master

Thanks for the review!

@asfgit asfgit closed this in 91c33a0 Nov 2, 2016
@jkbradley jkbradley deleted the chisqselector-follow-ups branch November 2, 2016 02:18
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?
- Renamed kbest to numTopFeatures
- Renamed alpha to fpr
- Added missing Since annotations
- Doc cleanups
## How was this patch tested?

Added new standardized unit tests for spark.ml.
Improved existing unit test coverage a bit.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes apache#15647 from jkbradley/chisqselector-follow-ups.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants