-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-18088][ML] Various ChiSqSelector cleanups #15647
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
* Renamed kbest to numTopFeatures * Renamed alpha to fpr * Added missing Since annotations * Doc cleanups * Added missing standard unit tests
CC: @mpjlu @srowen @yanboliang |
Test build #67590 has finished for PR 15647 at commit
|
Test build #67593 has finished for PR 15647 at commit
|
Test build #67596 has finished for PR 15647 at commit
|
test this please |
Test build #67599 has finished for PR 15647 at commit
|
Test build #67597 has finished for PR 15647 at commit
|
I'll fix these later tonight |
Test build #67626 has finished for PR 15647 at commit
|
* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power. | ||
* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number. | ||
* `FPR` chooses all features whose false positive rate meets some threshold. | ||
* `numTopFeatures` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should use k here since KBest is changed to numTopFeatures?
By default, the selection method is `KBest`, the default number of top features is 50. User can use | ||
`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods. | ||
By default, the selection method is `numTopFeatures`, with the default number of top features set to 50. User can use | ||
`setNumTopFeatures`, `setPercentile` and `setFpr` to set different selection methods. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
setNumTopFeatures, setPercentile, setFpr just set parameter, setSelectorType is used to set selection methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we use setSelectorType
to switch between different method. Actually the document is incorrect before this PR.
* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power. | ||
* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number. | ||
* `FPR` chooses all features whose false positive rate meets some threshold. | ||
* `numTopFeatures` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
By default, the selection method is `KBest`, the default number of top features is 50. User can use | ||
`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods. | ||
By default, the selection method is `numTopFeatures`, with the default number of top features set to 50. User can use | ||
`setNumTopFeatures`, `setPercentile`, `fpr` to set different selection methods. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
* The default value of numTopFeatures is 50. | ||
* | ||
* @group param | ||
*/ | ||
@Since("1.6.0") | ||
final val numTopFeatures = new IntParam(this, "numTopFeatures", | ||
"Number of features that selector will select, ordered by statistics value descending. If the" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ordered by pValue ascending. I missed this change in my PR.
*/ | ||
@Since("2.1.0") | ||
final val percentile = new DoubleParam(this, "percentile", | ||
"Percentile of features that selector will select, ordered by statistics value descending.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ordered by pValue ascending.
* `fpr` chooses all features whose false positive rate meets some threshold. | ||
* By default, the selection method is `kbest`, the default number of top features is 50. | ||
* The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`. | ||
* - `numTopFeatures` chooses the `k` top features according to a chi-squared test. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
* `fpr` chooses all features whose false positive rate meets some threshold. | ||
* By default, the selection method is `kbest`, the default number of top features is 50. | ||
* The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`. | ||
* - `numTopFeatures` chooses the `k` top features according to a chi-squared test. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
} | ||
|
||
test("ChiSqSelector by FPR transform test (sparse & dense vector)") { | ||
test("ChiSqSelector by percentile transform test (sparse & dense vector)") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be FPR
@@ -2624,29 +2624,30 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, Ja | |||
"will select, ordered by statistics value descending.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ordered by pValue ascending
By default, the selection method is `KBest`, the default number of top features is 50. User can use | ||
`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods. | ||
By default, the selection method is `numTopFeatures`, with the default number of top features set to 50. User can use | ||
`setNumTopFeatures`, `setPercentile` and `setFpr` to set different selection methods. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we use setSelectorType
to switch between different method. Actually the document is incorrect before this PR.
@@ -44,67 +44,78 @@ private[feature] trait ChiSqSelectorParams extends Params | |||
/** | |||
* Number of features that selector will select (ordered by statistic value descending). If the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ordered by pValue ascending
Updated---thanks! |
Test build #67672 has finished for PR 15647 at commit
|
Any more comments, or shall I merge this? |
test this please |
Test build #67923 has finished for PR 15647 at commit
|
I'll go ahead and merge this, but please comment if it needs any follow-ups. Merging with master Thanks for the review! |
## What changes were proposed in this pull request? - Renamed kbest to numTopFeatures - Renamed alpha to fpr - Added missing Since annotations - Doc cleanups ## How was this patch tested? Added new standardized unit tests for spark.ml. Improved existing unit test coverage a bit. Author: Joseph K. Bradley <joseph@databricks.com> Closes apache#15647 from jkbradley/chisqselector-follow-ups.
What changes were proposed in this pull request?
How was this patch tested?
Added new standardized unit tests for spark.ml.
Improved existing unit test coverage a bit.