[SPARK-18088][ML] Various ChiSqSelector cleanups #15647

Closed
wants to merge 5 commits into from
12 changes: 6 additions & 6 deletions docs/ml-features.md
@@ -1333,14 +1333,14 @@ for more details on the API.
`ChiSqSelector` stands for Chi-Squared feature selection. It operates on labeled data with
categorical features. ChiSqSelector uses the
[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
-features to choose. It supports three selection methods: `KBest`, `Percentile` and `FPR`:
+features to choose. It supports three selection methods: `numTopFeatures`, `percentile`, `fpr`:

-* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
-* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number.
-* `FPR` chooses all features whose false positive rate meets some threshold.
+* `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
+* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
+* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.

-By default, the selection method is `KBest`, the default number of top features is 50. User can use
-`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods.
+By default, the selection method is `numTopFeatures`, with the default number of top features set to 50.
+The user can choose a selection method using `setSelectorType`.

**Examples**

15 changes: 6 additions & 9 deletions docs/mllib-feature-extraction.md
@@ -227,22 +227,19 @@ both speed and statistical learning behavior.
[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) implements
Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the
[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
-features to choose. It supports three selection methods: `KBest`, `Percentile` and `FPR`:
+features to choose. It supports three selection methods: `numTopFeatures`, `percentile`, `fpr`:

-* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
-* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number.
-* `FPR` chooses all features whose false positive rate meets some threshold.
+* `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
+* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
+* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.

-By default, the selection method is `KBest`, the default number of top features is 50. User can use
-`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods.
+By default, the selection method is `numTopFeatures`, with the default number of top features set to 50.
+The user can choose a selection method using `setSelectorType`.

The number of features to select can be tuned using a held-out validation set.

### Model Fitting

-`ChiSqSelector` takes a `numTopFeatures` parameter specifying the number of top features that
-the selector will select.
-
The [`fit`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) method takes
an input of `RDD[LabeledPoint]` with categorical features, learns the summary statistics, and then
returns a `ChiSqSelectorModel` which can transform an input dataset into the reduced feature space.
@@ -42,69 +42,80 @@ private[feature] trait ChiSqSelectorParams extends Params
with HasFeaturesCol with HasOutputCol with HasLabelCol {

/**
* Number of features that selector will select (ordered by statistic value descending). If the
* Number of features that selector will select, ordered by ascending p-value. If the
* number of features is less than numTopFeatures, then this will select all features.
* Only applicable when selectorType = "kbest".
* Only applicable when selectorType = "numTopFeatures".
* The default value of numTopFeatures is 50.
*
* @group param
*/
@Since("1.6.0")
final val numTopFeatures = new IntParam(this, "numTopFeatures",
"Number of features that selector will select, ordered by statistics value descending. If the" +
"Number of features that selector will select, ordered by ascending p-value. If the" +
" number of features is < numTopFeatures, then this will select all features.",
ParamValidators.gtEq(1))
setDefault(numTopFeatures -> 50)

/** @group getParam */
@Since("1.6.0")
def getNumTopFeatures: Int = $(numTopFeatures)

/**
* Percentile of features that selector will select, ordered by statistics value descending.
* Only applicable when selectorType = "percentile".
* Default value is 0.1.
* @group param
*/
@Since("2.1.0")
final val percentile = new DoubleParam(this, "percentile",
"Percentile of features that selector will select, ordered by statistics value descending.",
"Percentile of features that selector will select, ordered by ascending p-value.",
ParamValidators.inRange(0, 1))
setDefault(percentile -> 0.1)

/** @group getParam */
@Since("2.1.0")
def getPercentile: Double = $(percentile)

/**
* The highest p-value for features to be kept.
* Only applicable when selectorType = "fpr".
* Default value is 0.05.
* @group param
*/
final val alpha = new DoubleParam(this, "alpha", "The highest p-value for features to be kept.",
final val fpr = new DoubleParam(this, "fpr", "The highest p-value for features to be kept.",
ParamValidators.inRange(0, 1))
setDefault(alpha -> 0.05)
setDefault(fpr -> 0.05)

/** @group getParam */
def getAlpha: Double = $(alpha)
def getFpr: Double = $(fpr)

/**
* The selector type of the ChisqSelector.
* Supported options: "kbest" (default), "percentile" and "fpr".
* Supported options: "numTopFeatures" (default), "percentile", "fpr".
* @group param
*/
@Since("2.1.0")
final val selectorType = new Param[String](this, "selectorType",
"The selector type of the ChisqSelector. " +
"Supported options: kbest (default), percentile and fpr.",
ParamValidators.inArray[String](OldChiSqSelector.supportedSelectorTypes.toArray))
setDefault(selectorType -> OldChiSqSelector.KBest)
"Supported options: " + OldChiSqSelector.supportedSelectorTypes.mkString(", "),
ParamValidators.inArray[String](OldChiSqSelector.supportedSelectorTypes))
setDefault(selectorType -> OldChiSqSelector.NumTopFeatures)

/** @group getParam */
@Since("2.1.0")
def getSelectorType: String = $(selectorType)
}

/**
* Chi-Squared feature selection, which selects categorical features to use for predicting a
* categorical label.
* The selector supports three selection methods: `kbest`, `percentile` and `fpr`.
* `kbest` chooses the `k` top features according to a chi-squared test.
* `percentile` is similar but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose false positive rate meets some threshold.
* By default, the selection method is `kbest`, the default number of top features is 50.
* The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`.
* - `numTopFeatures` chooses a fixed number of top features according to a chi-squared test.
* - `percentile` is similar but chooses a fraction of all features instead of a fixed number.
* - `fpr` chooses all features whose p-value is below a threshold, thus controlling the false
* positive rate of selection.
* By default, the selection method is `numTopFeatures`, with the default number of top features
* set to 50.
*/
@Since("1.6.0")
final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: String)
@@ -113,10 +124,6 @@ final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: Str
@Since("1.6.0")
def this() = this(Identifiable.randomUID("chiSqSelector"))

/** @group setParam */
@Since("2.1.0")
def setSelectorType(value: String): this.type = set(selectorType, value)

/** @group setParam */
@Since("1.6.0")
def setNumTopFeatures(value: Int): this.type = set(numTopFeatures, value)
@@ -127,7 +134,11 @@ final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: Str

/** @group setParam */
@Since("2.1.0")
def setAlpha(value: Double): this.type = set(alpha, value)
def setFpr(value: Double): this.type = set(fpr, value)

/** @group setParam */
@Since("2.1.0")
def setSelectorType(value: String): this.type = set(selectorType, value)

/** @group setParam */
@Since("1.6.0")
@@ -153,15 +164,15 @@ final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: Str
.setSelectorType($(selectorType))
.setNumTopFeatures($(numTopFeatures))
.setPercentile($(percentile))
.setAlpha($(alpha))
.setFpr($(fpr))
val model = selector.fit(input)
copyValues(new ChiSqSelectorModel(uid, model).setParent(this))
}

@Since("1.6.0")
override def transformSchema(schema: StructType): StructType = {
val otherPairs = OldChiSqSelector.supportedTypeAndParamPairs.filter(_._1 != $(selectorType))
otherPairs.foreach { case (_, paramName: String) =>
val otherPairs = OldChiSqSelector.supportedSelectorTypes.filter(_ != $(selectorType))
otherPairs.foreach { paramName: String =>
if (isSet(getParam(paramName))) {
logWarning(s"Param $paramName will take no effect when selector type = ${$(selectorType)}.")
}
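A note on why the simplified `transformSchema` still works: after the rename, each selector type string is also the name of the param that type reads, so the list of supported types doubles as the list of param names to check. The check can be sketched without Spark as below. This is only an illustration; `ParamWarningSketch`, `ignoredParamWarnings`, and the `explicitlySet` argument (a stand-in for `isSet(getParam(paramName))`) are hypothetical names, not part of the PR.

```scala
object ParamWarningSketch {
  // Selector type names as listed in the diff; each is also the name of the param it uses.
  val supportedSelectorTypes: Array[String] = Array("numTopFeatures", "percentile", "fpr")

  /**
   * Returns one warning message per param that was explicitly set but belongs
   * to a selector type other than the active one.
   * `explicitlySet` is a hypothetical stand-in for Spark's `isSet` check.
   */
  def ignoredParamWarnings(selectorType: String, explicitlySet: Set[String]): Seq[String] =
    supportedSelectorTypes.toSeq
      .filter(_ != selectorType)          // params owned by the other selector types
      .filter(explicitlySet.contains)     // keep only those the user actually set
      .map(name => s"Param $name will take no effect when selector type = $selectorType.")
}
```

Under this sketch, setting both `fpr` and `percentile` while the selector type is `fpr` yields a single warning about `percentile`.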
@@ -638,13 +638,13 @@ private[python] class PythonMLLibAPI extends Serializable {
selectorType: String,
numTopFeatures: Int,
percentile: Double,
alpha: Double,
fpr: Double,
data: JavaRDD[LabeledPoint]): ChiSqSelectorModel = {
new ChiSqSelector()
.setSelectorType(selectorType)
.setNumTopFeatures(numTopFeatures)
.setPercentile(percentile)
.setAlpha(alpha)
.setFpr(fpr)
.fit(data.rdd)
}

@@ -161,7 +161,7 @@ object ChiSqSelectorModel extends Loader[ChiSqSelectorModel] {
Loader.checkSchema[Data](dataFrame.schema)

val features = dataArray.rdd.map {
case Row(feature: Int) => (feature)
case Row(feature: Int) => feature
}.collect()

new ChiSqSelectorModel(features)
@@ -171,18 +171,20 @@ object ChiSqSelectorModel extends Loader[ChiSqSelectorModel] {

/**
* Creates a ChiSquared feature selector.
* The selector supports three selection methods: `kbest`, `percentile` and `fpr`.
* `kbest` chooses the `k` top features according to a chi-squared test.
* `percentile` is similar but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose false positive rate meets some threshold.
* By default, the selection method is `kbest`, the default number of top features is 50.
* The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`.
* - `numTopFeatures` chooses a fixed number of top features according to a chi-squared test.
* - `percentile` is similar but chooses a fraction of all features instead of a fixed number.
* - `fpr` chooses all features whose p-value is below a threshold, thus controlling the false
* positive rate of selection.
* By default, the selection method is `numTopFeatures`, with the default number of top features
* set to 50.
*/
@Since("1.3.0")
class ChiSqSelector @Since("2.1.0") () extends Serializable {
var numTopFeatures: Int = 50
var percentile: Double = 0.1
var alpha: Double = 0.05
var selectorType = ChiSqSelector.KBest
var fpr: Double = 0.05
var selectorType = ChiSqSelector.NumTopFeatures

/**
* This is the same as calling this() and setNumTopFeatures(numTopFeatures)
@@ -207,15 +209,15 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable {
}

@Since("2.1.0")
def setAlpha(value: Double): this.type = {
require(0.0 <= value && value <= 1.0, "Alpha must be in [0,1]")
alpha = value
def setFpr(value: Double): this.type = {
require(0.0 <= value && value <= 1.0, "FPR must be in [0,1]")
fpr = value
this
}

@Since("2.1.0")
def setSelectorType(value: String): this.type = {
require(ChiSqSelector.supportedSelectorTypes.toSeq.contains(value),
require(ChiSqSelector.supportedSelectorTypes.contains(value),
s"ChiSqSelector Type: $value was not supported.")
selectorType = value
this
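The practical effect of these setters can be seen in a small Spark-free stand-in. It is a sketch under stated assumptions: `SelectorTypesSketch` and `SelectorConfigSketch` are hypothetical names, and only the two `require` checks shown in the hunk are reproduced. It illustrates that the old `"kbest"` string is rejected once the supported types are `numTopFeatures`, `percentile` and `fpr`, and that `setFpr` validates only the range, not the active selector type.

```scala
// Hypothetical holder for the supported type names listed in the diff.
object SelectorTypesSketch {
  val supportedSelectorTypes: Array[String] = Array("numTopFeatures", "percentile", "fpr")
}

// Hypothetical, Spark-free stand-in for the setters in the hunk above.
class SelectorConfigSketch {
  var fpr: Double = 0.05
  var selectorType: String = "numTopFeatures"

  def setFpr(value: Double): this.type = {
    // `require` throws IllegalArgumentException when the condition is false.
    require(0.0 <= value && value <= 1.0, "FPR must be in [0,1]")
    fpr = value
    this
  }

  def setSelectorType(value: String): this.type = {
    require(SelectorTypesSketch.supportedSelectorTypes.contains(value),
      s"ChiSqSelector Type: $value was not supported.")
    selectorType = value
    this
  }
}
```

Because each setter returns `this.type`, calls can be chained, for example `new SelectorConfigSketch().setSelectorType("fpr").setFpr(0.01)`.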
@@ -232,7 +234,7 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable {
def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
val chiSqTestResult = Statistics.chiSqTest(data).zipWithIndex
val features = selectorType match {
case ChiSqSelector.KBest =>
case ChiSqSelector.NumTopFeatures =>
chiSqTestResult
.sortBy { case (res, _) => res.pValue }
.take(numTopFeatures)
@@ -242,7 +244,7 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable {
.take((chiSqTestResult.length * percentile).toInt)
case ChiSqSelector.FPR =>
chiSqTestResult
.filter { case (res, _) => res.pValue < alpha }
.filter { case (res, _) => res.pValue < fpr }
case errorType =>
throw new IllegalStateException(s"Unknown ChiSqSelector Type: $errorType")
}
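For readers following the `fit` hunks, the three branches can be sketched in plain Scala without Spark. This is an illustration only: `TestResult`, `SelectorSketch`, `selectIndices`, and the sample p-values in the usage note are hypothetical stand-ins for the `(ChiSqTestResult, index)` pairs produced by `Statistics.chiSqTest(data).zipWithIndex`. The percentile branch is assumed to sort by p-value like the `numTopFeatures` branch, because that part of the diff is collapsed.

```scala
// Hypothetical stand-in for a chi-squared test result; only the p-value matters here.
final case class TestResult(pValue: Double)

object SelectorSketch {
  val NumTopFeatures = "numTopFeatures"
  val Percentile = "percentile"
  val FPR = "fpr"

  /** Returns the indices of the selected features, mirroring the three branches of `fit`. */
  def selectIndices(
      results: Array[TestResult],
      selectorType: String,
      numTopFeatures: Int = 50,
      percentile: Double = 0.1,
      fpr: Double = 0.05): Array[Int] = {
    val indexed = results.zipWithIndex
    val chosen = selectorType match {
      case NumTopFeatures =>
        // Smallest p-values first, then keep a fixed number of features.
        indexed.sortBy { case (res, _) => res.pValue }.take(numTopFeatures)
      case Percentile =>
        // Assumed to sort like the branch above; the count is truncated by toInt.
        indexed.sortBy { case (res, _) => res.pValue }
          .take((indexed.length * percentile).toInt)
      case FPR =>
        // Keep every feature whose p-value is strictly below the threshold.
        indexed.filter { case (res, _) => res.pValue < fpr }
      case other =>
        throw new IllegalStateException(s"Unknown ChiSqSelector Type: $other")
    }
    chosen.map { case (_, index) => index }
  }
}
```

With six hypothetical p-values `0.20, 0.01, 0.04, 0.50, 0.03, 0.07`, asking for the top two features picks indices 1 and 4, while an `fpr` threshold of 0.05 keeps indices 1, 2 and 4. One consequence of the `toInt` truncation is that, in this sketch, `percentile = 0.1` over fewer than ten features selects nothing.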
@@ -251,22 +253,17 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable {
}
}

@Since("2.1.0")
object ChiSqSelector {
private[spark] object ChiSqSelector {

/** String name for `kbest` selector type. */
private[spark] val KBest: String = "kbest"
/** String name for `numTopFeatures` selector type. */
val NumTopFeatures: String = "numTopFeatures"

/** String name for `percentile` selector type. */
private[spark] val Percentile: String = "percentile"
val Percentile: String = "percentile"

/** String name for `fpr` selector type. */
private[spark] val FPR: String = "fpr"

/** Set of selector type and param pairs that ChiSqSelector supports. */
private[spark] val supportedTypeAndParamPairs = Set(KBest -> "numTopFeatures",
Percentile -> "percentile", FPR -> "alpha")

/** Set of selector types that ChiSqSelector supports. */
private[spark] val supportedSelectorTypes = supportedTypeAndParamPairs.map(_._1)
val supportedSelectorTypes: Array[String] = Array(NumTopFeatures, Percentile, FPR)
}