[SPARK-20619][ML] StringIndexer supports multiple ways to order label #17879

actuaryzhang · 2017-05-06T06:59:36Z

What changes were proposed in this pull request?

StringIndexer maps labels to numbers according to the descending order of label frequency. Other types of ordering (e.g., alphabetical) may be needed in feature ETL. For example, the ordering will affect the result in one-hot encoding and RFormula.

This PR adds a parameter `stringOrderType` that supports the following four ordering options:

- 'frequencyDesc': descending order by label frequency (most frequent label assigned 0)
- 'frequencyAsc': ascending order by label frequency (least frequent label assigned 0)
- 'alphabetDesc': descending alphabetical order
- 'alphabetAsc': ascending alphabetical order

The default is still descending order of label frequency, so there should be no impact on existing programs.
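The four orderings can be sketched in plain Scala without Spark. This is only an illustration, not the code in this patch: `LabelOrderingSketch` and `orderLabels` are hypothetical names, and breaking frequency ties alphabetically is an assumption made here so that the output is deterministic.

```scala
object LabelOrderingSketch {
  // Hypothetical helper, not Spark's implementation. Returns the distinct labels
  // in index order: position i holds the label that would receive index i.
  def orderLabels(values: Seq[String], stringOrderType: String): Seq[String] = {
    // Count how often each distinct label occurs.
    val counts: Map[String, Int] =
      values.groupBy(identity).map { case (label, group) => label -> group.size }
    stringOrderType match {
      // Ties on frequency are broken alphabetically (an assumption of this sketch).
      case "frequencyDesc" => counts.toSeq.sortBy { case (label, n) => (-n, label) }.map(_._1)
      case "frequencyAsc"  => counts.toSeq.sortBy { case (label, n) => (n, label) }.map(_._1)
      case "alphabetDesc"  => counts.keys.toSeq.sorted(Ordering[String].reverse)
      case "alphabetAsc"   => counts.keys.toSeq.sorted
      case other =>
        throw new IllegalArgumentException(s"Unsupported stringOrderType: $other")
    }
  }

  def main(args: Array[String]): Unit = {
    val labels = Seq("b", "b", "c", "a", "a", "b")
    Seq("frequencyDesc", "frequencyAsc", "alphabetDesc", "alphabetAsc").foreach { orderType =>
      println(s"$orderType -> ${orderLabels(labels, orderType).mkString(", ")}")
    }
  }
}
```

For the labels b, b, c, a, a, b the sketch orders them as b, a, c (frequencyDesc), c, a, b (frequencyAsc), c, b, a (alphabetDesc), and a, b, c (alphabetAsc).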

How was this patch tested?

new test

actuaryzhang · 2017-05-06T07:10:47Z

@jkbradley @MLnick @holdenk @pnpritchard @yanboliang @sethah @imatiach-msft @srowen

An example to illustrate the idea:

val data = Seq((0, "b"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "b"))
val df = data.toDF("id", "label")
val indexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("labelIndex")
df.show
+---+-----+
| id|label|
+---+-----+
|  0|    b|
|  1|    b|
|  2|    c|
|  3|    a|
|  4|    a|
|  5|    b|
+---+-----+

Below is the result corresponding to the different types of label ordering.

indexer.setStringOrderType("freq_desc").fit(df).transform(df)
+---+-----+----------+
| id|label|labelIndex|
+---+-----+----------+
|  0|    b|       0.0|
|  1|    b|       0.0|
|  2|    c|       2.0|
|  3|    a|       1.0|
|  4|    a|       1.0|
|  5|    b|       0.0|
+---+-----+----------+

indexer.setStringOrderType("freq_asc").fit(df).transform(df)
+---+-----+----------+
| id|label|labelIndex|
+---+-----+----------+
|  0|    b|       2.0|
|  1|    b|       2.0|
|  2|    c|       0.0|
|  3|    a|       1.0|
|  4|    a|       1.0|
|  5|    b|       2.0|
+---+-----+----------+

indexer.setStringOrderType("alphabet_desc").fit(df).transform(df)
+---+-----+----------+
| id|label|labelIndex|
+---+-----+----------+
|  0|    b|       1.0|
|  1|    b|       1.0|
|  2|    c|       0.0|
|  3|    a|       2.0|
|  4|    a|       2.0|
|  5|    b|       1.0|
+---+-----+----------+

indexer.setStringOrderType("alphabet_asc").fit(df).transform(df)
+---+-----+----------+
| id|label|labelIndex|
+---+-----+----------+
|  0|    b|       1.0|
|  1|    b|       1.0|
|  2|    c|       2.0|
|  3|    a|       0.0|
|  4|    a|       0.0|
|  5|    b|       1.0|
+---+-----+----------+

holdenk · 2017-05-06T07:38:15Z

What would be some common cases where alphabet ordering would be needed?

actuaryzhang · 2017-05-06T07:57:26Z

@holdenk The main motivation for this PR is that the behavior of StringIndexer affects OneHotEncoder, RFormula, and the models estimated from these transformers. A few desired improvements to RFormula could not be made without this change in StringIndexer.

One use case for alphabetical ordering is comparing Spark model results with those from R, which drops the first alphabetical value in one-hot encoding. Right now, even though we do lots of comparisons between Spark and R, we lack comparisons involving String features because the encoding is different. There is already a JIRA.

Another motivation for this PR is to support ascending order by label frequency. This is also related to one-hot encoding. In practical applications of regression-type models, it is almost always better to set the most frequent label as the reference level (i.e., drop the most frequent label in one-hot encoding) for better interpretability. Right now, the behavior is the opposite, which makes results very difficult to interpret.

I think the flexibility of different orderings will greatly benefit downstream feature transformers and model estimators. Does this make sense?
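The link between label order and the reference level can be sketched as follows. This is a simplified model, assuming the encoder drops the category with the highest index; `ReferenceLevelSketch`, `referenceLevel`, and `oneHotDropLast` are hypothetical names used only for illustration.

```scala
object ReferenceLevelSketch {
  // Assumption for illustration: the encoder drops the category with the highest
  // index, so the last label in index order becomes the reference level.
  def referenceLevel(labelsInIndexOrder: Seq[String]): String =
    labelsInIndexOrder.last

  // One-hot vector with the last category dropped: k labels give k - 1 slots,
  // and the reference level is encoded as all zeros.
  def oneHotDropLast(labelsInIndexOrder: Seq[String], label: String): Vector[Double] = {
    val idx = labelsInIndexOrder.indexOf(label)
    require(idx >= 0, s"Unknown label: $label")
    Vector.tabulate(labelsInIndexOrder.size - 1)(i => if (i == idx) 1.0 else 0.0)
  }

  def main(args: Array[String]): Unit = {
    // Index orders for labels b (3 occurrences), a (2), c (1).
    val frequencyDesc = Seq("b", "a", "c")
    val frequencyAsc = Seq("c", "a", "b")
    val alphabetDesc = Seq("c", "b", "a")
    println(referenceLevel(frequencyDesc)) // least frequent label is dropped
    println(referenceLevel(frequencyAsc))  // most frequent label is dropped
    println(referenceLevel(alphabetDesc))  // first alphabetical label is dropped
  }
}
```

Under that assumption, frequency-descending order drops the least frequent label, frequency-ascending order drops the most frequent one, and descending alphabetical order drops the first alphabetical label, which is the convention attributed to R in this thread.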

SparkQA · 2017-05-06T08:00:15Z

Test build #76517 has finished for PR 17879 at commit ffd0cfc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

actuaryzhang · 2017-05-06T08:03:47Z

@yanboliang Since you have reported a few issues caused by the different encoding between Spark and R (e.g., SPARK-14659 and SPARK-14657), could you add some comments?

viirya · 2017-05-06T08:17:40Z

mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala

+  final val stringOrderType: Param[String] = new Param(this, "stringOrderType",
+    "The method used to order values of input column. " +
+      s"Supported options: ${StringIndexer.supportedStringOrderType.mkString(", ")}.",
+    (value: String) => StringIndexer.supportedStringOrderType.contains(value.toLowerCase))


Use ParamValidators.inArray?

@viirya ParamValidators.inArray does not allow case-insensitive validation, does it?

Yeah, I originally thought you'd change to case-sensitive. It looks good to me.
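The difference between the two validation styles discussed here can be sketched as plain `String => Boolean` functions. The object and value names are hypothetical, and the option strings are the snake_case names used at this point in the review; the exact-match function only approximates what an "in array" validator would do.

```scala
object OrderTypeValidationSketch {
  // Option names as they stood at this point in the review (later renamed).
  val supported: Array[String] =
    Array("freq_desc", "freq_asc", "alphabet_desc", "alphabet_asc")

  // Exact-match check: the value must appear in the array as written.
  val caseSensitive: String => Boolean =
    value => supported.contains(value)

  // Case-insensitive check, mirroring the lambda in the diff above:
  // the value is lowercased before the lookup.
  val caseInsensitive: String => Boolean =
    value => supported.contains(value.toLowerCase)

  def main(args: Array[String]): Unit = {
    println(caseSensitive("FREQ_DESC"))   // rejected by the exact match
    println(caseInsensitive("FREQ_DESC")) // accepted after lowercasing
  }
}
```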

viirya · 2017-05-06T08:23:00Z

mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala

+      .select(col($(inputCol)).cast(StringType))
+      .rdd.map(_.getString(0))
+    val labels = $(stringOrderType) match {
+      case "freq_desc" => values.countByValue().toSeq.sortBy(-_._2).map(_._1).toArray


This setting is effectively case-insensitive because the param validator lowercases the value before checking it, so $(stringOrderType) might be upper-case here. Should we make it case-sensitive, or call toLowerCase here?

@viirya Great catch. Thanks!
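A small sketch of the problem raised above, using hypothetical names and descriptive strings in place of the real label computation: a value that passes a case-insensitive validator can still fall through a `match` on the raw string.

```scala
object OrderTypeMatchSketch {
  // Matches on the raw value. An upper-case input such as "FREQ_DESC" would pass
  // a case-insensitive validator but matches no case here, so scala.MatchError
  // is thrown at runtime.
  def rawMatch(orderType: String): String = orderType match {
    case "freq_desc"     => "most frequent label first"
    case "freq_asc"      => "least frequent label first"
    case "alphabet_desc" => "reverse alphabetical"
    case "alphabet_asc"  => "alphabetical"
  }

  // Normalizing before the match keeps validation and dispatch consistent.
  def normalizedMatch(orderType: String): String =
    rawMatch(orderType.toLowerCase)

  def main(args: Array[String]): Unit = {
    println(normalizedMatch("FREQ_DESC"))
    println(scala.util.Try(rawMatch("FREQ_DESC")).isFailure)
  }
}
```

Either fix suggested in the comment closes the gap: lowercase before matching, or make validation case-sensitive so that only the exact option strings reach the match.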

actuaryzhang · 2017-05-06T08:50:43Z

@viirya Thanks much for your comments. Made a new commit to address them.

viirya · 2017-05-06T09:28:08Z

mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala

+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getStringOrderType: String = $(stringOrderType)


I checked other ML classes. Looks like for a case-insensitive setting, we may do toLowerCase in its public API:

def getStringOrderType: String = $(stringOrderType).toLowerCase

And you can use getStringOrderType below instead of $(stringOrderType).toLowerCase in fit.

@viirya Which ML classes were you referring to? In another PR (#16675) I was told not to change the raw values in getters.

SparkQA · 2017-05-06T09:50:50Z

Test build #76521 has finished for PR 17879 at commit 97e020f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

actuaryzhang · 2017-05-09T03:35:54Z

ping @yanboliang @felixcheung
This is needed for one-hot encoding to be consistent with R, therefore enabling direct comparison of Spark results to R. Could you guys please take a look? Thanks.

felixcheung

LGTM - I want to double check https://github.com/apache/spark/pull/17879/files#r115116845
And given where we are now we should change this from 2.2 to 2.3

felixcheung · 2017-05-09T03:38:39Z

mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala

@@ -131,6 +163,12 @@ object StringIndexer extends DefaultParamsReadable[StringIndexer] {
  private[feature] val KEEP_INVALID: String = "keep"
  private[feature] val supportedHandleInvalids: Array[String] =
    Array(SKIP_INVALID, ERROR_INVALID, KEEP_INVALID)
+  private[feature] val FREQ_DESC: String = "freq_desc"


is there any prior standard for these names like freq_desc?

@felixcheung I did not find any prior standard, and am open to suggestions for better names.
Maybe frequency_desc or count_desc would be better?

@gatorsmile thoughts?

actuaryzhang · 2017-05-09T03:52:30Z

@felixcheung Thanks. I will update the annotation.

viirya · 2017-05-09T03:59:30Z

@actuaryzhang Something seems to be wrong with GitHub's webpage, so I can't reply directly to the comment above. ALSModelParams.getColdStartStrategy is one example.

SparkQA · 2017-05-09T04:49:17Z

Test build #76619 has finished for PR 17879 at commit ba34043.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

actuaryzhang · 2017-05-09T04:55:33Z

Thanks much @felixcheung and @viirya. I have addressed your comments.

- Update the annotation from 2.2 to 2.3.
- Change freq_desc to frequency_desc.
- Move toLowerCase to the getter method.

Please let me know if anything else is needed. Thanks!

gatorsmile · 2017-05-09T05:21:52Z

mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala

+  private[feature] val FREQ_DESC: String = "frequency_desc"
+  private[feature] val FREQ_ASC: String = "frequency_asc"
+  private[feature] val ALPHABET_DESC: String = "alphabet_desc"
+  private[feature] val ALPHABET_ASC: String = "alphabet_asc"


Normally, we do not use underscores in names. lowerCamelCase is our naming rule.

Thanks for pinging me, @felixcheung

@gatorsmile Thanks much for the suggestion. Changed them to lowerCamelCase.
@felixcheung Any additional suggestions?

SparkQA · 2017-05-09T05:56:40Z

Test build #76621 has finished for PR 17879 at commit ff9b1d6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-05-09T06:03:35Z

Test build #76624 has finished for PR 17879 at commit 07198d9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-05-09T06:04:22Z

LGTM

felixcheung · 2017-05-09T06:15:23Z

mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala

+    "how to order labels of string column. " +
+    "The first label after ordering is assigned an index of 0. " +
+    s"Supported options: ${StringIndexer.supportedStringOrderType.mkString(", ")}.",
+    ParamValidators.inArray(StringIndexer.supportedStringOrderType))


So we are going case-sensitive then?

@felixcheung Right. Being case-insensitive no longer makes much sense now that we use camel case.

SparkQA · 2017-05-09T06:42:36Z

Test build #76628 has finished for PR 17879 at commit 53381ea.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

actuaryzhang · 2017-05-09T17:52:14Z

@felixcheung @gatorsmile @MLnick @jkbradley @holdenk @yanboliang @srowen @sethah
Would you please take another look and let me know if you have any additional suggestions? Thanks much!

felixcheung

LGTM, did another pass.
@yanboliang @jkbradley do you want to have a look?

actuaryzhang · 2017-05-11T07:32:51Z

@felixcheung Thanks much for your review.
@yanboliang @jkbradley Since there are two approvals, could you guys take a look and merge if it's good? We really need this for a couple of SparkR related issues, e.g., SPARK-14659 and SPARK-14657. Thanks much!

actuaryzhang · 2017-05-12T01:03:59Z

@felixcheung
It would be great if you could help merge this. I could address comments (if any) in a future PR.
This seems a pretty straightforward change that removes a big blocker. I will continue working on the RFormula side once this merges in and fix the SparkR issues related to string ordering.

Thanks much!

felixcheung · 2017-05-12T07:13:05Z

merged to master.

actuaryzhang · 2017-05-12T07:16:42Z

Thanks much!

## What changes were proposed in this pull request?

StringIndexer maps labels to numbers according to the descending order of label frequency. Other types of ordering (e.g., alphabetical) may be needed in feature ETL. For example, the ordering will affect the result in one-hot encoding and RFormula.

This PR proposes to support other ordering methods and we add a parameter `stringOrderType` that supports the following four options:

- 'frequencyDesc': descending order by label frequency (most frequent label assigned 0)
- 'frequencyAsc': ascending order by label frequency (least frequent label assigned 0)
- 'alphabetDesc': descending alphabetical order
- 'alphabetAsc': ascending alphabetical order

The default is still descending order of label frequency, so there should be no impact to existing programs.

## How was this patch tested?

new test

Author: Wayne Zhang <actuaryzhang@uber.com>

Closes apache#17879 from actuaryzhang/stringIndexer.

## What changes were proposed in this pull request?

PySpark StringIndexer supports StringOrderType added in #17879.

Author: Wayne Zhang <actuaryzhang@uber.com>

Closes #17978 from actuaryzhang/PythonStringIndexer.

## What changes were proposed in this pull request?

When handling strings, the categories dropped by RFormula and R are different:

- RFormula drops the least frequent level
- R drops the first level after ascending alphabetical ordering

This PR supports different string ordering types in StringIndexer apache#17879 so that RFormula can drop the same level as R when handling strings using `stringOrderType = "alphabetDesc"`.

## How was this patch tested?

new tests

Author: Wayne Zhang <actuaryzhang@uber.com>

Closes apache#17967 from actuaryzhang/RFormula.

StringIndexer supports multiple ways of label ordering · ffd0cfc

viirya reviewed May 6, 2017

address review comments and fix style · 97e020f

viirya reviewed May 6, 2017

actuaryzhang changed the title ~~[SPARK-20619][ML]StringIndexer supports multiple ways of label ordering~~ [SPARK-20619][ML] StringIndexer supports multiple ways of label ordering May 6, 2017

actuaryzhang changed the title ~~[SPARK-20619][ML] StringIndexer supports multiple ways of label ordering~~ [SPARK-20619][ML] StringIndexer supports multiple ways to order label May 6, 2017

felixcheung reviewed May 9, 2017

address comments- spell out freq and update annotation and toLowerCase · ba34043

fix style · ff9b1d6

fix annotation · 07198d9

gatorsmile reviewed May 9, 2017

Wayne Zhang added 2 commits May 8, 2017 22:39

use camel case · 6bbe7df

remove extra import · 53381ea

felixcheung reviewed May 9, 2017

felixcheung approved these changes May 11, 2017

asfgit closed this in af40bb1 May 12, 2017

actuaryzhang deleted the stringIndexer branch May 12, 2017 07:16

This was referenced May 12, 2017

[SPARK-14659][ML] RFormula consistent with R when handling strings #17967

Closed

[SPARK-20736][Python] PySpark StringIndexer supports StringOrderType #17978

Closed
