Skip to content

[SPARK-20619][ML] StringIndexer supports multiple ways to order label #17879

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from

Conversation

actuaryzhang
Copy link
Contributor

@actuaryzhang actuaryzhang commented May 6, 2017

What changes were proposed in this pull request?

StringIndexer maps labels to numbers according to the descending order of label frequency. Other types of ordering (e.g., alphabetical) may be needed in feature ETL. For example, the ordering will affect the result in one-hot encoding and RFormula.

This PR proposes to support other ordering methods and we add a parameter stringOrderType that supports the following four options:

  • 'frequencyDesc': descending order by label frequency (most frequent label assigned 0)
  • 'frequencyAsc': ascending order by label frequency (least frequent label assigned 0)
  • 'alphabetDesc': descending alphabetical order
  • 'alphabetAsc': ascending alphabetical order

The default is still descending order of label frequency, so there should be no impact to existing programs.

How was this patch tested?

new test

@actuaryzhang
Copy link
Contributor Author

@jkbradley @MLnick @holdenk @pnpritchard @yanboliang @sethah @imatiach-msft @srowen

An example to illustrate the idea:

val data = Seq((0, "b"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "b"))
val df = data.toDF("id", "label")
val indexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("labelIndex")
df.show
+---+-----+
| id|label|
+---+-----+
|  0|    b|
|  1|    b|
|  2|    c|
|  3|    a|
|  4|    a|
|  5|    b|
+---+-----+

Below is the result corresponding to the different types of label ordering.

indexer.setStringOrderType("freq_desc").fit(df).transform(df)
+---+-----+----------+
| id|label|labelIndex|
+---+-----+----------+
|  0|    b|       0.0|
|  1|    b|       0.0|
|  2|    c|       2.0|
|  3|    a|       1.0|
|  4|    a|       1.0|
|  5|    b|       0.0|
+---+-----+----------+

indexer.setStringOrderType("freq_asc").fit(df).transform(df)
+---+-----+----------+
| id|label|labelIndex|
+---+-----+----------+
|  0|    b|       2.0|
|  1|    b|       2.0|
|  2|    c|       0.0|
|  3|    a|       1.0|
|  4|    a|       1.0|
|  5|    b|       2.0|
+---+-----+----------+

indexer.setStringOrderType("alphabet_desc").fit(df).transform(df)
+---+-----+----------+
| id|label|labelIndex|
+---+-----+----------+
|  0|    b|       1.0|
|  1|    b|       1.0|
|  2|    c|       0.0|
|  3|    a|       2.0|
|  4|    a|       2.0|
|  5|    b|       1.0|
+---+-----+----------+

indexer.setStringOrderType("alphabet_asc").fit(df).transform(df)
+---+-----+----------+
| id|label|labelIndex|
+---+-----+----------+
|  0|    b|       1.0|
|  1|    b|       1.0|
|  2|    c|       2.0|
|  3|    a|       0.0|
|  4|    a|       0.0|
|  5|    b|       1.0|
+---+-----+----------+

@holdenk
Copy link
Contributor

holdenk commented May 6, 2017

What would be some common cases where alphabet ordering would be needed?

@actuaryzhang
Copy link
Contributor Author

actuaryzhang commented May 6, 2017

@holdenk The main motivation for this PR is that the behavior of StringIndexer will affect OneHotEncoder, RFormula and models estimated based on these transformers. There have been a few desired improvement in RFormula that could not be done without the change in StringIndexer.

One use case for alphabetical ordering is to make comparison of Spark model results to that in R, which drops the first alphabetical value in one-hot encoding. Right now, even though we do lots of comparisons between Spark and R, we lack comparisons involving String features because the encoding is different. There is already a JIRA.

Another motivation for this PR is to support ascending order by label frequency. This is also related to one-hot encoding. In practical applications of regression type models, it is almost always better to set the most frequent label as the reference level (i.e., drop the most frequent label in OneHotEncoding) for better interpretability. Right now, the behavior is the opposite and has made it very difficult to interpret results.

I think the flexibility of different ordering will benefit a lot the downstream feature transformers and model estimators. Does this make sense?

@SparkQA
Copy link

SparkQA commented May 6, 2017

Test build #76517 has finished for PR 17879 at commit ffd0cfc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang
Copy link
Contributor Author

@yanboliang Since you have reported a few issues due to different encoding between Spark and R (e.g., #SPARK-14659 and #SPARK-14657), probably you could add some comments?

final val stringOrderType: Param[String] = new Param(this, "stringOrderType",
"The method used to order values of input column. " +
s"Supported options: ${StringIndexer.supportedStringOrderType.mkString(", ")}.",
(value: String) => StringIndexer.supportedStringOrderType.contains(value.toLowerCase))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use ParamValidators.inArray?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya ParamValidators.inArray does not allow case-insensitive validation, does it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I originally thought you'd change to case-sensitive. It looks good to me.

.select(col($(inputCol)).cast(StringType))
.rdd.map(_.getString(0))
val labels = $(stringOrderType) match {
case "freq_desc" => values.countByValue().toSeq.sortBy(-_._2).map(_._1).toArray
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems this setting is case-insensitive because your param validator doesn't care it, $(stringOrderType) might be upper-case here. We may make it case-sensitive or do toLowerCase here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya Great catch. Thanks!

@actuaryzhang
Copy link
Contributor Author

@viirya Thanks much for your comments. Made a new commit to address them.


/** @group getParam */
@Since("2.2.0")
def getStringOrderType: String = $(stringOrderType)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked other ML classes. Looks like for a case-insensitive setting, we may do toLowerCase in its public API:

def getStringOrderType: String = $(stringOrderType).toLowerCase

And you can use getStringOrderType below instead of $(stringOrderType).toLowerCase in fit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya Which ML classes were you referring to? I was told not to change the raw values in the getters in other PRs #16675.

@SparkQA
Copy link

SparkQA commented May 6, 2017

Test build #76521 has finished for PR 17879 at commit 97e020f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang actuaryzhang changed the title [SPARK-20619][ML]StringIndexer supports multiple ways of label ordering [SPARK-20619][ML] StringIndexer supports multiple ways of label ordering May 6, 2017
@actuaryzhang actuaryzhang changed the title [SPARK-20619][ML] StringIndexer supports multiple ways of label ordering [SPARK-20619][ML] StringIndexer supports multiple ways to order label May 6, 2017
@actuaryzhang
Copy link
Contributor Author

actuaryzhang commented May 9, 2017

ping @yanboliang @felixcheung
This is needed for one-hot encoding to be consistent with R, therefore enabling direct comparison of Spark results to R. Could you guys please take a look? Thanks.

Copy link
Member

@felixcheung felixcheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - I want to double check https://github.com/apache/spark/pull/17879/files#r115116845
And given where we are now we should change this from 2.2 to 2.3

@@ -131,6 +163,12 @@ object StringIndexer extends DefaultParamsReadable[StringIndexer] {
private[feature] val KEEP_INVALID: String = "keep"
private[feature] val supportedHandleInvalids: Array[String] =
Array(SKIP_INVALID, ERROR_INVALID, KEEP_INVALID)
private[feature] val FREQ_DESC: String = "freq_desc"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any prior standard for these names like freq_desc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felixcheung I did not find any prior standard, and am open to suggestion for better names.
Maybe better use frequency_desc or count_desc?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gatorsmile thought?

@actuaryzhang
Copy link
Contributor Author

@felixcheung Thanks. I will update the annotation.

@viirya
Copy link
Member

viirya commented May 9, 2017

@actuaryzhang There seems something wrong with Github's webpage, so I can't directly reply the above comment. ALSModelParams.getColdStartStrategy is one example.

@SparkQA
Copy link

SparkQA commented May 9, 2017

Test build #76619 has finished for PR 17879 at commit ba34043.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang
Copy link
Contributor Author

actuaryzhang commented May 9, 2017

Thanks much @felixcheung and @viirya. I have addressed your comments.

  • update from 2.2 to 2.3
  • change freq_desc to frequency_desc.
  • move toLowerCase to the getter method.

Please let me know if there is anything needed. Thanks!

private[feature] val FREQ_DESC: String = "frequency_desc"
private[feature] val FREQ_ASC: String = "frequency_asc"
private[feature] val ALPHABET_DESC: String = "alphabet_desc"
private[feature] val ALPHABET_ASC: String = "alphabet_asc"
Copy link
Member

@gatorsmile gatorsmile May 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally, we do not use underscore in the names. lowerCamelCase is our rules for naming.

Thanks for ping me, @felixcheung

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gatorsmile Thanks much for the suggestion. Changed them to lowerCamelCase.
@felixcheung Any additional suggestions?

@SparkQA
Copy link

SparkQA commented May 9, 2017

Test build #76621 has finished for PR 17879 at commit ff9b1d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 9, 2017

Test build #76624 has finished for PR 17879 at commit 07198d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member

viirya commented May 9, 2017

LGTM

"how to order labels of string column. " +
"The first label after ordering is assigned an index of 0. " +
s"Supported options: ${StringIndexer.supportedStringOrderType.mkString(", ")}.",
ParamValidators.inArray(StringIndexer.supportedStringOrderType))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we are going to case sensitive then?

Copy link
Contributor Author

@actuaryzhang actuaryzhang May 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felixcheung Right. It does not quite make sense to be case insensitive now given that we now use camel case.

@SparkQA
Copy link

SparkQA commented May 9, 2017

Test build #76628 has finished for PR 17879 at commit 53381ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang
Copy link
Contributor Author

@felixcheung @gatorsmile @MLnick @jkbradley @holdenk @yanboliang @srowen @sethah
Would you please take another look and let me know any additional suggestions? Thanks much!

Copy link
Member

@felixcheung felixcheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, did another pass.
@yanboliang @jkbradley do you want to have a look?

@actuaryzhang
Copy link
Contributor Author

@felixcheung Thanks much for your review.
@yanboliang @jkbradley Since there are two approvals, could you guys take a look and merge if it's good? We really need this for a couple of SparkR related issues, e.g., SPARK-14659 and SPARK-14657. Thanks much!

@actuaryzhang
Copy link
Contributor Author

@felixcheung
It would be great if you could help merge this. I could address comments (if any) in a future PR.
This seems a pretty straightforward change that removes a big blocker. I will continue working on the RFormula side once this merges in and fix the SparkR issues related to string ordering.

Thanks much!

@felixcheung
Copy link
Member

merged to master.

@asfgit asfgit closed this in af40bb1 May 12, 2017
@actuaryzhang
Copy link
Contributor Author

Thanks much!

@actuaryzhang actuaryzhang deleted the stringIndexer branch May 12, 2017 07:16
robert3005 pushed a commit to palantir/spark that referenced this pull request May 19, 2017
## What changes were proposed in this pull request?

StringIndexer maps labels to numbers according to the descending order of label frequency. Other types of ordering (e.g., alphabetical) may be needed in feature ETL.  For example, the ordering will affect the result in one-hot encoding and RFormula.

This PR proposes to support other ordering methods and we add a parameter `stringOrderType` that supports the following four options:
- 'frequencyDesc': descending order by label frequency (most frequent label assigned 0)
- 'frequencyAsc': ascending order by label frequency (least frequent label assigned 0)
- 'alphabetDesc': descending alphabetical order
- 'alphabetAsc': ascending alphabetical order

The default is still descending order of label frequency, so there should be no impact to existing programs.

## How was this patch tested?
new test

Author: Wayne Zhang <actuaryzhang@uber.com>

Closes apache#17879 from actuaryzhang/stringIndexer.
asfgit pushed a commit that referenced this pull request May 21, 2017
## What changes were proposed in this pull request?
PySpark StringIndexer supports StringOrderType added in #17879.

Author: Wayne Zhang <actuaryzhang@uber.com>

Closes #17978 from actuaryzhang/PythonStringIndexer.
liyichao pushed a commit to liyichao/spark that referenced this pull request May 24, 2017
## What changes were proposed in this pull request?

StringIndexer maps labels to numbers according to the descending order of label frequency. Other types of ordering (e.g., alphabetical) may be needed in feature ETL.  For example, the ordering will affect the result in one-hot encoding and RFormula.

This PR proposes to support other ordering methods and we add a parameter `stringOrderType` that supports the following four options:
- 'frequencyDesc': descending order by label frequency (most frequent label assigned 0)
- 'frequencyAsc': ascending order by label frequency (least frequent label assigned 0)
- 'alphabetDesc': descending alphabetical order
- 'alphabetAsc': ascending alphabetical order

The default is still descending order of label frequency, so there should be no impact to existing programs.

## How was this patch tested?
new test

Author: Wayne Zhang <actuaryzhang@uber.com>

Closes apache#17879 from actuaryzhang/stringIndexer.
liyichao pushed a commit to liyichao/spark that referenced this pull request May 24, 2017
## What changes were proposed in this pull request?
PySpark StringIndexer supports StringOrderType added in apache#17879.

Author: Wayne Zhang <actuaryzhang@uber.com>

Closes apache#17978 from actuaryzhang/PythonStringIndexer.
ghost pushed a commit to dbtsai/spark that referenced this pull request May 26, 2017
## What changes were proposed in this pull request?
When handling strings, the category dropped by RFormula and R are different:
- RFormula drops the least frequent level
- R drops the first level after ascending alphabetical ordering

This PR supports different string ordering types in StringIndexer apache#17879 so that RFormula can drop the same level as R when handling strings using`stringOrderType = "alphabetDesc"`.

## How was this patch tested?
new tests

Author: Wayne Zhang <actuaryzhang@uber.com>

Closes apache#17967 from actuaryzhang/RFormula.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants