-
Notifications
You must be signed in to change notification settings - Fork 28.7k
[SPARK-20619][ML] StringIndexer supports multiple ways to order label #17879
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@jkbradley @MLnick @holdenk @pnpritchard @yanboliang @sethah @imatiach-msft @srowen An example to illustrate the idea:
Below is the result corresponding to the different types of label ordering.
|
What would be some common cases where alphabet ordering would be needed? |
@holdenk The main motivation for this PR is that the behavior of StringIndexer will affect OneHotEncoder, RFormula and models estimated based on these transformers. There have been a few desired improvement in RFormula that could not be done without the change in StringIndexer. One use case for alphabetical ordering is to make comparison of Spark model results to that in R, which drops the first alphabetical value in one-hot encoding. Right now, even though we do lots of comparisons between Spark and R, we lack comparisons involving String features because the encoding is different. There is already a JIRA. Another motivation for this PR is to support ascending order by label frequency. This is also related to one-hot encoding. In practical applications of regression type models, it is almost always better to set the most frequent label as the reference level (i.e., drop the most frequent label in OneHotEncoding) for better interpretability. Right now, the behavior is the opposite and has made it very difficult to interpret results. I think the flexibility of different ordering will benefit a lot the downstream feature transformers and model estimators. Does this make sense? |
Test build #76517 has finished for PR 17879 at commit
|
@yanboliang Since you have reported a few issues due to different encoding between Spark and R (e.g., #SPARK-14659 and #SPARK-14657), probably you could add some comments? |
final val stringOrderType: Param[String] = new Param(this, "stringOrderType", | ||
"The method used to order values of input column. " + | ||
s"Supported options: ${StringIndexer.supportedStringOrderType.mkString(", ")}.", | ||
(value: String) => StringIndexer.supportedStringOrderType.contains(value.toLowerCase)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use ParamValidators.inArray
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya ParamValidators.inArray
does not allow case-insensitive validation, does it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I originally thought you'd change to case-sensitive. It looks good to me.
.select(col($(inputCol)).cast(StringType)) | ||
.rdd.map(_.getString(0)) | ||
val labels = $(stringOrderType) match { | ||
case "freq_desc" => values.countByValue().toSeq.sortBy(-_._2).map(_._1).toArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems this setting is case-insensitive because your param validator doesn't care it, $(stringOrderType)
might be upper-case here. We may make it case-sensitive or do toLowerCase
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya Great catch. Thanks!
@viirya Thanks much for your comments. Made a new commit to address them. |
|
||
/** @group getParam */ | ||
@Since("2.2.0") | ||
def getStringOrderType: String = $(stringOrderType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked other ML classes. Looks like for a case-insensitive setting, we may do toLowerCase in its public API:
def getStringOrderType: String = $(stringOrderType).toLowerCase
And you can use getStringOrderType
below instead of $(stringOrderType).toLowerCase
in fit
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test build #76521 has finished for PR 17879 at commit
|
ping @yanboliang @felixcheung |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - I want to double check https://github.com/apache/spark/pull/17879/files#r115116845
And given where we are now we should change this from 2.2 to 2.3
@@ -131,6 +163,12 @@ object StringIndexer extends DefaultParamsReadable[StringIndexer] { | |||
private[feature] val KEEP_INVALID: String = "keep" | |||
private[feature] val supportedHandleInvalids: Array[String] = | |||
Array(SKIP_INVALID, ERROR_INVALID, KEEP_INVALID) | |||
private[feature] val FREQ_DESC: String = "freq_desc" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there any prior standard for these names like freq_desc
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@felixcheung I did not find any prior standard, and am open to suggestion for better names.
Maybe better use frequency_desc
or count_desc
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gatorsmile thought?
@felixcheung Thanks. I will update the annotation. |
@actuaryzhang There seems something wrong with Github's webpage, so I can't directly reply the above comment. |
Test build #76619 has finished for PR 17879 at commit
|
Thanks much @felixcheung and @viirya. I have addressed your comments.
Please let me know if there is anything needed. Thanks! |
private[feature] val FREQ_DESC: String = "frequency_desc" | ||
private[feature] val FREQ_ASC: String = "frequency_asc" | ||
private[feature] val ALPHABET_DESC: String = "alphabet_desc" | ||
private[feature] val ALPHABET_ASC: String = "alphabet_asc" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally, we do not use underscore in the names. lowerCamelCase
is our rules for naming.
Thanks for ping me, @felixcheung
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gatorsmile Thanks much for the suggestion. Changed them to lowerCamelCase.
@felixcheung Any additional suggestions?
Test build #76621 has finished for PR 17879 at commit
|
Test build #76624 has finished for PR 17879 at commit
|
LGTM |
"how to order labels of string column. " + | ||
"The first label after ordering is assigned an index of 0. " + | ||
s"Supported options: ${StringIndexer.supportedStringOrderType.mkString(", ")}.", | ||
ParamValidators.inArray(StringIndexer.supportedStringOrderType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we are going to case sensitive then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@felixcheung Right. It does not quite make sense to be case insensitive now given that we now use camel case.
Test build #76628 has finished for PR 17879 at commit
|
@felixcheung @gatorsmile @MLnick @jkbradley @holdenk @yanboliang @srowen @sethah |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, did another pass.
@yanboliang @jkbradley do you want to have a look?
@felixcheung Thanks much for your review. |
@felixcheung Thanks much! |
merged to master. |
Thanks much! |
## What changes were proposed in this pull request? StringIndexer maps labels to numbers according to the descending order of label frequency. Other types of ordering (e.g., alphabetical) may be needed in feature ETL. For example, the ordering will affect the result in one-hot encoding and RFormula. This PR proposes to support other ordering methods and we add a parameter `stringOrderType` that supports the following four options: - 'frequencyDesc': descending order by label frequency (most frequent label assigned 0) - 'frequencyAsc': ascending order by label frequency (least frequent label assigned 0) - 'alphabetDesc': descending alphabetical order - 'alphabetAsc': ascending alphabetical order The default is still descending order of label frequency, so there should be no impact to existing programs. ## How was this patch tested? new test Author: Wayne Zhang <actuaryzhang@uber.com> Closes apache#17879 from actuaryzhang/stringIndexer.
## What changes were proposed in this pull request? StringIndexer maps labels to numbers according to the descending order of label frequency. Other types of ordering (e.g., alphabetical) may be needed in feature ETL. For example, the ordering will affect the result in one-hot encoding and RFormula. This PR proposes to support other ordering methods and we add a parameter `stringOrderType` that supports the following four options: - 'frequencyDesc': descending order by label frequency (most frequent label assigned 0) - 'frequencyAsc': ascending order by label frequency (least frequent label assigned 0) - 'alphabetDesc': descending alphabetical order - 'alphabetAsc': ascending alphabetical order The default is still descending order of label frequency, so there should be no impact to existing programs. ## How was this patch tested? new test Author: Wayne Zhang <actuaryzhang@uber.com> Closes apache#17879 from actuaryzhang/stringIndexer.
## What changes were proposed in this pull request? PySpark StringIndexer supports StringOrderType added in apache#17879. Author: Wayne Zhang <actuaryzhang@uber.com> Closes apache#17978 from actuaryzhang/PythonStringIndexer.
## What changes were proposed in this pull request? When handling strings, the category dropped by RFormula and R are different: - RFormula drops the least frequent level - R drops the first level after ascending alphabetical ordering This PR supports different string ordering types in StringIndexer apache#17879 so that RFormula can drop the same level as R when handling strings using`stringOrderType = "alphabetDesc"`. ## How was this patch tested? new tests Author: Wayne Zhang <actuaryzhang@uber.com> Closes apache#17967 from actuaryzhang/RFormula.
What changes were proposed in this pull request?
StringIndexer maps labels to numbers according to the descending order of label frequency. Other types of ordering (e.g., alphabetical) may be needed in feature ETL. For example, the ordering will affect the result in one-hot encoding and RFormula.
This PR proposes to support other ordering methods and we add a parameter
stringOrderType
that supports the following four options:The default is still descending order of label frequency, so there should be no impact to existing programs.
How was this patch tested?
new test