Skip to content

[SPARK-14659][ML] RFormula consistent with R when handling strings #17967

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 10 commits into from

Conversation

actuaryzhang
Copy link
Contributor

@actuaryzhang actuaryzhang commented May 12, 2017

What changes were proposed in this pull request?

When handling strings, the category dropped by RFormula and R are different:

  • RFormula drops the least frequent level
  • R drops the first level after ascending alphabetical ordering

This PR supports different string ordering types in StringIndexer #17879 so that RFormula can drop the same level as R when handling strings usingstringOrderType = "alphabetDesc".

How was this patch tested?

new tests

@SparkQA
Copy link

SparkQA commented May 12, 2017

Test build #76877 has finished for PR 17967 at commit 77fe864.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang
Copy link
Contributor Author

actuaryzhang commented May 12, 2017

@felixcheung @viirya @ericl @yinxusen
The following example illustrates the idea:

val original = Seq((1, "foo", 4), (2, "bar", 4), (3, "bar", 5), (4, "baz", 5)).toDF("id", "a", "b")
val formula = new RFormula().setFormula("id ~ a + b")
// 1) default: descending order by label frequency
formula.setStringOrderType("frequencyDesc").fit(original).transform(original)
+---+---+---+-------------+-----+
| id|  a|  b|     features|label|
+---+---+---+-------------+-----+
|  1|foo|  4|[0.0,0.0,4.0]|  1.0|
|  2|bar|  4|[1.0,0.0,4.0]|  2.0|
|  3|bar|  5|[1.0,0.0,5.0]|  3.0|
|  4|aaz|  5|[0.0,1.0,5.0]|  4.0|
+---+---+---+-------------+-----+
// 2) descending alphabetical order.
formula.setStringOrderType("alphabetDesc").fit(original).transform(original)
+---+---+---+-------------+-----+
| id|  a|  b|     features|label|
+---+---+---+-------------+-----+
|  1|foo|  4|[1.0,0.0,4.0]|  1.0|
|  2|bar|  4|[0.0,1.0,4.0]|  2.0|
|  3|bar|  5|[0.0,1.0,5.0]|  3.0|
|  4|aaz|  5|[0.0,0.0,5.0]|  4.0|
+---+---+---+-------------+-----+

Note the last one with stringOrderType = "alphabetDesc" drops the same category as R (i.e., "aaz" is dropped and treated as the reference level). The following is from R

df <- data.frame(id = c(1, 2, 3, 4),
                  a = c("foo", "bar", "bar", "aaz"),
                  b = c(4, 4, 5, 5))
model.matrix(id ~ a + b, df)[, -1]
   bar foo  b
    0    1  4
    1    0  4
    1    0  5
    0    0  5

@actuaryzhang actuaryzhang changed the title [SPARK-14659] RFormula allows to drop the same category as R when handling strings [SPARK-14659][ML] RFormula allows to drop the same category as R when handling strings May 12, 2017
@actuaryzhang
Copy link
Contributor Author

actuaryzhang commented May 12, 2017

Some points to note:

  1. StringIndexer and RFormula use repetitive code when defining stringOrderType. Should this be moved to a trait? I did not do so since it was only used by the two transformers.
  2. When setting stringOrderType = "alphabetDesc", RFormula drops the same category as R. This makes sure Spark will produce the same model as base R. However, note that the ordering of the columns is still different, e.g., ("bar", "foo") in R and ("foo", "bar") in RFormula.

@SparkQA
Copy link

SparkQA commented May 12, 2017

Test build #76878 has finished for PR 17967 at commit a1be94c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang actuaryzhang changed the title [SPARK-14659][ML] RFormula allows to drop the same category as R when handling strings [SPARK-14659][ML] RFormula consistent with R when handling strings May 12, 2017
@felixcheung
Copy link
Member

felixcheung commented May 14, 2017

cool - I think this is important to have. do you have a higher level example of the old/new model output as compare to R as it is affected by the string ordering?

Copy link
Member

@felixcheung felixcheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a comment

*/
@Since("2.3.0")
final val stringOrderType: Param[String] = new Param(this, "stringOrderType",
"how to order labels of string column. " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be capitalize "How?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed it, thanks!

@actuaryzhang
Copy link
Contributor Author

actuaryzhang commented May 14, 2017

@felixcheung Thanks for the review. I fixed some typo.
Below is an example to show the difference in model estimates due to different string ordering between R and RFormula.

val df = Seq((1.0, "foo", "a"), (2.0, "bar", "b"), (3.0, "bar", "b"), (4.0, "aaz", "b"),
    (4.2, "aaz", "b"), (1.6, "bar", "a")).toDF("id", "a", "b")
val formula = new RFormula().setFormula("id ~ a + b")
for (orderType <- Seq("frequencyDesc", "alphabetDesc")) {
 val df2 = formula.setStringOrderType(orderType).fit(df).transform(df)
 val model = new GeneralizedLinearRegression().fit(df2)
 val estimate = (model.coefficients.toArray :+ model.intercept)
 println(orderType + ": " + estimate.sortWith(_ < _).mkString(","))
}
frequencyDesc: 0.6,0.9,1.0,2.2
alphabetDesc: -2.2,-1.6,0.9,3.2

The following is the estimate from R, which is the same as stringOrderType = "alphabetDesc".

> df <- data.frame(id = c(1, 2, 3, 4, 4.2, 1.6),
+                   a = c("foo", "bar", "bar", "aaz", "aaz", "bar"),
+                   b = c("a", "b", "b", "b", "b", "a"))
> sort(coef(lm(id ~ a + b, data = df)))
       afoo        abar          bb (Intercept) 
       -2.2        -1.6         0.9         3.2 

@SparkQA
Copy link

SparkQA commented May 14, 2017

Test build #76913 has finished for PR 17967 at commit 698588e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Copy link
Member

thanks for the example, I think that's very concrete that this change would be very useful

@actuaryzhang
Copy link
Contributor Author

@felixcheung Once this PR gets in, I'll update the SparkR side and include some test.

Copy link
Member

@felixcheung felixcheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @yanboliang @jkbradley if you would like to review

@@ -37,6 +37,29 @@ import org.apache.spark.sql.types._
*/
private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {

/**
* Param for how to order labels of string column. The first label after ordering is assigned
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a comment explaining which option is consistent with R?

@actuaryzhang
Copy link
Contributor Author

@viirya Great point. Added a comment to explain this in the doc.

@viirya
Copy link
Member

viirya commented May 19, 2017

LGTM

@SparkQA
Copy link

SparkQA commented May 19, 2017

Test build #77085 has finished for PR 17967 at commit 147311b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@yanboliang yanboliang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@actuaryzhang Thanks for this PR, and also thanks for reviewing from @felixcheung and @viirya . One quick question: Do you know what is the typical scenario to use alphabetDesc or alphabetAsc in R? Thanks.

@@ -37,6 +37,31 @@ import org.apache.spark.sql.types._
*/
private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {

/**
* Param for how to order labels of string column. The first label after ordering is assigned
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please mentions of label/labels in all comments, since we handle all string columns regardless of feature columns or label column. The current description may confused users.

* @group param
*/
@Since("2.3.0")
final val stringOrderType: Param[String] = new Param(this, "stringOrderType",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about orderTypeOfStringIndexer or any other suggestion? I think the param name stringOrderType in StringIndexer is clear enough, but RFormula involves lots of feature transformers across many steps, we should make users understand this param only takes effect on StringIndexer stage. And it's better to document more clear rather than copying corresponding section from StringIndexer. cc @felixcheung

@actuaryzhang
Copy link
Contributor Author

actuaryzhang commented May 19, 2017

@yanboliang Thanks for the question.

The alphabetically ascending order in R is very convenient for display purpose. For example, when you do a summary of model results, the results will be easier to understand if it is in alphabetically ascending order.

That's the default, but oftentimes users will reset the reference level to make the most frequent level as the base (the one dropped in one-hot encoding). This also facilitates interpretation, because the most frequent level can be roughly regarded as the population average (in very unbalanced data). Otherwise, especially in unbalanced data, the contrast between categories with few data is most times insignificant. Of course, this does not change the model, but it is very important for interpretation.

I understand that ordering string levels by descending frequency is helpful for other applications like tree based split decisions. But it will make the ML library much better if we can support these other options that are often used in day-to-day work. This will broaden the use case of Spark ML.

@actuaryzhang
Copy link
Contributor Author

@yanboliang Thanks for the review and suggestion. Makes lots of sense. I made a new commit to address these.

@SparkQA
Copy link

SparkQA commented May 20, 2017

Test build #77110 has finished for PR 17967 at commit 5f31d31.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* The last category after ordering is dropped when encoding strings.
* The options are explained using an example string: 'b', 'a', 'b', 'a', 'c', 'b'
* |
* | Option | Category mapped to 0 by StringIndexer | Category dropped by RFormula
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this going to generate the right format, @HyukjinKwon do you know?
I understand not all markdown style is supported

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, up to my knowledge, it looks not. I guess @actuaryzhang meant to just write these out as they are? Let me double check by myself ....

Scaladoc

2017-05-20 1 40 35

Javadoc

2017-05-20 1 40 53

Copy link
Member

@HyukjinKwon HyukjinKwon May 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My humble suggestion is prose with simple - or resembling any other formats in this package if there are similar instances.

@actuaryzhang
Copy link
Contributor Author

@felixcheung @HyukjinKwon Thanks much for pointing out the documentation issues.
I still prefer to have a table to clearly illustrate what each option is doing.
Made a new commit to make this work. Now the doc looks like:

image

@SparkQA
Copy link

SparkQA commented May 20, 2017

Test build #77116 has finished for PR 17967 at commit 341949c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

(FWIW, {{{ ... }}} should work for Javadoc too given my past try - #15999 (comment))

@actuaryzhang
Copy link
Contributor Author

@HyukjinKwon @felixcheung I confirm it works for Javadoc.
image

@SparkQA
Copy link

SparkQA commented May 20, 2017

Test build #77130 has finished for PR 17967 at commit 24818a7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Copy link
Member

hmm, should we just use html <table>?

@@ -37,6 +37,42 @@ import org.apache.spark.sql.types._
*/
private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {

/**
* Param for how to order categories of a string FEATURE column used by `StringIndexer`.
* The last category after ordering is dropped when encoding strings.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct? Do you have some references? AFAIK, R formula drop the first category alphabetically ascending order when encode string/category feature (which is consistent with your PR description). I think test("StringIndexer order types") in #17879 is correct. Could you double check this?

* | 'frequencyDesc' | most frequent category ('b') | least frequent category ('c') |
* | 'frequencyAsc' | least frequent category ('c') | most frequent category ('b') |
* | 'alphabetDesc' | first alphabetical category ('a') | last alphabetical category ('c') |
* | 'alphabetAsc' | last alphabetical category ('c') | first alphabetical category ('a')|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ascending is going up i.e. A-Z, descending is going down i.e. Z-A. So in this example, first alphabetical category is "c" if order by alphabetDesc, first alphabetical category is "a" if order by alphabetAsc. Please correct me if I have misunderstand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right. I'll fix this example.

* The last category after ordering is dropped when encoding strings.
* Supported options: 'frequencyDesc', 'frequencyAsc', 'alphabetDesc', 'alphabetAsc'.
* The default value is 'frequencyDesc'. When the ordering is set to 'alphabetDesc', `RFormula`
* drops the same category as R when encoding strings.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order should be alphabetAsc to match R.

(i.e., "aaz" is treated as the reference level).
However, the column order is still different:
R renders the columns in ascending alphabetical order ("bar", "foo"), while
RFormula renders the columns in descending alphabetical order ("foo", "bar").
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

R and RFormula should behavior consistent if you fix the issue I mentioned above.

@actuaryzhang
Copy link
Contributor Author

actuaryzhang commented May 22, 2017

@yanboliang I understand your points. The issue is OneHotEncoder only supports dropLast.
The ideal solution to match R exactly (both the category dropped and ordering of feature columns) will be use alphabetAsc in StringIndexer and dropFirst in OneHotEncoder.

Without changing OneHotEncoder to allow dropFirst, the best I can do in this PR is to match only the category that is dropped in R (i.e., use alphabetDesc and dropLast). This will make sure the model interpretation and magnitude of coefficients are consistent with R, but the ordering among the feature columns are still different, which is a minor issue IMO compared to making the category dropped consistent. That's also why I sorted the coefficients first in the example above to compare GLM results.

Please let me know if this is clear. I'm hesitating to change OneHotEncoder by adding a dropFirst option because that may lead to some disruption.

@actuaryzhang
Copy link
Contributor Author

@felixcheung Is the html tag <table> supported? Tried this but failed to compile...

*
* The options are explained using an example `'b', 'a', 'b', 'a', 'c', 'b'`:
* {{{
* +-----------------+---------------------------------------+----------------------------------+
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to suggest just to write out as prose with a simple list if we are all fine for now, which I guess we would generally agree with.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon Would you please clarify what you mean by a list? Thanks.
I would like to preserve the table structure because it helps show the difference.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sure, I initially meant a HTML list that we are already using -

* <ul>
* <li>`primitivesAsString` (default `false`): infers all primitive values as a string type</li>
* <li>`prefersDecimal` (default `false`): infers all floating-point values as a decimal
* type. If the values do not fit in decimal, then it infers them as doubles.</li>
* <li>`allowComments` (default `false`): ignores Java/C++ style comment in JSON records</li>
* <li>`allowUnquotedFieldNames` (default `false`): allows unquoted JSON field names</li>
* <li>`allowSingleQuotes` (default `true`): allows single quotes in addition to double quotes
* </li>
* <li>`allowNumericLeadingZeros` (default `false`): allows leading zeros in numbers
* (e.g. 00012)</li>
* <li>`allowBackslashEscapingAnyCharacter` (default `false`): allows accepting quoting of all
* character using backslash quoting mechanism</li>
* <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
* during parsing.
* <ul>
* <li>`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record, and puts
* the malformed string into a field configured by `columnNameOfCorruptRecord`. To keep
* corrupt records, an user can set a string type field named `columnNameOfCorruptRecord`
* in an user-defined schema. If a schema does not have the field, it drops corrupt records
* during parsing. When inferring a schema, it implicitly adds a `columnNameOfCorruptRecord`
* field in an output schema.</li>
* <li>`DROPMALFORMED` : ignores the whole corrupted records.</li>
* <li>`FAILFAST` : throws an exception when it meets corrupted records.</li>
* </ul>
* </li>
* <li>`columnNameOfCorruptRecord` (default is the value specified in
* `spark.sql.columnNameOfCorruptRecord`): allows renaming the new field having malformed string
* created by `PERMISSIVE` mode. This overrides `spark.sql.columnNameOfCorruptRecord`.</li>
* <li>`dateFormat` (default `yyyy-MM-dd`): sets the string that indicates a date format.
* Custom date formats follow the formats at `java.text.SimpleDateFormat`. This applies to
* date type.</li>
* <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss.SSSXXX`): sets the string that
* indicates a timestamp format. Custom date formats follow the formats at
* `java.text.SimpleDateFormat`. This applies to timestamp type.</li>
* <li>`wholeFile` (default `false`): parse one record, which may span multiple lines,
* per file</li>
* </ul>
...

<ul>
  <li> abc </li>
  <li> abc </li>
</ul>

I just tested it to double-check a wiki-style list ( - ) - http://subnormalnumbers.blogspot.kr/2011/08/scaladoc-wiki-syntax.html. This does not work correctly as below (but please go ahead if you know any compatible way for both Scaladoc and Javadoc):

   *  1. item one
   *
   *  1. item two
   *    - sublist
   *    - next item
   *
   *  1. now for broken sub-numbered list, the leading item must be one of
   *     `-`, `1.`, `I.`, `i.`, `A.`, or `a.`. And it must be followed by a space.
   *    1. one
   *    2. two
   *    3. three
   *
   *  1. list types
   *    I. one
   *      i. one
   *      i. two
   *    I. two
   *      A. one
   *      A. two
   *    I. three
   *      a. one
   *      a. two

Scaladoc

2017-05-23 9 52 51

Javadoc

2017-05-23 9 53 07

My worry is, it draws attention with a different format. I believe we have similar instances but wonder if it is worth changing only this one. I would not strongly against but {{{ ... }}} basically means codes. If we can't find a better way to render this, I would leave this out as prose with a list.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I am not supposed to make a decision call though. Please let me know @felixcheung and @yanboliang if you have any preference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon Thanks for the clarification. I don't think list paints a clear picture here. Would rather keep the table structure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

according to this, table is https://wiki.scala-lang.org/display/SW/Syntax

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felixcheung @HyukjinKwon The scaladoc complied, but the javadoc failed... Not sure if there is additional config for java?

image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think javadoc 8 complains about HTML. It looks this works:

   * <table summary="abc">
   * <tr>
   *   <th>Firstname</th>
   *   <th>Lastname</th>
   *   <th>Age</th>
   * </tr>
   * <tr>
   *   <td>Jill</td>
   *   <td>Smith</td>
   *   <td>50</td>
   * </tr>
   * <tr>
   *   <td>Eve</td>
   *   <td>Jackson</td>
   *   <td>94</td>
   *  </tr>
   * </table>

Scaladoc

2017-05-23 4 28 14

Javadoc

2017-05-23 4 27 55

Other errors probably are spurious (please refer https://issues.apache.org/jira/browse/SPARK-20840 which I am fighting with right now).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon Nice. Thanks much. One issue is in the scaladoc, the columns are very close to each other. How to add spacing between columns in the scaladoc?
I tried <table cellspacing="4"> but it does not seem to take any effect.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure. it did not work for me too ...

@SparkQA
Copy link

SparkQA commented May 23, 2017

Test build #77209 has finished for PR 17967 at commit 1a1e06c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang
Copy link
Contributor Author

actuaryzhang commented May 23, 2017

@yanboliang I updated the example in the param doc. I hope it is clear now that RFormula with alphabetDesc in StringIndexer and dropLast in OneHotEncoder (default) drops the same category as R, i.e., the first alphabetic category.

@yanboliang
Copy link
Contributor

@actuaryzhang Thanks for your clarification, it makes sense. This looks good to me. @HyukjinKwon @felixcheung What do you think of the documentation issue?

@HyukjinKwon
Copy link
Member

HyukjinKwon commented May 24, 2017

Personally, I would prefer prose, a HTML list or table one. But I am fine with the current status if this is okay to all of you here (as I guess none of them is particularly better given all the comments above).

@felixcheung
Copy link
Member

felixcheung commented May 24, 2017

I think a html table is better? #17967 (comment)
+ @srowen for your opinion- to be honest I don't think I've actually seen a table in Spark scaladoc/javadoc.

@actuaryzhang
Copy link
Contributor Author

actuaryzhang commented May 24, 2017

I tried using <table cellspacing="4">, but in Scaladoc, it is not correctly formatted. I tried a few other options, but it seems the html attributes are ignored in Scaladoc. Below is the output from using html <table> tag.

image

@actuaryzhang
Copy link
Contributor Author

This is what we get from the current doc:

image

@felixcheung
Copy link
Member

given that I think I'm ok with an ascii table as a one time thing.
thoughts?

@actuaryzhang
Copy link
Contributor Author

@felixcheung @yanboliang I'm fine with either the ascii table or the html table. It's your call.
Hope to get over this minor doc issue and get this PR in soon. I can update the doc later if we find a better way. Thanks much.

@felixcheung
Copy link
Member

yes I'd hold this for a day.

@yanboliang
Copy link
Contributor

Merged into master, thanks for all.

@asfgit asfgit closed this in f47700c May 26, 2017
@actuaryzhang actuaryzhang deleted the RFormula branch May 26, 2017 05:38
asfgit pushed a commit that referenced this pull request May 30, 2017
…rmula

## What changes were proposed in this pull request?

PySpark supports stringIndexerOrderType in RFormula as in #17967.

## How was this patch tested?
docstring test

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #18122 from actuaryzhang/PythonRFormula.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants