-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-14659][ML] RFormula consistent with R when handling strings #17967
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #76877 has finished for PR 17967 at commit
|
@felixcheung @viirya @ericl @yinxusen
Note the last one with
|
Some points to note:
|
Test build #76878 has finished for PR 17967 at commit
|
cool - I think this is important to have. do you have a higher level example of the old/new model output as compare to R as it is affected by the string ordering? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a comment
*/ | ||
@Since("2.3.0") | ||
final val stringOrderType: Param[String] = new Param(this, "stringOrderType", | ||
"how to order labels of string column. " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should it be capitalize "How
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed it, thanks!
@felixcheung Thanks for the review. I fixed some typo.
The following is the estimate from R, which is the same as
|
Test build #76913 has finished for PR 17967 at commit
|
thanks for the example, I think that's very concrete that this change would be very useful |
@felixcheung Once this PR gets in, I'll update the SparkR side and include some test. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. @yanboliang @jkbradley if you would like to review
@@ -37,6 +37,29 @@ import org.apache.spark.sql.types._ | |||
*/ | |||
private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol { | |||
|
|||
/** | |||
* Param for how to order labels of string column. The first label after ordering is assigned |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a comment explaining which option is consistent with R?
@viirya Great point. Added a comment to explain this in the doc. |
LGTM |
Test build #77085 has finished for PR 17967 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@actuaryzhang Thanks for this PR, and also thanks for reviewing from @felixcheung and @viirya . One quick question: Do you know what is the typical scenario to use alphabetDesc
or alphabetAsc
in R? Thanks.
@@ -37,6 +37,31 @@ import org.apache.spark.sql.types._ | |||
*/ | |||
private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol { | |||
|
|||
/** | |||
* Param for how to order labels of string column. The first label after ordering is assigned |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please mentions of label/labels
in all comments, since we handle all string columns regardless of feature columns or label column. The current description may confused users.
* @group param | ||
*/ | ||
@Since("2.3.0") | ||
final val stringOrderType: Param[String] = new Param(this, "stringOrderType", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about orderTypeOfStringIndexer
or any other suggestion? I think the param name stringOrderType
in StringIndexer
is clear enough, but RFormula
involves lots of feature transformers across many steps, we should make users understand this param only takes effect on StringIndexer
stage. And it's better to document more clear rather than copying corresponding section from StringIndexer
. cc @felixcheung
@yanboliang Thanks for the question. The alphabetically ascending order in R is very convenient for display purpose. For example, when you do a summary of model results, the results will be easier to understand if it is in alphabetically ascending order. That's the default, but oftentimes users will reset the reference level to make the most frequent level as the base (the one dropped in one-hot encoding). This also facilitates interpretation, because the most frequent level can be roughly regarded as the population average (in very unbalanced data). Otherwise, especially in unbalanced data, the contrast between categories with few data is most times insignificant. Of course, this does not change the model, but it is very important for interpretation. I understand that ordering string levels by descending frequency is helpful for other applications like tree based split decisions. But it will make the ML library much better if we can support these other options that are often used in day-to-day work. This will broaden the use case of Spark ML. |
@yanboliang Thanks for the review and suggestion. Makes lots of sense. I made a new commit to address these. |
Test build #77110 has finished for PR 17967 at commit
|
* The last category after ordering is dropped when encoding strings. | ||
* The options are explained using an example string: 'b', 'a', 'b', 'a', 'c', 'b' | ||
* | | ||
* | Option | Category mapped to 0 by StringIndexer | Category dropped by RFormula |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this going to generate the right format, @HyukjinKwon do you know?
I understand not all markdown style is supported
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, up to my knowledge, it looks not. I guess @actuaryzhang meant to just write these out as they are? Let me double check by myself ....
Scaladoc
Javadoc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My humble suggestion is prose with simple -
or resembling any other formats in this package if there are similar instances.
@felixcheung @HyukjinKwon Thanks much for pointing out the documentation issues. |
Test build #77116 has finished for PR 17967 at commit
|
(FWIW, |
@HyukjinKwon @felixcheung I confirm it works for Javadoc. |
Test build #77130 has finished for PR 17967 at commit
|
hmm, should we just use html |
@@ -37,6 +37,42 @@ import org.apache.spark.sql.types._ | |||
*/ | |||
private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol { | |||
|
|||
/** | |||
* Param for how to order categories of a string FEATURE column used by `StringIndexer`. | |||
* The last category after ordering is dropped when encoding strings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this correct? Do you have some references? AFAIK, R formula drop the first category alphabetically ascending order when encode string/category feature (which is consistent with your PR description). I think test("StringIndexer order types")
in #17879 is correct. Could you double check this?
* | 'frequencyDesc' | most frequent category ('b') | least frequent category ('c') | | ||
* | 'frequencyAsc' | least frequent category ('c') | most frequent category ('b') | | ||
* | 'alphabetDesc' | first alphabetical category ('a') | last alphabetical category ('c') | | ||
* | 'alphabetAsc' | last alphabetical category ('c') | first alphabetical category ('a')| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ascending
is going up i.e. A-Z, descending
is going down i.e. Z-A. So in this example, first alphabetical category is "c" if order by alphabetDesc
, first alphabetical category is "a" if order by alphabetAsc
. Please correct me if I have misunderstand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right. I'll fix this example.
* The last category after ordering is dropped when encoding strings. | ||
* Supported options: 'frequencyDesc', 'frequencyAsc', 'alphabetDesc', 'alphabetAsc'. | ||
* The default value is 'frequencyDesc'. When the ordering is set to 'alphabetDesc', `RFormula` | ||
* drops the same category as R when encoding strings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The order should be alphabetAsc
to match R.
(i.e., "aaz" is treated as the reference level). | ||
However, the column order is still different: | ||
R renders the columns in ascending alphabetical order ("bar", "foo"), while | ||
RFormula renders the columns in descending alphabetical order ("foo", "bar"). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
R and RFormula should behavior consistent if you fix the issue I mentioned above.
@yanboliang I understand your points. The issue is Without changing Please let me know if this is clear. I'm hesitating to change |
@felixcheung Is the html tag |
* | ||
* The options are explained using an example `'b', 'a', 'b', 'a', 'c', 'b'`: | ||
* {{{ | ||
* +-----------------+---------------------------------------+----------------------------------+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to suggest just to write out as prose with a simple list if we are all fine for now, which I guess we would generally agree with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HyukjinKwon Would you please clarify what you mean by a list
? Thanks.
I would like to preserve the table structure because it helps show the difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, sure, I initially meant a HTML list that we are already using -
spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
Lines 304 to 340 in 04901dd
* <ul> | |
* <li>`primitivesAsString` (default `false`): infers all primitive values as a string type</li> | |
* <li>`prefersDecimal` (default `false`): infers all floating-point values as a decimal | |
* type. If the values do not fit in decimal, then it infers them as doubles.</li> | |
* <li>`allowComments` (default `false`): ignores Java/C++ style comment in JSON records</li> | |
* <li>`allowUnquotedFieldNames` (default `false`): allows unquoted JSON field names</li> | |
* <li>`allowSingleQuotes` (default `true`): allows single quotes in addition to double quotes | |
* </li> | |
* <li>`allowNumericLeadingZeros` (default `false`): allows leading zeros in numbers | |
* (e.g. 00012)</li> | |
* <li>`allowBackslashEscapingAnyCharacter` (default `false`): allows accepting quoting of all | |
* character using backslash quoting mechanism</li> | |
* <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records | |
* during parsing. | |
* <ul> | |
* <li>`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record, and puts | |
* the malformed string into a field configured by `columnNameOfCorruptRecord`. To keep | |
* corrupt records, an user can set a string type field named `columnNameOfCorruptRecord` | |
* in an user-defined schema. If a schema does not have the field, it drops corrupt records | |
* during parsing. When inferring a schema, it implicitly adds a `columnNameOfCorruptRecord` | |
* field in an output schema.</li> | |
* <li>`DROPMALFORMED` : ignores the whole corrupted records.</li> | |
* <li>`FAILFAST` : throws an exception when it meets corrupted records.</li> | |
* </ul> | |
* </li> | |
* <li>`columnNameOfCorruptRecord` (default is the value specified in | |
* `spark.sql.columnNameOfCorruptRecord`): allows renaming the new field having malformed string | |
* created by `PERMISSIVE` mode. This overrides `spark.sql.columnNameOfCorruptRecord`.</li> | |
* <li>`dateFormat` (default `yyyy-MM-dd`): sets the string that indicates a date format. | |
* Custom date formats follow the formats at `java.text.SimpleDateFormat`. This applies to | |
* date type.</li> | |
* <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss.SSSXXX`): sets the string that | |
* indicates a timestamp format. Custom date formats follow the formats at | |
* `java.text.SimpleDateFormat`. This applies to timestamp type.</li> | |
* <li>`wholeFile` (default `false`): parse one record, which may span multiple lines, | |
* per file</li> | |
* </ul> |
<ul>
<li> abc </li>
<li> abc </li>
</ul>
I just tested it to double-check a wiki-style list ( -
) - http://subnormalnumbers.blogspot.kr/2011/08/scaladoc-wiki-syntax.html. This does not work correctly as below (but please go ahead if you know any compatible way for both Scaladoc and Javadoc):
* 1. item one
*
* 1. item two
* - sublist
* - next item
*
* 1. now for broken sub-numbered list, the leading item must be one of
* `-`, `1.`, `I.`, `i.`, `A.`, or `a.`. And it must be followed by a space.
* 1. one
* 2. two
* 3. three
*
* 1. list types
* I. one
* i. one
* i. two
* I. two
* A. one
* A. two
* I. three
* a. one
* a. two
Scaladoc
Javadoc
My worry is, it draws attention with a different format. I believe we have similar instances but wonder if it is worth changing only this one. I would not strongly against but {{{ ... }}}
basically means codes. If we can't find a better way to render this, I would leave this out as prose with a list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I am not supposed to make a decision call though. Please let me know @felixcheung and @yanboliang if you have any preference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HyukjinKwon Thanks for the clarification. I don't think list
paints a clear picture here. Would rather keep the table structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
according to this, table is https://wiki.scala-lang.org/display/SW/Syntax
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@felixcheung @HyukjinKwon The scaladoc complied, but the javadoc failed... Not sure if there is additional config for java?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think javadoc 8 complains about HTML. It looks this works:
* <table summary="abc">
* <tr>
* <th>Firstname</th>
* <th>Lastname</th>
* <th>Age</th>
* </tr>
* <tr>
* <td>Jill</td>
* <td>Smith</td>
* <td>50</td>
* </tr>
* <tr>
* <td>Eve</td>
* <td>Jackson</td>
* <td>94</td>
* </tr>
* </table>
Scaladoc
Javadoc
Other errors probably are spurious (please refer https://issues.apache.org/jira/browse/SPARK-20840 which I am fighting with right now).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HyukjinKwon Nice. Thanks much. One issue is in the scaladoc, the columns are very close to each other. How to add spacing between columns in the scaladoc?
I tried <table cellspacing="4">
but it does not seem to take any effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure. it did not work for me too ...
Test build #77209 has finished for PR 17967 at commit
|
@yanboliang I updated the example in the param doc. I hope it is clear now that RFormula with |
@actuaryzhang Thanks for your clarification, it makes sense. This looks good to me. @HyukjinKwon @felixcheung What do you think of the documentation issue? |
Personally, I would prefer prose, a HTML list or table one. But I am fine with the current status if this is okay to all of you here (as I guess none of them is particularly better given all the comments above). |
I think a html table is better? #17967 (comment) |
given that I think I'm ok with an ascii table as a one time thing. |
@felixcheung @yanboliang I'm fine with either the ascii table or the html table. It's your call. |
yes I'd hold this for a day. |
Merged into master, thanks for all. |
What changes were proposed in this pull request?
When handling strings, the category dropped by RFormula and R are different:
This PR supports different string ordering types in StringIndexer #17879 so that RFormula can drop the same level as R when handling strings using
stringOrderType = "alphabetDesc"
.How was this patch tested?
new tests