[SPARK-8601][ML] Add an option to disable standardization for linear regression #7875
Conversation
Test build #39448 has finished for PR 7875 at commit
@@ -85,6 +85,18 @@ class LinearRegression(override val uid: String)
  setDefault(fitIntercept -> true)

  /**
   * Whether to standardize the training features before fitting the model.
   * The coefficients of models will be always returned on the original scale,
   * so it will be transparent for users. Note that when no regularization,
probably s/when no/without/ or s/when no regularization/when no regularization is applied/ is a bit easier to read (but it's a minor nit, so no stress).
Thanks.
Some minor nits but otherwise mostly LGTM :)
Test build #39459 has finished for PR 7875 at commit
Test build #39465 has finished for PR 7875 at commit
Test build #39472 has finished for PR 7875 at commit
cc @jkbradley Can you help code review this PR? This one is pretty much the same as 5722193, which was merged. Note that the primary author is @holdenk, so you may want to change the author. Thanks.
Sure, I can take a look tomorrow.
Making a pass now
I don't see any problems. One comment: not in this PR, but in the future, it'd be nice to generalize a lot of the code to GLMs and reduce duplication between linreg and logreg. LGTM, though maybe I'll test once more since the code has been changing quickly.
Jenkins test this please
Test build #1335 has finished for PR 7875 at commit
Test build #39771 has finished for PR 7875 at commit
+1 for refactoring the code so LoR and LiR can share the duplicated code. I would like to work on it post 1.5.
Sounds good!
… regression

Note that the primary author for this PR is holdenk

Author: Holden Karau <holden@pigscanfly.ca>
Author: DB Tsai <dbt@netflix.com>

Closes #7875 from dbtsai/SPARK-8522 and squashes the following commits:

e856036 [DB Tsai] scala doc
596e96c [DB Tsai] minor
bbff347 [DB Tsai] naming
baa0805 [DB Tsai] touch up
d6234ba [DB Tsai] Merge branch 'master' into SPARK-8522-Disable-Linear_featureScaling-Spark-8601-in-Linear_regression
6b1dc09 [Holden Karau] Merge branch 'master' into SPARK-8522-Disable-Linear_featureScaling-Spark-8601-in-Linear_regression
332f140 [Holden Karau] Merge in master
eebe10a [Holden Karau] Use same comparision operator throughout the test
3f92935 [Holden Karau] merge
b83a41e [Holden Karau] Expand the tests and make them similar to the other PR also providing an option to disable standardization (but for LoR).
0c334a2 [Holden Karau] Remove extra line
99ce053 [Holden Karau] merge in master
e54a8a9 [Holden Karau] Fix long line
e47c574 [Holden Karau] Add support for L2 without standardization.
55d3a66 [Holden Karau] Add standardization param for linear regression
00a1dc5 [Holden Karau] Add the param to the linearregression impl

(cherry picked from commit d92fa14)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
All compressed sensing applications, and some regression use cases, get better results when feature scaling is turned off. However, if we implement this naively by training on the dataset without any standardization, the rate of convergence will be poor. Instead, we can still standardize the training dataset but penalize each component differently, which yields effectively the same objective function while posing a better-conditioned numerical problem. As a result, columns with high variance are penalized less, and vice versa. Without this adjustment, all features are standardized and therefore penalized equally.
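To make the rescaling concrete, here is a minimal standalone Scala sketch (not the Spark implementation itself; the arrays, names, and values are purely illustrative) showing that dividing each standardized coefficient's L2 penalty by that column's variance recovers the original-scale penalty, so high-variance columns are effectively penalized less:

object PenaltyRescalingSketch {
  def main(args: Array[String]): Unit = {
    // Per-feature standard deviations and coefficients on the original scale.
    val sigma = Array(0.5, 10.0, 2.0)
    val wOrig = Array(1.2, -0.3, 4.0)
    val lambda = 0.1

    // Coefficients the optimizer sees when each feature is divided by sigma:
    // x_j' = x_j / sigma_j  implies  w_j' = w_j * sigma_j.
    val wStd = wOrig.zip(sigma).map { case (w, s) => w * s }

    // L2 penalty on the original scale.
    val l2Orig = lambda * wOrig.map(w => w * w).sum

    // Same penalty expressed in the standardized space by down-weighting
    // high-variance columns: component j is penalized by lambda / sigma_j^2.
    val l2Rescaled = wStd.zip(sigma).map { case (w, s) => lambda / (s * s) * w * w }.sum

    println(f"original-scale L2 penalty: $l2Orig%.6f")
    println(f"rescaled standardized L2:  $l2Rescaled%.6f") // identical by construction
  }
}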
In R, there is an option for this.
standardize
Logical flag for x variable standardization, prior to fitting the model sequence. The coefficients are always returned on the original scale. Default is standardize=TRUE. If variables are in the same units already, you might not wish to standardize. See details below for y standardization with family="gaussian".
Note that the primary author for this PR is @holdenk
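For reference, a minimal usage sketch of the option this PR adds. It assumes a recent Spark build where LinearRegression exposes setStandardization and vectors live under org.apache.spark.ml.linalg (in the 1.x line the vector classes were under org.apache.spark.mllib.linalg); the tiny dataset is purely illustrative:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

object StandardizationFlagExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("disable-standardization-example")
      .master("local[*]")
      .getOrCreate()

    // Toy data with features on very different scales.
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.010, 1000.0)),
      (2.0, Vectors.dense(0.020, 2100.0)),
      (3.0, Vectors.dense(0.031, 2900.0)),
      (2.5, Vectors.dense(0.024, 2600.0))
    )).toDF("label", "features")

    val lr = new LinearRegression()
      .setMaxIter(100)
      .setRegParam(0.1)
      .setElasticNetParam(0.0)    // pure L2 penalty
      .setStandardization(false)  // keep the penalty on the original feature scale

    val model = lr.fit(training)
    // Coefficients are reported on the original scale whether or not
    // standardization is enabled; the flag only changes the effective penalty.
    println(model.coefficients)

    spark.stop()
  }
}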