
[SPARK-8601][ML] Add an option to disable standardization for linear regression #7875


Closed
dbtsai wants to merge 16 commits into master from dbtsai/SPARK-8522

Conversation

@dbtsai (Member) commented Aug 2, 2015

All compressed sensing applications, and some of the regression use cases, get better results by turning feature scaling off. However, implementing this naively by training on the dataset without any standardization gives a poor rate of convergence. Instead, we can still standardize the training dataset but penalize each component differently, which yields effectively the same objective function as the unstandardized problem while being numerically better conditioned. As a result, columns with high variance are penalized less, and vice versa. Without this adjustment, all features are standardized, so they are penalized equally.
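To make the trick concrete, here is a minimal sketch of the reweighted L2 penalty (illustrative only, not Spark's actual implementation; wHat, sigma, and l2Penalty are hypothetical names). Training runs on scaled features xHat(j) = x(j) / sigma(j), so penalizing the original-scale coefficient w(j) = wHat(j) / sigma(j) from inside the standardized problem divides each component's penalty by sigma(j)^2:

def l2Penalty(
    wHat: Array[Double],        // coefficients in the standardized feature space
    sigma: Array[Double],       // per-feature standard deviations
    regParam: Double,           // the regularization strength lambda
    standardization: Boolean): Double = {
  var sum = 0.0
  var j = 0
  while (j < wHat.length) {
    // With standardization, penalize the scaled coefficient directly;
    // without it, penalize the original-scale coefficient wHat(j) / sigma(j),
    // so high-variance columns (large sigma) are penalized less.
    val wj = if (standardization) wHat(j) else wHat(j) / sigma(j)
    sum += wj * wj
    j += 1
  }
  0.5 * regParam * sum
}

Both settings optimize over the same standardized data, so the solver keeps its good convergence behavior; only the per-component penalty weights change.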

In R's glmnet, there is an option for this:

standardize: Logical flag for x variable standardization, prior to fitting the model sequence. The coefficients are always returned on the original scale. Default is standardize=TRUE. If variables are in the same units already, you might not wish to standardize. See details below for y standardization with family="gaussian".
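For comparison, the new Spark ML option reads like this in use (a sketch against the post-merge API; trainingDF is an assumed DataFrame with the usual "features" and "label" columns):

import org.apache.spark.ml.regression.LinearRegression

// L2-regularized fit without feature standardization,
// mirroring glmnet's standardize = FALSE.
val lr = new LinearRegression()
  .setRegParam(0.1)
  .setElasticNetParam(0.0)    // pure L2
  .setStandardization(false)  // the option added by this PR

val model = lr.fit(trainingDF)  // trainingDF is assumed to exist
println(model.weights)          // always reported on the original scale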

Note that the primary author for this PR is @holdenk

@SparkQA commented Aug 2, 2015

Test build #39448 has finished for PR 7875 at commit d6234ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -85,6 +85,18 @@ class LinearRegression(override val uid: String)
setDefault(fitIntercept -> true)

/**
* Whether to standardize the training features before fitting the model.
* The coefficients of models will be always returned on the original scale,
* so it will be transparent for users. Note that when no regularization,

Contributor commented on the diff:

probably s/when no/without/ or s/when no regularization/when no regularization is applied/ is a bit easier to read (but it's a minor nit so no stress).

@dbtsai (Member, Author) replied:
Thanks.

@holdenk (Contributor) commented Aug 2, 2015

Some minor nits but otherwise mostly LGTM :)

@SparkQA commented Aug 2, 2015

Test build #39459 has finished for PR 7875 at commit baa0805.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 3, 2015

Test build #39465 has finished for PR 7875 at commit bbff347.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 3, 2015

Test build #39472 has finished for PR 7875 at commit 596e96c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai (Member, Author) commented Aug 4, 2015

cc @jkbradley Can you help review this PR? It's pretty much the same as 5722193, which was already merged. Note that the primary author is @holdenk, so you may want to change the author attribution when merging. Thanks.

@jkbradley (Member) commented:
Sure, I can take a look tomorrow.

@jkbradley (Member) commented:
Making a pass now

@jkbradley (Member) commented:
I don't see any problems. One comment: not in this PR, but in the future it'd be nice to generalize a lot of this code to GLMs and reduce the duplication between linear and logistic regression. LGTM, though I'll test once more since the code has been changing quickly.

Jenkins test this please

@SparkQA commented Aug 4, 2015

Test build #1335 has finished for PR 7875 at commit 596e96c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 4, 2015

Test build #39771 has finished for PR 7875 at commit e856036.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai (Member, Author) commented Aug 4, 2015

+1 for refactoring the code so LoR and LiR can share the duplicated code. I'd like to work on it post-1.5.

@jkbradley (Member) commented:
Sounds good!
Merging this with master and branch-1.5

asfgit pushed a commit that referenced this pull request Aug 5, 2015
[SPARK-8601][ML] Add an option to disable standardization for linear regression


Author: Holden Karau <holden@pigscanfly.ca>
Author: DB Tsai <dbt@netflix.com>

Closes #7875 from dbtsai/SPARK-8522 and squashes the following commits:

e856036 [DB Tsai] scala doc
596e96c [DB Tsai] minor
bbff347 [DB Tsai] naming
baa0805 [DB Tsai] touch up
d6234ba [DB Tsai] Merge branch 'master' into SPARK-8522-Disable-Linear_featureScaling-Spark-8601-in-Linear_regression
6b1dc09 [Holden Karau] Merge branch 'master' into SPARK-8522-Disable-Linear_featureScaling-Spark-8601-in-Linear_regression
332f140 [Holden Karau] Merge in master
eebe10a [Holden Karau] Use same comparision operator throughout the test
3f92935 [Holden Karau] merge
b83a41e [Holden Karau] Expand the tests and make them similar to the other PR also providing an option to disable standardization (but for LoR).
0c334a2 [Holden Karau] Remove extra line
99ce053 [Holden Karau] merge in master
e54a8a9 [Holden Karau] Fix long line
e47c574 [Holden Karau] Add support for L2 without standardization.
55d3a66 [Holden Karau] Add standardization param for linear regression
00a1dc5 [Holden Karau] Add the param to the linearregression impl

(cherry picked from commit d92fa14)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
asfgit closed this in d92fa14 on Aug 5, 2015
dbtsai deleted the SPARK-8522 branch on September 15, 2015
4 participants