
[SPARK-8601][ML] Add an option to disable standardization for linear regression #7875


Closed
dbtsai wants to merge 16 commits into master from dbtsai/SPARK-8522

Conversation

@dbtsai (Member) commented Aug 2, 2015

All compressed sensing applications, and some of the regression use cases, get better results by turning feature scaling off. However, implementing this naively by training on the dataset without any standardization gives a poor rate of convergence. Instead, we can still standardize the training dataset but penalize each component differently, which yields effectively the same objective function as the unstandardized problem while being numerically better conditioned. As a result, columns with high variance are penalized less, and vice versa. Without this adjustment, all features are standardized, so they are penalized equally.
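To make the trick concrete, here is a minimal sketch of the reweighted L2 penalty (illustrative only, not Spark's actual implementation; wHat, sigma, and l2Penalty are hypothetical names). Training runs on scaled features xHat(j) = x(j) / sigma(j), so penalizing the original-scale coefficient w(j) = wHat(j) / sigma(j) from inside the standardized problem divides each component's penalty by sigma(j)^2:

def l2Penalty(
    wHat: Array[Double],        // coefficients in the standardized feature space
    sigma: Array[Double],       // per-feature standard deviations
    regParam: Double,           // the regularization strength lambda
    standardization: Boolean): Double = {
  var sum = 0.0
  var j = 0
  while (j < wHat.length) {
    // With standardization, penalize the scaled coefficient directly;
    // without it, penalize the original-scale coefficient wHat(j) / sigma(j),
    // so high-variance columns (large sigma) are penalized less.
    val wj = if (standardization) wHat(j) else wHat(j) / sigma(j)
    sum += wj * wj
    j += 1
  }
  0.5 * regParam * sum
}

Both settings optimize over the same standardized data, so the solver keeps its good convergence behavior; only the per-component penalty weights change.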

In R's glmnet, there is an option for this:

standardize: Logical flag for x variable standardization, prior to fitting the model sequence. The coefficients are always returned on the original scale. Default is standardize=TRUE. If variables are in the same units already, you might not wish to standardize. See details below for y standardization with family="gaussian".
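For comparison, the new Spark ML option reads like this in use (a sketch against the post-merge API; trainingDF is an assumed DataFrame with the usual "features" and "label" columns):

import org.apache.spark.ml.regression.LinearRegression

// L2-regularized fit without feature standardization,
// mirroring glmnet's standardize = FALSE.
val lr = new LinearRegression()
  .setRegParam(0.1)
  .setElasticNetParam(0.0)    // pure L2
  .setStandardization(false)  // the option added by this PR

val model = lr.fit(trainingDF)  // trainingDF is assumed to exist
println(model.weights)          // always reported on the original scale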

Note that the primary author for this PR is @holdenk

@SparkQA commented Aug 2, 2015

Test build #39448 has finished for PR 7875 at commit d6234ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -85,6 +85,18 @@ class LinearRegression(override val uid: String)
setDefault(fitIntercept -> true)

/**
* Whether to standardize the training features before fitting the model.
* The coefficients of models will be always returned on the original scale,
* so it will be transparent for users. Note that when no regularization,

Contributor commented on the diff:

probably s/when no/without/ or s/when no regularization/when no regularization is applied/ is a bit easier to read (but it's a minor nit so no stress).

@dbtsai (Member, Author) replied:
Thanks.

@holdenk (Contributor) commented Aug 2, 2015

Some minor nits but otherwise mostly LGTM :)

@SparkQA commented Aug 2, 2015

Test build #39459 has finished for PR 7875 at commit baa0805.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 3, 2015

Test build #39465 has finished for PR 7875 at commit bbff347.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 3, 2015

Test build #39472 has finished for PR 7875 at commit 596e96c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai (Member, Author) commented Aug 4, 2015

cc @jkbradley Can you help review this PR? It's pretty much the same as 5722193, which was already merged. Note that the primary author is @holdenk, so you may want to change the author attribution when merging. Thanks.

@jkbradley (Member) commented:
Sure, I can take a look tomorrow.

@jkbradley (Member) commented:
Making a pass now

@jkbradley (Member) commented:
I don't see any problems. One comment: not in this PR, but in the future it'd be nice to generalize a lot of this code to GLMs and reduce the duplication between linear and logistic regression. LGTM, though I'll test once more since the code has been changing quickly.

Jenkins test this please

@SparkQA commented Aug 4, 2015

Test build #1335 has finished for PR 7875 at commit 596e96c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 4, 2015

Test build #39771 has finished for PR 7875 at commit e856036.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai (Member, Author) commented Aug 4, 2015

+1 for refactoring the code so LoR and LiR can share the duplicated code. I'd like to work on it post-1.5.

@jkbradley (Member) commented:
Sounds good!
Merging this with master and branch-1.5

asfgit pushed a commit that referenced this pull request Aug 5, 2015
[SPARK-8601][ML] Add an option to disable standardization for linear regression


Author: Holden Karau <holden@pigscanfly.ca>
Author: DB Tsai <dbt@netflix.com>

Closes #7875 from dbtsai/SPARK-8522 and squashes the following commits:

e856036 [DB Tsai] scala doc
596e96c [DB Tsai] minor
bbff347 [DB Tsai] naming
baa0805 [DB Tsai] touch up
d6234ba [DB Tsai] Merge branch 'master' into SPARK-8522-Disable-Linear_featureScaling-Spark-8601-in-Linear_regression
6b1dc09 [Holden Karau] Merge branch 'master' into SPARK-8522-Disable-Linear_featureScaling-Spark-8601-in-Linear_regression
332f140 [Holden Karau] Merge in master
eebe10a [Holden Karau] Use same comparision operator throughout the test
3f92935 [Holden Karau] merge
b83a41e [Holden Karau] Expand the tests and make them similar to the other PR also providing an option to disable standardization (but for LoR).
0c334a2 [Holden Karau] Remove extra line
99ce053 [Holden Karau] merge in master
e54a8a9 [Holden Karau] Fix long line
e47c574 [Holden Karau] Add support for L2 without standardization.
55d3a66 [Holden Karau] Add standardization param for linear regression
00a1dc5 [Holden Karau] Add the param to the linearregression impl

(cherry picked from commit d92fa14)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
asfgit closed this in d92fa14 on Aug 5, 2015
dbtsai deleted the SPARK-8522 branch on September 15, 2015
4 participants