[SPARK-7685][ML] Apply weights to different samples in Logistic Regression #7884
Conversation
Test build #39507 has finished for PR 7884 at commit
Test build #39510 has finished for PR 7884 at commit
Test build #39508 has finished for PR 7884 at commit
Test build #39512 has finished for PR 7884 at commit
Test build #39521 has finished for PR 7884 at commit
I'll make a pass through this. (Please let me know if I should wait b/c of ongoing updates.)
Initial thought: I don't think the user should need 2 parameters to specify whether to use sample weights. What if, instead, there were no default weightCol? If the user specifies weightCol, then we use sample weights. If weightCol is unspecified or an empty String, then we don't use sample weights. Then we could eliminate HasWeightedSample.
@jkbradley Sounds fair. I will also add tests now. BTW, can you merge https://github.com/apache/spark/pull/7875/files, which is ready to be merged? I will have another PR for weighted linear regression which depends on this PR. Thanks.
          (label, sampleWeight, features)
        })
      } else {
        Left(extractLabeledPoints(dataset).map {
No need to create LabeledPoints if they are just temporary.
Okay. This copies the existing code for abstracting out the extraction of samples. I will remove it as you suggested.
Thanks for adding weighted instance support; very useful!
This is very useful for us as well. Our training samples have exponentially decaying weights as time goes by. I am trying to use it now, and it seems to work well. It would be nice to have a similar feature in GBDT, which is used in our current production. I'll open a JIRA and can work on the GBDT version.
Jenkins, please test again
Test build #39688 has finished for PR 7884 at commit
   * Param for sample weight column name.
   * @group param
   */
  final val sampleWeightCol: Param[String] = new Param[String](this, "sampleWeightCol", "sample weight column name")
Shall we call it weightCol, without a default value? If it is defined, use the specified column as weights; otherwise, treat all weights as 1.0.
Yes. This is the current behavior. I will change the name to weightCol.
Here's the link for the GBT JIRA: https://issues.apache.org/jira/browse/SPARK-9612
@dbtsai Earlier, I had recommended "If weightCol is unspecified or an empty String, then we don't use sample weights." It occurred to me that we should set the default to an empty String (since Params should have defaults or be required, in general).
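For illustration, here is a minimal sketch of the shared param as proposed above, assuming a `HasWeightCol`-style trait with an empty-string default; the name and behavior follow the discussion, but the merged code may place the default elsewhere or differ in details:

```scala
import org.apache.spark.ml.param.{Param, Params}

// Sketch of the proposed shared param: a single string param, defaulting
// to "" so that an unset/empty weightCol means "treat all weights as 1.0".
// No separate boolean flag (HasWeightedSample) is needed.
private[ml] trait HasWeightCol extends Params {

  /**
   * Param for weight column name. If this is not set or empty, we treat
   * all instance weights as 1.0.
   * @group param
   */
  final val weightCol: Param[String] = new Param[String](this, "weightCol",
    "weight column name. If this is not set or empty, we treat all instance weights as 1.0")

  setDefault(weightCol -> "")

  /** @group getParam */
  final def getWeightCol: String = $(weightCol)
}
```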
@jkbradley and @mengxr I had
Following this PR, how about filing a JIRA to implement weights for
@rotationsymmetry Please help create the JIRA. For
@dbtsai JIRA added: https://issues.apache.org/jira/browse/SPARK-9642 Once the logistic regression PR is merged, I will use it as the model and work on linear regression.
Test build #40013 has finished for PR 7884 at commit
Test build #40017 has finished for PR 7884 at commit
Test build #42411 has finished for PR 7884 at commit
+1 for "coefficients"
I'll make another pass
FYI, I made https://issues.apache.org/jira/browse/SPARK-10592 for deprecating
      i += 1
    }
    Vectors.dense(realMean)
  }

  /**
-   * Sample variance of each dimension.
+   * Sample unbiased estimation of variance of each dimension.
Change to: "Unbiased estimate of sample variance"
I would like to deprecate sample and use instance instead. What do you guys think?
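For reference, the quantity under discussion is presumably the standard unbiased sample variance, with a frequency-weight generalization once instance weights enter; this is one common choice, and the summarizer's exact correction factor may differ:

```latex
% Unweighted, unbiased sample variance of one dimension:
s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2
% One common frequency-weight generalization, with total weight W = \sum_i w_i:
s_w^2 = \frac{1}{W - 1} \sum_{i=1}^{n} w_i \left( x_i - \bar{x}_w \right)^2,
\qquad \bar{x}_w = \frac{1}{W} \sum_{i=1}^{n} w_i x_i
```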
Can you please update the Scala doc to clarify:
Other than those small items, this looks good.
@jkbradley For
JIRA for the behavior of
Test build #42439 has finished for PR 7884 at commit
  override def setThresholds(value: Array[Double]): this.type = super.setThresholds(value)

  override def getThresholds: Array[Double] = super.getThresholds

  override protected def train(dataset: DataFrame): LogisticRegressionModel = {
    // Extract columns from data. If dataset is persisted, do not persist oldDataset.
-   val instances = extractLabeledPoints(dataset).map {
-     case LabeledPoint(label: Double, features: Vector) => (label, features)
+   val instances: RDD[Instance] = if ($(weightCol).isEmpty) {
From my previous comment, using `val w = if ($(weightCol).isEmpty) lit(1.0) else col($(weightCol))` could simplify this block.
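As a concrete sketch of that suggestion: a single select with a constant-1.0 weight column replaces the two-branch extraction. This assumes the method lives inside the estimator (so `$(...)` resolves params) and that an `Instance(label, weight, features)` case class is available; the merged code may differ slightly:

```scala
import org.apache.spark.ml.feature.Instance
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, lit}

// If weightCol is empty, select a literal 1.0 for every row; otherwise
// select the user-specified weight column. Either way, one code path.
protected def extractInstances(dataset: DataFrame): RDD[Instance] = {
  val w = if ($(weightCol).isEmpty) lit(1.0) else col($(weightCol))
  dataset.select(col($(labelCol)), w, col($(featuresCol))).map {
    case Row(label: Double, weight: Double, features: Vector) =>
      Instance(label, weight, features)
  }
}
```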
Test build #42471 has finished for PR 7884 at commit
LGTM but there are some merge conflicts in
Test build #42497 has finished for PR 7884 at commit
@@ -73,7 +81,11 @@ object MimaExcludes {
        "org.apache.spark.ml.regression.LeastSquaresCostFun.this"),
      ProblemFilters.exclude[MissingMethodProblem](
        "org.apache.spark.ml.classification.LogisticCostFun.this"),
      // SQL execution is considered private.
      ProblemFilters.exclude[MissingMethodProblem](
Are they necessary for 1.5? It would be nice to group the rules from LogisticAggregator.
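A sketch of the grouping being suggested, reusing the exclude pattern from the hunk above; the specific LogisticAggregator rule names here are illustrative assumptions, not the merged list:

```scala
// SPARK-7685: weighted instance support in Logistic Regression.
// Grouping the related MiMa rules under one comment keeps them together.
ProblemFilters.exclude[MissingMethodProblem](
  "org.apache.spark.ml.classification.LogisticCostFun.this"),
ProblemFilters.exclude[MissingMethodProblem](
  "org.apache.spark.ml.classification.LogisticAggregator.add"),
ProblemFilters.exclude[MissingMethodProblem](
  "org.apache.spark.ml.classification.LogisticAggregator.count")
```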
Test build #42502 has finished for PR 7884 at commit
Test build #42501 has finished for PR 7884 at commit
Two failed tests are from core and not related to this PR:
I'm going to merge this.
Merged into master. Thanks!
In a fraud detection dataset, almost all the samples are negative while only a couple of them are positive. This kind of highly imbalanced data biases models toward the negative class, resulting in poor performance. scikit-learn provides a correction that allows users to over-/undersample the samples of each class according to given weights; in "auto" mode, it selects weights inversely proportional to the class frequencies in the training set. This can be done more efficiently by multiplying the weights into the loss and gradient instead of actually over-/undersampling the training dataset, which is very expensive.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
On the other hand, some of the training data may be more important: for example, training samples from tenured users may matter more, while training samples from new users may matter less. We should be able to provide an additional "weight: Double" field in the LabeledPoint to weight samples differently in the learning algorithm.
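To make the class-weighting use case concrete, here is a hypothetical usage sketch against the weightCol API this PR adds. The input `df`, the column names, and the exact balancing scheme are illustrative assumptions:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when}

// Give the rare positive class a weight inversely proportional to its
// frequency, so both classes contribute equal total weight to the loss.
def trainWithClassWeights(df: DataFrame): Unit = {
  val numExamples = df.count().toDouble
  val numPositives = df.filter(col("label") === 1.0).count().toDouble
  val balancingRatio = (numExamples - numPositives) / numExamples

  val weighted = df.withColumn("classWeight",
    when(col("label") === 1.0, balancingRatio).otherwise(1.0 - balancingRatio))

  val lr = new LogisticRegression()
    .setLabelCol("label")
    .setFeaturesCol("features")
    .setWeightCol("classWeight") // weights multiply into loss and gradient

  val model = lr.fit(weighted)
  // "coefficients" per the deprecation discussion above (weights -> coefficients).
  println(s"Coefficients: ${model.coefficients}")
}
```

With this approach no rows are duplicated or dropped; each class's total contribution is equalized purely through the per-instance weights, which is the efficiency argument made above.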