
[SPARK-7685][ML] Apply weights to different samples in Logistic Regression #7884


Closed
wants to merge 26 commits

Conversation


@dbtsai dbtsai commented Aug 3, 2015

In a fraud detection dataset, almost all of the samples are negative while only a couple are positive. This kind of highly imbalanced data biases the model toward the negative class, resulting in poor performance. scikit-learn provides a correction that lets users over-/undersample the samples of each class according to given weights; in "auto" mode, it selects weights inversely proportional to the class frequencies in the training set. This can be done more efficiently by multiplying the weights into the loss and gradient instead of actually over-/undersampling the training dataset, which is very expensive.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
On the other hand, some training samples may be more important than others: for example, samples from tenured users may matter more, while samples from new users may matter less. We should be able to provide an additional "weight: Double" field in LabeledPoint to weight them differently in the learning algorithm.
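Multiplying a weight into the loss and gradient is equivalent to replicating each instance `weight` times, at no extra cost. A minimal plain-Scala sketch of the per-instance weighted binary logistic loss (a simplified illustration, not the PR's actual `LogisticAggregator`):

```scala
// Weighted binary logistic loss and gradient scale for one instance.
// `margin` is dot(coefficients, features); `label` is 0.0 or 1.0.
// Scaling by `weight` has the same effect as duplicating the instance
// `weight` times, without touching the dataset.
def weightedLossGradient(label: Double, weight: Double, margin: Double): (Double, Double) = {
  val prob = 1.0 / (1.0 + math.exp(-margin))
  val loss = -weight * (label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))
  val gradientScale = weight * (prob - label) // multiplied into each feature's gradient
  (loss, gradientScale)
}
```

Under an "auto"-style class balancing, the weight for a class would be inversely proportional to its frequency, so the rare positive class contributes as much to the loss as the common negative class.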


SparkQA commented Aug 3, 2015

Test build #39507 has finished for PR 7884 at commit e83b444.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Aug 3, 2015

Test build #39510 has finished for PR 7884 at commit 4dfc09c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final val probabilityCol: Param[String] = new Param[String](this, "probabilityCol", "Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities")


SparkQA commented Aug 3, 2015

Test build #39508 has finished for PR 7884 at commit 9ac7518.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final val probabilityCol: Param[String] = new Param[String](this, "probabilityCol", "Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities")


SparkQA commented Aug 3, 2015

Test build #39512 has finished for PR 7884 at commit 8bbcfe6.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final val probabilityCol: Param[String] = new Param[String](this, "probabilityCol", "Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities")


SparkQA commented Aug 3, 2015

Test build #39521 has finished for PR 7884 at commit 1e1d7a7.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final val probabilityCol: Param[String] = new Param[String](this, "probabilityCol", "Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities")

@jkbradley (Member)

I'll make a pass through this. (Please let me know if I should wait b/c of ongoing updates.)

@jkbradley (Member)

Initial thought: I don't think the user should need 2 parameters to specify whether to use sample weights.

What if, instead, there were no default weightCol? If the user specifies weightCol, then we use sample weights. If weightCol is unspecified or an empty String, then we don't use sample weights. Then we could eliminate HasWeightedSample.


dbtsai commented Aug 4, 2015

@jkbradley Sounds fair. I will also add tests now. BTW, can you merge https://github.com/apache/spark/pull/7875/files, which is ready to be merged? I will have another PR for weighted linear regression that depends on this one. Thanks.

(label, sampleWeight, features)
})
} else {
Left(extractLabeledPoints(dataset).map {
Member

No need to create LabeledPoints if they are just temporary.

Member Author

Okay. This copies the existing code to abstract out extracting the samples. I will remove it as you suggested.

@jkbradley (Member)

Thanks for adding weighted instance support, very useful!


dbtsai commented Aug 4, 2015

This is very useful for us as well. Our training samples have exponentially decaying weights as time goes by. I am trying to use it now, and it seems to work well. It would be nice to have a similar feature in GBDT, which is used in our current production. Open a JIRA and I can work on the GBDT version.
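The time-decayed weighting described above is commonly implemented as an exponential half-life scheme. A tiny illustrative sketch (the names `ageInDays` and `halfLife` are hypothetical, not part of this PR):

```scala
// Weight halves every `halfLife` days, so newer samples dominate training.
def decayWeight(ageInDays: Double, halfLife: Double): Double =
  math.pow(0.5, ageInDays / halfLife)
```

Such a weight would simply be stored in the instance's weight column and picked up by the weighted loss, with no resampling needed.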


dbtsai commented Aug 4, 2015

Jenkins, please test again


SparkQA commented Aug 4, 2015

Test build #39688 has finished for PR 7884 at commit 05328f1.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final val probabilityCol: Param[String] = new Param[String](this, "probabilityCol", "Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities")

* Param for sample weight column name.
* @group param
*/
final val sampleWeightCol: Param[String] = new Param[String](this, "sampleWeightCol", "sample weight column name")
Contributor

Shall we call it weightCol, with no default value? If it is defined, use the specified column as weights; otherwise, treat all weights as 1.0.

Member Author

Yes. This is the current behavior. I will change the name to weightCol.

@jkbradley (Member)

Here's the link for the GBT JIRA: https://issues.apache.org/jira/browse/SPARK-9612

@dbtsai dbtsai changed the title [SPARK-7685][ML] Apply weights to different samples in Logistic Regression [SPARK-7685][ML][WIP] Apply weights to different samples in Logistic Regression Aug 4, 2015
@jkbradley (Member)

@dbtsai Earlier, I had recommended "If weightCol is unspecified or an empty String, then we don't use sample weights." It occurred to me that we should set the default to an empty String (since Params should have defaults or be required, in general).


dbtsai commented Aug 4, 2015

@jkbradley and @mengxr I had weightCol default to empty in my local branch. Sorry for the confusion. This PR is meant as a WIP to see whether we have enough time to make it into 1.5, and is not intended to be merged as-is. Thanks.

@rotationsymmetry (Contributor)

Following this PR, how about filing a JIRA to implement weights for LinearRegression as well? I am thinking of using the same weighting API, which would enable us to develop a general boosting algorithm with the Pipeline API. What do you think?


dbtsai commented Aug 5, 2015

@rotationsymmetry Please help create the JIRA. For LinearRegression, it should be similar to this one. Once this one is merged, we can start porting it to LinearRegression.

@rotationsymmetry (Contributor)

@dbtsai JIRA added: https://issues.apache.org/jira/browse/SPARK-9642

Once the logistic regression PR is merged, I will use it as a model and work on linear regression.


SparkQA commented Aug 6, 2015

Test build #40013 has finished for PR 7884 at commit a0b15e6.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final val probabilityCol: Param[String] = new Param[String](this, "probabilityCol", "Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities")


SparkQA commented Aug 6, 2015

Test build #40017 has finished for PR 7884 at commit cd4cfbc.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final val probabilityCol: Param[String] = new Param[String](this, "probabilityCol", "Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities")


SparkQA commented Sep 14, 2015

Test build #42411 has finished for PR 7884 at commit 9acb9c9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final val probabilityCol: Param[String] = new Param[String](this, "probabilityCol", "Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities")

@jkbradley (Member)

+1 for "coefficients"

@jkbradley (Member)

I'll make another pass


mengxr commented Sep 14, 2015

FYI, I made https://issues.apache.org/jira/browse/SPARK-10592 for deprecating weights.

i += 1
}
Vectors.dense(realMean)
}

/**
* Sample variance of each dimension.
* Sample unbiased estimation of variance of each dimension.
Member

Change to: "Unbiased estimate of sample variance"

Member Author

I would like to deprecate "sample" and use "instance" instead. What do you guys think?
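For reference, the estimator the doc string is describing divides by n - 1 rather than n. A plain-Scala sketch of the standard unweighted formula (not the summarizer's streaming implementation):

```scala
// Unbiased sample variance: sum of squared deviations over (n - 1).
// Dividing by (n - 1) instead of n corrects the bias introduced by
// estimating the mean from the same samples.
def sampleVariance(xs: Seq[Double]): Double = {
  require(xs.length > 1, "need at least two samples")
  val n = xs.length
  val mean = xs.sum / n
  xs.map(x => (x - mean) * (x - mean)).sum / (n - 1)
}
```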

@jkbradley (Member)

Can you please update the Scala doc to clarify:

  • count ignores instance weights
  • Add a warning in the training summary that it currently ignores the training weights (except for the objective trace).

Other than those small items, this looks good.


dbtsai commented Sep 14, 2015

@jkbradley For count, let's do it later since it's a private API. Currently, count returns the actual number of instances and ignores instance weights, but numNonzeros returns the weighted number of nonzeros. Let's decide on their behavior and modify them accordingly.


dbtsai commented Sep 14, 2015

JIRA for the behavior of count and numNonzeros when we make them public: https://issues.apache.org/jira/browse/SPARK-10597


SparkQA commented Sep 14, 2015

Test build #42439 has finished for PR 7884 at commit 55a28b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final val probabilityCol: Param[String] = new Param[String](this, "probabilityCol", "Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities")

override def setThresholds(value: Array[Double]): this.type = super.setThresholds(value)

override def getThresholds: Array[Double] = super.getThresholds

override protected def train(dataset: DataFrame): LogisticRegressionModel = {
// Extract columns from data. If dataset is persisted, do not persist oldDataset.
val instances = extractLabeledPoints(dataset).map {
case LabeledPoint(label: Double, features: Vector) => (label, features)
val instances: RDD[Instance] = if ($(weightCol).isEmpty) {
Contributor

From my previous comment, using val w = if ($(weightCol).isEmpty) lit(1.0) else col($(weightCol)) could simplify this block.
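The effect of that simplification, defaulting the weight to 1.0 whenever no weight column is configured, can be illustrated with a plain-Scala stand-in (an Option replaces the Param and a Map replaces the DataFrame row; these names are hypothetical, not Spark API):

```scala
// If a non-empty weight column is configured, read the weight from the row;
// otherwise every instance gets the neutral weight 1.0.
def resolveWeight(weightCol: Option[String], row: Map[String, Double]): Double =
  weightCol.filter(_.nonEmpty).flatMap(row.get).getOrElse(1.0)
```

This mirrors why `lit(1.0)` works in the DataFrame version: the unweighted path becomes just a special case of the weighted one, so the two code branches collapse into a single select.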


SparkQA commented Sep 15, 2015

Test build #42471 has finished for PR 7884 at commit a66833b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final val probabilityCol: Param[String] = new Param[String](this, "probabilityCol", "Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities")


mengxr commented Sep 15, 2015

LGTM but there are some merge conflicts in MimaExcludes. Could you fix them? Thanks!


SparkQA commented Sep 15, 2015

Test build #42497 has finished for PR 7884 at commit 8d6de99.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final val probabilityCol: Param[String] = new Param[String](this, "probabilityCol", "Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities")

@@ -73,7 +81,11 @@ object MimaExcludes {
"org.apache.spark.ml.regression.LeastSquaresCostFun.this"),
ProblemFilters.exclude[MissingMethodProblem](
"org.apache.spark.ml.classification.LogisticCostFun.this"),
// SQL execution is considered private.
ProblemFilters.exclude[MissingMethodProblem](
Contributor

Are these necessary for 1.5? It would be nice to group the rules from LogisticAggregator.


SparkQA commented Sep 15, 2015

Test build #42502 has finished for PR 7884 at commit f53436a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final val probabilityCol: Param[String] = new Param[String](this, "probabilityCol", "Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities")


SparkQA commented Sep 15, 2015

Test build #42501 has finished for PR 7884 at commit e792b6a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final val probabilityCol: Param[String] = new Param[String](this, "probabilityCol", "Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities")


mengxr commented Sep 15, 2015

The two failed tests are from core and are not related to this PR:

  • Unpersisting TorrentBroadcast on executors and driver in distributed mode *** FAILED *** (11 seconds, 462 milliseconds)
  • mesos supports killing and limiting executors *** FAILED *** (29 milliseconds)

I'm going to merge this.


mengxr commented Sep 15, 2015

Merged into master. Thanks!

@asfgit asfgit closed this in be52faa Sep 15, 2015
@dbtsai dbtsai deleted the SPARK-7685 branch September 15, 2015 23:50