[SPARK-7685][ML] Apply weights to different samples in Logistic Regression #7884
Conversation
Test build #39507 has finished for PR 7884 at commit
Test build #39510 has finished for PR 7884 at commit
Test build #39508 has finished for PR 7884 at commit
Test build #39512 has finished for PR 7884 at commit
Test build #39521 has finished for PR 7884 at commit
I'll make a pass through this. (Please let me know if I should wait b/c of ongoing updates.)
Initial thought: I don't think the user should need 2 parameters to specify whether to use sample weights. What if, instead, there were no default weightCol? If the user specifies weightCol, then we use sample weights. If weightCol is unspecified or an empty String, then we don't use sample weights. Then we could eliminate HasWeightedSample.
@jkbradley Sounds fair. I will also add tests now. BTW, can you merge https://github.com/apache/spark/pull/7875/files, which is ready to be merged? I will have another PR for weighted linear regression which depends on this PR. Thanks.
          (label, sampleWeight, features)
        })
      } else {
        Left(extractLabeledPoints(dataset).map {
No need to create LabeledPoints if they are just temporary.
Okay. This copies the existing code for abstracting out the extraction of samples. I will remove it as you suggested.
Thanks for adding weighted instance support; very useful!
This is very useful for us as well. Our training samples have exponentially decaying weights as time goes by. I am trying to use it now, and it seems to work well. It would be nice to have a similar feature in GBDT, which is used in our current production. I'll open a JIRA and can work on the GBDT version.
Jenkins, please test again
Test build #39688 has finished for PR 7884 at commit
   * Param for sample weight column name.
   * @group param
   */
  final val sampleWeightCol: Param[String] = new Param[String](this, "sampleWeightCol", "sample weight column name")
Shall we call it weightCol, without a default value? If it is defined, use the specified column as weights; otherwise, treat all weights as 1.0.
Yes. This is the current behavior. I will change the name to weightCol.
Here's the link for the GBT JIRA: https://issues.apache.org/jira/browse/SPARK-9612
@dbtsai Earlier, I had recommended "If weightCol is unspecified or an empty String, then we don't use sample weights." It occurred to me that we should set the default to an empty String (since Params should have defaults or be required, in general).
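For illustration, here is a minimal sketch of the shared param as proposed above, assuming a `HasWeightCol`-style trait with an empty-string default; the name and behavior follow the discussion, but the merged code may place the default elsewhere or differ in details:

```scala
import org.apache.spark.ml.param.{Param, Params}

// Sketch of the proposed shared param: a single string param, defaulting
// to "" so that an unset/empty weightCol means "treat all weights as 1.0".
// No separate boolean flag (HasWeightedSample) is needed.
private[ml] trait HasWeightCol extends Params {

  /**
   * Param for weight column name. If this is not set or empty, we treat
   * all instance weights as 1.0.
   * @group param
   */
  final val weightCol: Param[String] = new Param[String](this, "weightCol",
    "weight column name. If this is not set or empty, we treat all instance weights as 1.0")

  setDefault(weightCol -> "")

  /** @group getParam */
  final def getWeightCol: String = $(weightCol)
}
```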
@jkbradley and @mengxr I had
Following this PR, how about filing a JIRA to implement weights for
@rotationsymmetry Please help create the JIRA. For
@dbtsai JIRA added: https://issues.apache.org/jira/browse/SPARK-9642 Once the logistic regression PR is merged, I will use it as the model and work on linear regression.
Test build #40013 has finished for PR 7884 at commit
Test build #40017 has finished for PR 7884 at commit
Test build #42411 has finished for PR 7884 at commit
+1 for "coefficients"
I'll make another pass
FYI, I made https://issues.apache.org/jira/browse/SPARK-10592 for deprecating
      i += 1
    }
    Vectors.dense(realMean)
  }

  /**
-   * Sample variance of each dimension.
+   * Sample unbiased estimation of variance of each dimension.
Change to: "Unbiased estimate of sample variance"
I would like to deprecate sample and use instance instead. What do you guys think?
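For reference, the quantity under discussion is presumably the standard unbiased sample variance, with a frequency-weight generalization once instance weights enter; this is one common choice, and the summarizer's exact correction factor may differ:

```latex
% Unweighted, unbiased sample variance of one dimension:
s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2
% One common frequency-weight generalization, with total weight W = \sum_i w_i:
s_w^2 = \frac{1}{W - 1} \sum_{i=1}^{n} w_i \left( x_i - \bar{x}_w \right)^2,
\qquad \bar{x}_w = \frac{1}{W} \sum_{i=1}^{n} w_i x_i
```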
Can you please update the Scala doc to clarify:
Other than those small items, this looks good.
@jkbradley For
JIRA for the behavior of
Test build #42439 has finished for PR 7884 at commit
  override def setThresholds(value: Array[Double]): this.type = super.setThresholds(value)

  override def getThresholds: Array[Double] = super.getThresholds

  override protected def train(dataset: DataFrame): LogisticRegressionModel = {
    // Extract columns from data. If dataset is persisted, do not persist oldDataset.
-   val instances = extractLabeledPoints(dataset).map {
-     case LabeledPoint(label: Double, features: Vector) => (label, features)
+   val instances: RDD[Instance] = if ($(weightCol).isEmpty) {
From my previous comment, using `val w = if ($(weightCol).isEmpty) lit(1.0) else col($(weightCol))` could simplify this block.
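As a concrete sketch of that suggestion: a single select with a constant-1.0 weight column replaces the two-branch extraction. This assumes the method lives inside the estimator (so `$(...)` resolves params) and that an `Instance(label, weight, features)` case class is available; the merged code may differ slightly:

```scala
import org.apache.spark.ml.feature.Instance
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, lit}

// If weightCol is empty, select a literal 1.0 for every row; otherwise
// select the user-specified weight column. Either way, one code path.
protected def extractInstances(dataset: DataFrame): RDD[Instance] = {
  val w = if ($(weightCol).isEmpty) lit(1.0) else col($(weightCol))
  dataset.select(col($(labelCol)), w, col($(featuresCol))).map {
    case Row(label: Double, weight: Double, features: Vector) =>
      Instance(label, weight, features)
  }
}
```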
Test build #42471 has finished for PR 7884 at commit
LGTM but there are some merge conflicts in
Test build #42497 has finished for PR 7884 at commit
@@ -73,7 +81,11 @@ object MimaExcludes {
        "org.apache.spark.ml.regression.LeastSquaresCostFun.this"),
      ProblemFilters.exclude[MissingMethodProblem](
        "org.apache.spark.ml.classification.LogisticCostFun.this"),
      // SQL execution is considered private.
      ProblemFilters.exclude[MissingMethodProblem](
Are they necessary for 1.5? It would be nice to group the rules from LogisticAggregator.
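A sketch of the grouping being suggested, reusing the exclude pattern from the hunk above; the specific LogisticAggregator rule names here are illustrative assumptions, not the merged list:

```scala
// SPARK-7685: weighted instance support in Logistic Regression.
// Grouping the related MiMa rules under one comment keeps them together.
ProblemFilters.exclude[MissingMethodProblem](
  "org.apache.spark.ml.classification.LogisticCostFun.this"),
ProblemFilters.exclude[MissingMethodProblem](
  "org.apache.spark.ml.classification.LogisticAggregator.add"),
ProblemFilters.exclude[MissingMethodProblem](
  "org.apache.spark.ml.classification.LogisticAggregator.count")
```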
Test build #42502 has finished for PR 7884 at commit
Test build #42501 has finished for PR 7884 at commit
Two failed tests are from core and not related to this PR:
I'm going to merge this.
Merged into master. Thanks!
In a fraud detection dataset, almost all the samples are negative while only a couple of them are positive. This kind of highly imbalanced data biases models toward the negative class, resulting in poor performance. scikit-learn provides a correction that allows users to over-/undersample the samples of each class according to given weights; in "auto" mode, it selects weights inversely proportional to the class frequencies in the training set. This can be done more efficiently by multiplying the weights into the loss and gradient instead of actually over-/undersampling the training dataset, which is very expensive.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
On the other hand, some of the training data may be more important: for example, training samples from tenured users may matter more, while training samples from new users may matter less. We should be able to provide an additional "weight: Double" field in the LabeledPoint to weight samples differently in the learning algorithm.
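To make the class-weighting use case concrete, here is a hypothetical usage sketch against the weightCol API this PR adds. The input `df`, the column names, and the exact balancing scheme are illustrative assumptions:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when}

// Give the rare positive class a weight inversely proportional to its
// frequency, so both classes contribute equal total weight to the loss.
def trainWithClassWeights(df: DataFrame): Unit = {
  val numExamples = df.count().toDouble
  val numPositives = df.filter(col("label") === 1.0).count().toDouble
  val balancingRatio = (numExamples - numPositives) / numExamples

  val weighted = df.withColumn("classWeight",
    when(col("label") === 1.0, balancingRatio).otherwise(1.0 - balancingRatio))

  val lr = new LogisticRegression()
    .setLabelCol("label")
    .setFeaturesCol("features")
    .setWeightCol("classWeight") // weights multiply into loss and gradient

  val model = lr.fit(weighted)
  // "coefficients" per the deprecation discussion above (weights -> coefficients).
  println(s"Coefficients: ${model.coefficients}")
}
```

With this approach no rows are duplicated or dropped; each class's total contribution is equalized purely through the per-instance weights, which is the efficiency argument made above.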