[SPARK-3251][MLLIB]: Clarify learning interfaces #2137
Conversation
@BigCrunsh Do you mind creating a JIRA and adding the JIRA number to this PR title, e.g., [SPARK-####][MLLIB]?
Jenkins, test this please.
QA tests have started for PR 2137. This patch merges cleanly.
QA results for PR 2137:
@mengxr, I agree that we should avoid breaking existing APIs. Is it fine for you to re-add
Currently, MLlib contains linear models (GLMs) that produce scores based on an inner product, classification models that may derive a classification from those scores, and probabilistic models that provide a confidence score (or a probability under some model assumption) in addition to the predicted class. At the moment, the score of a classification model is only accessible by removing the threshold:

val classes = model.predict(testset)
val scores = model.clearThreshold().predict(testset)

The threshold is lost after the last step, and for logistic regression it is not possible to access the (uncalibrated) score at all. Depending on the model, I would expect direct and consistent access to all of these values:

val classes = model.predictClass(testset)
val scores = model.predictScore(testset)
val probs = model.predictProbability(testset)

@mengxr: I think in general a probability is some measure of the likeliness that an event will occur. It is often based on more or less realistic model assumptions (e.g., the normality assumption in regression, t-tests, etc.), isn't it? The exponential family, which is the assumption on the class-wise conditional distributions
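As an aside, the clearThreshold() behaviour referenced in this comment can be reproduced with a small, purely illustrative example against the existing spark.mllib API (the toy weights below are made up):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vectors

// Toy model; in practice the weights come from training.
val model = new LogisticRegressionModel(Vectors.dense(1.5, -0.3), 0.1)
val x = Vectors.dense(2.0, 1.0)

val label = model.predict(x)   // 0.0 or 1.0: the default threshold is applied
model.clearThreshold()         // mutates the model's output mode
val score = model.predict(x)   // now the uncalibrated score; the threshold is gone
```

The same call changes meaning depending on the model's mutable state, which is exactly what the separate predictClass/predictScore/predictProbability methods are meant to avoid.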
Jenkins, add to whitelist.
QA tests have started for PR 2137. This patch merges cleanly.
The assumption is usually unrealistic. For logistic regression, it is common to see predictions like 0.99999 or 0.000001, and they cannot be interpreted as probabilities without calibration. Logistic regression itself is not responsible for that. I created a JIRA for isotonic regression, which can be used for calibration: https://issues.apache.org/jira/browse/SPARK-3278. For the method names, my suggestion would be: add
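For reference, isotonic regression later landed in spark.mllib as org.apache.spark.mllib.regression.IsotonicRegression. A minimal sketch of how it could be used to calibrate the raw outputs of a logistic regression model (the held-out split and the iteration count are assumptions, not anything from this PR):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.{IsotonicRegression, IsotonicRegressionModel, LabeledPoint}
import org.apache.spark.rdd.RDD

def calibrate(train: RDD[LabeledPoint], holdout: RDD[LabeledPoint]): IsotonicRegressionModel = {
  val lr = LogisticRegressionWithSGD.train(train, numIterations = 100)
  lr.clearThreshold() // make predict() return the uncalibrated score

  // (label, score, weight) triples: fit a monotone map from score to label frequency.
  val scored = holdout.map(p => (p.label, lr.predict(p.features), 1.0))
  new IsotonicRegression().setIsotonic(true).run(scored)
}
```

The calibrated model's predict(score) then yields a probability estimate that is comparable across models.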
QA results for PR 2137:
@mengxr, might it be that you are mistaking logistic regression for Naive Bayes? Logistic regression typically predicts well-calibrated probabilities, see e.g. [1]; it only becomes problematic if the data can be separated perfectly. The learning algorithm returns ("is responsible for") a model that maximizes the likelihood of the data under the model assumption; in classification, the returned "probability" measures how likely it is that a certain label is generated by the learned model for a given example. Adding isotonic regression is a good idea anyway. I think we should definitely distinguish between the output of the linear model (the score) and the calibrated value (the probability); which one is needed depends on the task. Furthermore, having a function that changes the type of its output depending on the model is misleading: one should expect that a score function always returns an arbitrary real value and that the calibrated version returns a value between zero and one. Sklearn [2], for example, makes this distinction too:

Perhaps

[1] Niculescu-Mizil, Alexandru, and Rich Caruana. "Predicting good probabilities with supervised learning." Proceedings of the 22nd International Conference on Machine Learning. ACM, 2005.
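To illustrate the distinction being argued for here, a self-contained sketch (hypothetical helper functions, not MLlib API) of the two quantities for a linear binary classifier:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Raw margin w^T x + b: any real value.
def score(weights: Vector, intercept: Double, x: Vector): Double =
  weights.toArray.zip(x.toArray).map { case (w, xi) => w * xi }.sum + intercept

// Logistic-regression-style calibration of the margin: always in (0, 1).
def probability(weights: Vector, intercept: Double, x: Vector): Double =
  1.0 / (1.0 + math.exp(-score(weights, intercept, x)))

val w = Vectors.dense(2.0, -1.0)
score(w, 0.5, Vectors.dense(1.0, 3.0))       // -0.5
probability(w, 0.5, Vectors.dense(1.0, 3.0)) // ~0.38
```

Keeping the two behind differently named methods makes the contract of each obvious, which is the point of the predictScore/predictProbability split.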
* Make threshold mandatory

Currently, the output of ``predict`` for an example is either the score or the class. This side effect is caused by ``clearThreshold``. To clarify that behaviour, three different types of predict (predictScore, predictClass, predictProbability) were introduced; the threshold is no longer optional.

* Clarify classification interfaces

Currently, some functionality is spread over multiple models. In order to clarify the structure and simplify the implementation of more complex models (like multinomial logistic regression), two new classes are introduced:

- BinaryClassificationModel: for all models that derive a binary classification from a single weight vector. It comprises the thresholding functionality to derive a prediction from a score and basically captures SVMModel and LogisticRegressionModel.
- ProbabilisticClassificationModel: this trait defines the interface for models that return a calibrated confidence score (aka probability).

* Misc

- some renaming
- add test for probabilistic output
- scalastyle issues
- java test suite
- java logistic regression suite
- add deprecated versions
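A condensed sketch of the hierarchy this description implies (member names and signatures are illustrative, not the exact code in the patch):

```scala
import org.apache.spark.mllib.linalg.Vector

// Thresholding lives in one place; a mandatory threshold removes the
// clearThreshold() side effect.
abstract class BinaryClassificationModel(
    val weights: Vector,
    val intercept: Double,
    val threshold: Double) {

  /** Raw margin, shared by SVMModel and LogisticRegressionModel. */
  def predictScore(features: Vector): Double

  /** Hard 0/1 label derived from the score via the mandatory threshold. */
  def predictClass(features: Vector): Double =
    if (predictScore(features) > threshold) 1.0 else 0.0
}

// Only models that can return a calibrated confidence mix this in.
trait ProbabilisticClassificationModel {
  def predictProbability(features: Vector): Double
}
```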
@mengxr, do you agree with this modification?
@BigCrunsh No, I meant logistic regression. As you mentioned, LR's output will be off by a huge margin when the points are easily separable. There are other cases that bias the output as well, for example an unbalanced label distribution or different regularization. My point is that we shouldn't interpret LR's output as probabilities without calibration, and calling it

It will be really helpful if you are interested in implementing isotonic regression, which was used as the baseline in [1]. How about adding a new method called

It returns an array of size
QA tests have started for PR 2137 at commit
QA tests have finished for PR 2137 at commit
I have to admit that this PR may try to address too many issues at once. I think the major ones are:

Ideally,

My suggestions for models that inherit from GLMs are:

I think it makes sense to address these issues first, before we start implementing new algorithms. If we can agree on some of these points, I would be happy to help break down this PR (and also implement further algorithms such as isotonic regression or multiclass logistic regression).
@BigCrunsh I think this PR brings up a lot of really important issues, and I agree with you that it might be better broken into different PRs. Clarifying a hierarchy and the methods each abstract type should implement would be very useful (especially for ensemble methods), and perhaps should come first. I created a JIRA and am working on a design doc for this hierarchy, including for things beyond GLMs; it would be great to get feedback on it. (I linked it to your JIRA.) After we discuss on the JIRA, we can figure out a good plan for PRs. Does that sound reasonable? Thanks!

@jkbradley: yes, sounds great! I will give feedback on the design doc ASAP.

@BigCrunsh I just submitted a WIP for the new MLlib API. Apologies for the slow development, but I'd like to try to get your PR in to improve the original MLlib API as much as possible. Please let me know if you have feedback on my PR, and we can think about how best to align the new and old APIs. Thanks!
@BigCrunsh Could we please discuss our respective interfaces to converge on some standard naming conventions? I’d like to get your PR merged to update the main spark.mllib package, but also make sure it matches the experimental spark.ml package as much as possible. There are really only a few items to decide. My votes on naming classes & methods:
+1 for ProbabilisticClassificationModel. But the current version sounds specific to binary classification. Would you want to rename it to ProbabilisticBinaryClassificationModel, or generalize it to return the probability for each possible label? (I'm doing the latter in my PR, using predictProbabilities().) After we settle on these items, I'd like to make a detailed pass over your PR. Thanks in advance!
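A sketch of what the generalized, label-indexed variant being discussed might look like (the trait shape and signatures are assumptions, not the final API):

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

trait ProbabilisticClassificationModel extends Serializable {
  /** Number of labels the model distinguishes (2 in the binary case). */
  def numClasses: Int

  /** One probability per label, indices 0 until numClasses, summing to 1. */
  def predictProbabilities(features: Vector): Array[Double]

  /** Bulk variant over an RDD of feature vectors. */
  def predictProbabilities(testData: RDD[Vector]): RDD[Array[Double]] =
    testData.map(v => predictProbabilities(v))
}
```

The binary case then falls out as the entry at index 1, so a separate ProbabilisticBinaryClassificationModel would not be strictly necessary.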
@@ -45,14 +46,58 @@ abstract class GeneralizedLinearModel(val weights: Vector, val intercept: Double
 * @param weightMatrix Column vector containing the weights of the model
 * @param intercept Intercept of the model.
 */
protected def computeScore(
    dataMatrix: Vector, weightMatrix: Vector, intercept: Double): Double = {
  weightMatrix.toBreeze.dot(dataMatrix.toBreeze) + intercept
}
You can use mllib.linalg.BLAS.dot() now, rather than converting to breeze.
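In other words, the body above could be written roughly as follows (assuming the code stays inside spark.mllib, where the package-private BLAS helper is visible):

```scala
import org.apache.spark.mllib.linalg.{BLAS, Vector}

// Drop-in for the computeScore body above; avoids converting both vectors to Breeze.
protected def computeScore(
    dataMatrix: Vector,
    weightMatrix: Vector,
    intercept: Double): Double = {
  BLAS.dot(weightMatrix, dataMatrix) + intercept
}
```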
I think this has gone stale and timed out. Do you mind closing this PR?