
[SPARK-3251][MLLIB]: Clarify learning interfaces #2137

Closed · wants to merge 5 commits into from

Conversation

BigCrunsh
Contributor

** Make threshold mandatory **
Currently, the output of predict for an example is either the score
or the class, depending on whether clearThreshold was called. To
clarify that behaviour, three distinct predict methods (predictScore,
predictClass, predictProbability) are introduced; the threshold is no
longer optional.

** Clarify classification interfaces
Currently, some functionality is spread over multiple models.
To clarify the structure and simplify the implementation of
more complex models (like multinomial logistic regression), two new
classes are introduced (see the sketch below):

  • BinaryClassificationModel: for all models that derive a binary
    classification from a single weight vector. It comprises the thresholding
    functionality used to derive a prediction from a score, and essentially
    covers SVMModel and LogisticRegressionModel.
  • ProbabilisticClassificationModel: this trait defines the interface for
    models that return a calibrated confidence score (aka probability).
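
A rough Scala sketch of this hierarchy (the method names follow the description above; the exact signatures are illustrative assumptions, not the code in this patch):

import org.apache.spark.mllib.linalg.Vector

// Sketch only: a single-weight-vector model that derives a binary class from a score.
abstract class BinaryClassificationModel(val weights: Vector, val intercept: Double) {
  /** Decision threshold applied to the score. */
  def threshold: Double

  /** Raw score, e.g. the inner product w^T x + intercept. */
  def predictScore(features: Vector): Double

  /** Class label (0.0 or 1.0) obtained by thresholding the score. */
  def predictClass(features: Vector): Double =
    if (predictScore(features) > threshold) 1.0 else 0.0
}

// Sketch only: models that additionally expose a calibrated confidence score.
trait ProbabilisticClassificationModel {
  /** Probability of the positive class for the given example. */
  def predictProbability(features: Vector): Double
}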

** Misc

  • some renaming
  • add test for probabilistic output

@mengxr
Contributor

mengxr commented Aug 27, 2014

@BigCrunsh Do you mind creating a JIRA and adding the JIRA number to this PR title, e.g., [SPARK-####][MLLIB]?

clearThreshold was a workaround, which I don't really like. But let us try not to break many existing APIs. Another comment is that predictProbability is somewhat misleading. Though LR outputs a probability, it is only one in theory, under strong assumptions about the error distribution. These assumptions usually do not hold in practice, and users should not treat the outputs as probabilities without calibration.

@mengxr
Contributor

mengxr commented Aug 27, 2014

Jenkins, test this please.

@SparkQA

SparkQA commented Aug 27, 2014

QA tests have started for PR 2137. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19306/consoleFull

@SparkQA

SparkQA commented Aug 27, 2014

QA results for PR 2137:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19306/consoleFull

@BigCrunsh changed the title from "mllib: Clarify learning interfaces" to "[SPARK-2137][MLLIB]: Clarify learning interfaces" on Aug 27, 2014
@BigCrunsh
Contributor Author

@mengxr, I agree that we should avoid breaking existing APIs. Would it be fine with you to re-add predict and clearThreshold and mark them as deprecated?

@BigCrunsh changed the title from "[SPARK-2137][MLLIB]: Clarify learning interfaces" to "[SPARK-3251][MLLIB]: Clarify learning interfaces" on Aug 27, 2014
@BigCrunsh
Contributor Author

Currently, MLLIB contains linear models (GLMs) that produce scores based on an inner product, classification models that may derive a classification from those scores, and probabilistic models that provide a confidence score (or a probability under some model assumption) in addition to the predicted class. At the moment, the score for classification models is only available by removing the threshold:

val classes = model.predict(testset)
val scores = model.clearThreshold().predict(testset)

The threshold is lost after the last step, and for LogReg it is not possible to access the (uncalibrated) score at all. However, depending on the model, I would expect direct and consistent access to all of these values:

val classes = model.predictClass(testset)
val scores = model.predictScore(testset)
val probs = model.predictProbability(testset)
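
For comparison, a minimal end-to-end sketch of the current stateful behaviour (the training call, RDD names, and threshold value are illustrative):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Sketch only: how the threshold state leaks into predict today.
def currentBehaviour(training: RDD[LabeledPoint], testset: RDD[Vector]): Unit = {
  val model = LogisticRegressionWithSGD.train(training, 100)

  val classes = model.predict(testset)                  // 0.0 / 1.0 labels
  val scores  = model.clearThreshold().predict(testset) // now the model's raw output

  // clearThreshold mutated the model; to get class labels again,
  // the threshold has to be restored explicitly.
  model.setThreshold(0.5)
  val classesAgain = model.predict(testset)
}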

@mengxr: I think in general a probability is some measure of the likelihood that an event will occur. It is often based on more or less realistic model assumptions (e.g., the normality assumption in regression, t-tests, etc.), isn't it? The exponential family, which is assumed for the class-wise conditional distributions p(features|class), comprises commonly used distributions such as the multinomial, Poisson, and Gaussian distributions. The learning algorithm (with tuned hyper-parameters) is then "responsible" for calibrating these probabilities. Do you have a more appropriate name to distinguish between scores and "probabilities"?

@mengxr
Contributor

mengxr commented Aug 28, 2014

Jenkins, add to whitelist.

@SparkQA

SparkQA commented Aug 28, 2014

QA tests have started for PR 2137. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19382/consoleFull

@mengxr
Contributor

mengxr commented Aug 28, 2014

The assumption is usually unrealistic. For logistic regression, it is common for the predictions to be something like 0.99999 or 0.000001, and they cannot be interpreted as probabilities without calibration. Logistic regression itself does not take care of calibration. I created a JIRA for isotonic regression, which can be used for calibration: https://issues.apache.org/jira/browse/SPARK-3278

For the method names, my suggestion would be: add classify, which outputs classes using a threshold, and keep predict, which outputs the raw predictions. Do not distinguish between predictScore and predictProb.
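
A sketch of what that naming could look like on the existing ClassificationModel trait (illustrative signatures, not part of this patch):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Sketch only: the naming proposed above.
trait ClassificationModel extends Serializable {
  /** Raw prediction (margin for SVM, sigmoid output for logistic regression). */
  def predict(testData: RDD[Vector]): RDD[Double]

  /** Class label obtained by applying a threshold to the raw prediction. */
  def classify(testData: RDD[Vector]): RDD[Double]
}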

@SparkQA

SparkQA commented Aug 28, 2014

QA results for PR 2137:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19382/consoleFull

@BigCrunsh
Contributor Author

@mengxr, might it be that you are mistaking logistic regression for Naive Bayes? Logistic regression typically predicts well-calibrated probabilities, see e.g. [1]; it might only be problematic if the data can be separated perfectly. The learning algorithm returns ("is responsible for") a model that maximizes the likelihood of the data under the model assumption; in classification, the returned "probability" measures how likely it is that a certain label is generated by the learned model for a given example. Adding isotonic regression is a good idea anyway.

I think we should definitely distinguish between the output of the linear model (score) and the calibrated value (probability); which one is needed depends on the task. Furthermore, having a function whose output type changes depending on the model is misleading. E.g., one should expect that a score function always returns an arbitrary real value and that the calibrated version returns a value between zero and one. Sklearn [2], for example, makes this distinction too: decision_function for scores, predict for class labels, predict_proba for probability estimates. However, it is not obvious what predict returns (@mengxr: what do you mean by "raw predictions"?). My suggestion would be:

  • classify or predictClass for the class;
  • score or decisionValue or predictScore for the outcome of the linear model;
  • probabilityEstimate or predictProbability for an estimate of the class probability.

Perhaps predict could return the class for classification and the regression value for regression tasks (or just be kept as a deprecated version).

[1] Niculescu-Mizil, Alexandru, and Rich Caruana. "Predicting good probabilities with supervised learning." Proceedings of the 22nd international conference on Machine learning. ACM, 2005.
[2] http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

@SparkQA

SparkQA commented Aug 28, 2014

QA tests have started for PR 2137. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19399/consoleFull

@SparkQA

SparkQA commented Aug 28, 2014

QA results for PR 2137:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19399/consoleFull

@SparkQA

SparkQA commented Aug 28, 2014

QA tests have started for PR 2137. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19406/consoleFull

@SparkQA

SparkQA commented Aug 28, 2014

QA tests have started for PR 2137. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19410/consoleFull

@shaneknapp
Contributor

Jenkins, retest this please

@SparkQA

SparkQA commented Aug 28, 2014

QA tests have started for PR 2137. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19411/consoleFull

@SparkQA

SparkQA commented Aug 28, 2014

QA results for PR 2137:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19410/consoleFull

@SparkQA

SparkQA commented Aug 28, 2014

QA results for PR 2137:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19411/consoleFull

@BigCrunsh
Contributor Author

Jenkins, retest this please

@SparkQA

SparkQA commented Sep 1, 2014

QA tests have started for PR 2137. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19554/consoleFull

@SparkQA

SparkQA commented Sep 1, 2014

QA results for PR 2137:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19554/consoleFull

* Make threshold mandatory
Currently, the output of ``predict`` for an example is either the score
or the class, depending on whether ``clearThreshold`` was called. To
clarify that behaviour, three distinct predict methods (predictScore,
predictClass, predictProbability) are introduced; the threshold is no
longer optional.

* Clarify classification interfaces
Currently, some functionality is spread over multiple models.
To clarify the structure and simplify the implementation of
more complex models (like multinomial logistic regression), two new
classes are introduced:
- BinaryClassificationModel: for all models that derive a binary
classification from a single weight vector. Comprises the thresholding
functionality to derive a prediction from a score. It essentially covers
SVMModel and LogisticRegressionModel.
- ProbabilisticClassificationModel: this trait defines the interface for
models that return a calibrated confidence score (aka probability).

* Misc
- some renaming
- add test for probabilistic output
- scalastyle issues
- java test suite
- java logistic regression suite
- add deprecated versions
@BigCrunsh
Contributor Author

Jenkins, retest this please

@SparkQA

SparkQA commented Sep 1, 2014

QA tests have started for PR 2137. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19559/consoleFull

@SparkQA

SparkQA commented Sep 1, 2014

QA results for PR 2137:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19559/consoleFull

@BigCrunsh
Contributor Author

Jenkins, retest this please

@SparkQA

SparkQA commented Sep 1, 2014

QA tests have started for PR 2137. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19560/consoleFull

@SparkQA

SparkQA commented Sep 1, 2014

QA results for PR 2137:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19560/consoleFull

@BigCrunsh
Contributor Author

Jenkins, retest this please

@SparkQA

SparkQA commented Sep 1, 2014

QA tests have started for PR 2137. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19564/consoleFull

@SparkQA

SparkQA commented Sep 1, 2014

QA results for PR 2137:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19564/consoleFull

@BigCrunsh
Contributor Author

Jenkins, retest this please

@SparkQA

SparkQA commented Sep 1, 2014

QA tests have started for PR 2137. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19567/consoleFull

@SparkQA

SparkQA commented Sep 1, 2014

QA results for PR 2137:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19567/consoleFull

@BigCrunsh
Contributor Author

@mengxr, do you agree with this modification?

@mengxr
Contributor

mengxr commented Sep 3, 2014

@BigCrunsh No, I meant logistic regression. As you mentioned, LR's output will be off by a huge margin when the points are easily separable. There are other cases that bias the output, for example, an unbalanced label distribution and different regularization. My point is that we shouldn't interpret LR's output as probabilities without calibration, and calling it predictProb is misleading. A user may use the output directly to estimate, e.g., expected revenue, which is wrong.

It would be really helpful if you are interested in implementing isotonic regression, which was used as the baseline in [1].

How about adding a new method called predictRaw to ClassificationModel:

def predictRaw(point: Vector): Vector

It returns an array of size numClasses, which contains the confidence/margin/probability score for each class. For LR, this is the output from logistic regression. For SVM, this is [margin, -margin]. For a classification tree, this is the probability for each class at the leaf node.
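
For illustration, a minimal sketch of what predictRaw could look like for a binary linear model (the helper class name and the class ordering are assumptions):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Sketch only: one raw score per class, as proposed above.
class BinaryLinearModelSketch(weights: Vector, intercept: Double) {
  val numClasses: Int = 2

  def predictRaw(point: Vector): Vector = {
    // Margin of the linear model: w^T x + intercept.
    val margin = weights.toArray.zip(point.toArray).map { case (w, x) => w * x }.sum + intercept
    // For an SVM-like model this would be [margin, -margin]; an LR-like model
    // could instead return the per-class probabilities.
    Vectors.dense(margin, -margin)
  }
}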

@SparkQA

SparkQA commented Sep 5, 2014

QA tests have started for PR 2137 at commit 3920339.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 6, 2014

QA tests have finished for PR 2137 at commit 3920339.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BigCrunsh
Contributor Author

I have to admit that this PR may try to address too many issues at once. I think the major ones are that, ideally,

  • the model should be immutable and stateless;
  • the output type of predict should depend neither on whether the threshold is set nor on the kind of model;
  • the model should provide access to all variables of interest (scores, classes, probabilities);
  • we need a distinction between multi-class and binary classification models that inherit from GLMs.

My suggestions for models that inherit from GLMs are:

  • introduce more specific predict functions that distinguish between (inner-product) scores, probabilities (regardless of the naming), and classes (it might be nice to have some traits for that too);
  • extend the hierarchy of models; it seems necessary to distinguish between multi-class and binary classification;
  • remove clearThreshold.

I think it makes sense to address these issues before we start implementing new algorithms. If we can agree on some of these points, I would be happy to help break down this PR (and also to implement further algorithms such as isotonic regression or multiclass logistic regression).

@jkbradley
Member

@BigCrunsh I think this PR brings up a lot of really important issues, and I agree with you that it might be better broken into different PRs. Clarifying a hierarchy and the methods each abstract type should implement would be very useful (especially for ensemble methods), and perhaps should come first. I created a JIRA and am working on a design doc for this hierarchy, including for things beyond GLMs; it would be great to get feedback about it. (I linked it to your JIRA.) After we discuss on the JIRA, we can figure out a good plan for PRs. Does that sound reasonable? Thanks!

@BigCrunsh
Contributor Author

@jkbradley: yes, sounds great! I will give feedback on the design doc asap.

@jkbradley
Member

@BigCrunsh I just submitted a WIP for the new MLlib API. Apologies for the slow development, but I'd like to try to get your PR in to improve the original MLlib API as much as possible. Please let me know if you have feedback for my PR, and we can try to think about how best to align the new & old APIs as much as possible. Thanks!

@jkbradley
Member

@BigCrunsh Could we please discuss our respective interfaces to converge on some standard naming conventions? I’d like to get your PR merged to update the main spark.mllib package, but also make sure it matches the experimental spark.ml package as much as possible. There are really only a few items to decide.

My votes on naming classes & methods:

  • BinaryClassificationModel -> BinaryClassificationGLM (more precise)
  • predictClass -> predict
    • I vote for “predict()” always meaning predict the label (class for classification, or real value for regression). That seems more standard (e.g., scikit-learn uses this convention).
  • predictScore -> predictRaw?
    • “score” is a very overloaded term, and “raw” might be more intuitive.

+1 for ProbabilisticClassificationModel. But the current version sounds specific to binary classification. Would you want to rename it to ProbabilisticBinaryClassificationModel, or generalize it to return the probability for each possible label? (I’m doing the latter in my PR, using predictProbabilities().)
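
For reference, a sketch of the generalized variant described above (illustrative only; everything beyond the predictProbabilities name is an assumption):

import org.apache.spark.mllib.linalg.Vector

// Sketch only: probabilities for every possible label, not just the binary case.
trait ProbabilisticClassificationModel {
  def numClasses: Int

  /** Per-class probability estimates; the returned vector has length numClasses
    * and its entries sum to one. */
  def predictProbabilities(features: Vector): Vector
}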

After we settle on these items, I’d like to make a detailed pass over your PR. Thanks in advance!

@@ -45,14 +46,58 @@ abstract class GeneralizedLinearModel(val weights: Vector, val intercept: Double
* @param weightMatrix Column vector containing the weights of the model
* @param intercept Intercept of the model.
*/
protected def computeScore(
    dataMatrix: Vector, weightMatrix: Vector, intercept: Double): Double = {
  weightMatrix.toBreeze.dot(dataMatrix.toBreeze) + intercept
Member

You can use mllib.linalg.BLAS.dot() now, rather than converting to breeze.
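
A sketch of what the suggested change might look like (assuming the code stays inside the mllib package, where BLAS is accessible):

import org.apache.spark.mllib.linalg.{BLAS, Vector}

// Inside GeneralizedLinearModel: same score, computed with BLAS.dot
// instead of converting both vectors to breeze.
protected def computeScore(
    dataMatrix: Vector, weightMatrix: Vector, intercept: Double): Double = {
  BLAS.dot(weightMatrix, dataMatrix) + intercept
}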

@srowen
Member

srowen commented May 18, 2015

I think this has gone stale and timed out. Do you mind closing this PR?

@asfgit closed this in 6916533 on Jul 9, 2015