[SPARK-3251][MLLIB]: Clarify learning interfaces #2137
Conversation
@BigCrunsh Do you mind creating a JIRA and adding the JIRA number to this PR title, e.g., [SPARK-####][MLLIB]?
Jenkins, test this please.
QA tests have started for PR 2137. This patch merges cleanly.
QA results for PR 2137:
@mengxr, I agree that we should avoid breaking existing APIs. Is it fine for you to re-add
Currently, MLlib contains linear models (GLMs) that produce scores based on an inner product, classification models that may derive a classification from those scores, and probabilistic models that provide a confidence score (or a probability under some model assumption) in addition to the predicted class. At the moment, the score of a classification model is only accessible by removing the threshold:

val classes = model.predict(testset)
val scores = model.clearThreshold().predict(testset)

The threshold is lost after the last step, and for logistic regression it is not possible to access the (uncalibrated) score at all. Depending on the model, I would expect direct and consistent access to all of these values:

val classes = model.predictClass(testset)
val scores = model.predictScore(testset)
val probs = model.predictProbability(testset)

@mengxr: I think in general a probability is some measure of the likeliness that an event will occur. It is often based on more or less realistic model assumptions (e.g., the normality assumption in regression, t-tests, etc.), isn't it? The exponential family, which is the assumption on the class-wise conditional distributions
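As an aside, the clearThreshold() behaviour referenced in this comment can be reproduced with a small, purely illustrative example against the existing spark.mllib API (the toy weights below are made up):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vectors

// Toy model; in practice the weights come from training.
val model = new LogisticRegressionModel(Vectors.dense(1.5, -0.3), 0.1)
val x = Vectors.dense(2.0, 1.0)

val label = model.predict(x)   // 0.0 or 1.0: the default threshold is applied
model.clearThreshold()         // mutates the model's output mode
val score = model.predict(x)   // now the uncalibrated score; the threshold is gone
```

The same call changes meaning depending on the model's mutable state, which is exactly what the separate predictClass/predictScore/predictProbability methods are meant to avoid.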
Jenkins, add to whitelist.
QA tests have started for PR 2137. This patch merges cleanly.
The assumption is usually unrealistic. For logistic regression, it is common to see predictions like 0.99999 or 0.000001, and they cannot be interpreted as probabilities without calibration. Logistic regression itself is not responsible for that. I created a JIRA for isotonic regression, which can be used for calibration: https://issues.apache.org/jira/browse/SPARK-3278. For the method names, my suggestion would be: add
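For reference, isotonic regression later landed in spark.mllib as org.apache.spark.mllib.regression.IsotonicRegression. A minimal sketch of how it could be used to calibrate the raw outputs of a logistic regression model (the held-out split and the iteration count are assumptions, not anything from this PR):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.{IsotonicRegression, IsotonicRegressionModel, LabeledPoint}
import org.apache.spark.rdd.RDD

def calibrate(train: RDD[LabeledPoint], holdout: RDD[LabeledPoint]): IsotonicRegressionModel = {
  val lr = LogisticRegressionWithSGD.train(train, numIterations = 100)
  lr.clearThreshold() // make predict() return the uncalibrated score

  // (label, score, weight) triples: fit a monotone map from score to label frequency.
  val scored = holdout.map(p => (p.label, lr.predict(p.features), 1.0))
  new IsotonicRegression().setIsotonic(true).run(scored)
}
```

The calibrated model's predict(score) then yields a probability estimate that is comparable across models.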
QA results for PR 2137:
@mengxr, might it be that you are mistaking logistic regression for Naive Bayes? Logistic regression typically predicts well-calibrated probabilities, see e.g. [1]; it only becomes problematic if the data can be separated perfectly. The learning algorithm returns ("is responsible for") a model that maximizes the likelihood of the data under the model assumption; in classification, the returned "probability" measures how likely it is that a certain label is generated by the learned model for a given example. Adding isotonic regression is a good idea anyway. I think we should definitely distinguish between the output of the linear model (the score) and the calibrated value (the probability); which one is needed depends on the task. Furthermore, having a function that changes the type of its output depending on the model is misleading: one should expect that a score function always returns an arbitrary real value and that the calibrated version returns a value between zero and one. Sklearn [2], for example, makes this distinction too:

Perhaps

[1] Niculescu-Mizil, Alexandru, and Rich Caruana. "Predicting good probabilities with supervised learning." Proceedings of the 22nd International Conference on Machine Learning. ACM, 2005.
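To illustrate the distinction being argued for here, a self-contained sketch (hypothetical helper functions, not MLlib API) of the two quantities for a linear binary classifier:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Raw margin w^T x + b: any real value.
def score(weights: Vector, intercept: Double, x: Vector): Double =
  weights.toArray.zip(x.toArray).map { case (w, xi) => w * xi }.sum + intercept

// Logistic-regression-style calibration of the margin: always in (0, 1).
def probability(weights: Vector, intercept: Double, x: Vector): Double =
  1.0 / (1.0 + math.exp(-score(weights, intercept, x)))

val w = Vectors.dense(2.0, -1.0)
score(w, 0.5, Vectors.dense(1.0, 3.0))       // -0.5
probability(w, 0.5, Vectors.dense(1.0, 3.0)) // ~0.38
```

Keeping the two behind differently named methods makes the contract of each obvious, which is the point of the predictScore/predictProbability split.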
* Make threshold mandatory

Currently, the output of ``predict`` for an example is either the score or the class. This side effect is caused by ``clearThreshold``. To clarify that behaviour, three different types of predict (predictScore, predictClass, predictProbability) were introduced; the threshold is no longer optional.

* Clarify classification interfaces

Currently, some functionality is spread over multiple models. In order to clarify the structure and simplify the implementation of more complex models (like multinomial logistic regression), two new classes are introduced:

- BinaryClassificationModel: for all models that derive a binary classification from a single weight vector. It comprises the thresholding functionality to derive a prediction from a score and basically captures SVMModel and LogisticRegressionModel.
- ProbabilisticClassificationModel: this trait defines the interface for models that return a calibrated confidence score (aka probability).

* Misc

- some renaming
- add test for probabilistic output
- scalastyle issues
- java test suite
- java logistic regression suite
- add deprecated versions
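A condensed sketch of the hierarchy this description implies (member names and signatures are illustrative, not the exact code in the patch):

```scala
import org.apache.spark.mllib.linalg.Vector

// Thresholding lives in one place; a mandatory threshold removes the
// clearThreshold() side effect.
abstract class BinaryClassificationModel(
    val weights: Vector,
    val intercept: Double,
    val threshold: Double) {

  /** Raw margin, shared by SVMModel and LogisticRegressionModel. */
  def predictScore(features: Vector): Double

  /** Hard 0/1 label derived from the score via the mandatory threshold. */
  def predictClass(features: Vector): Double =
    if (predictScore(features) > threshold) 1.0 else 0.0
}

// Only models that can return a calibrated confidence mix this in.
trait ProbabilisticClassificationModel {
  def predictProbability(features: Vector): Double
}
```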
@mengxr, do you agree with this modification?
@BigCrunsh No, I meant logistic regression. As you mentioned, LR's output will be off by a huge margin when the points are easily separable. There are other cases that bias the output as well, for example an unbalanced label distribution or different regularization. My point is that we shouldn't interpret LR's output as probabilities without calibration, and calling it

It will be really helpful if you are interested in implementing isotonic regression, which was used as the baseline in [1]. How about adding a new method called

It returns an array of size
QA tests have started for PR 2137 at commit
QA tests have finished for PR 2137 at commit
I have to admit that this PR may try to address too many issues at once. I think the major ones are:

Ideally,

My suggestions for models that inherit from GLMs are:

I think it makes sense to address these issues first, before we start implementing new algorithms. If we can agree on some of these points, I would be happy to help break down this PR (and also implement further algorithms such as isotonic regression or multiclass logistic regression).
@BigCrunsh I think this PR brings up a lot of really important issues, and I agree with you that it might be better broken into different PRs. Clarifying a hierarchy and the methods each abstract type should implement would be very useful (especially for ensemble methods), and perhaps should come first. I created a JIRA and am working on a design doc for this hierarchy, including for things beyond GLMs; it would be great to get feedback on it. (I linked it to your JIRA.) After we discuss on the JIRA, we can figure out a good plan for PRs. Does that sound reasonable? Thanks!

@jkbradley: yes, sounds great! I will give feedback on the design doc ASAP.

@BigCrunsh I just submitted a WIP for the new MLlib API. Apologies for the slow development, but I'd like to try to get your PR in to improve the original MLlib API as much as possible. Please let me know if you have feedback on my PR, and we can think about how best to align the new and old APIs. Thanks!
@BigCrunsh Could we please discuss our respective interfaces to converge on some standard naming conventions? I’d like to get your PR merged to update the main spark.mllib package, but also make sure it matches the experimental spark.ml package as much as possible. There are really only a few items to decide. My votes on naming classes & methods:
+1 for ProbabilisticClassificationModel. But the current version sounds specific to binary classification. Would you want to rename it to ProbabilisticBinaryClassificationModel, or generalize it to return the probability for each possible label? (I'm doing the latter in my PR, using predictProbabilities().) After we settle on these items, I'd like to make a detailed pass over your PR. Thanks in advance!
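A sketch of what the generalized, label-indexed variant being discussed might look like (the trait shape and signatures are assumptions, not the final API):

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

trait ProbabilisticClassificationModel extends Serializable {
  /** Number of labels the model distinguishes (2 in the binary case). */
  def numClasses: Int

  /** One probability per label, indices 0 until numClasses, summing to 1. */
  def predictProbabilities(features: Vector): Array[Double]

  /** Bulk variant over an RDD of feature vectors. */
  def predictProbabilities(testData: RDD[Vector]): RDD[Array[Double]] =
    testData.map(v => predictProbabilities(v))
}
```

The binary case then falls out as the entry at index 1, so a separate ProbabilisticBinaryClassificationModel would not be strictly necessary.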
@@ -45,14 +46,58 @@ abstract class GeneralizedLinearModel(val weights: Vector, val intercept: Double
 * @param weightMatrix Column vector containing the weights of the model
 * @param intercept Intercept of the model.
 */
protected def computeScore(
    dataMatrix: Vector, weightMatrix: Vector, intercept: Double): Double = {
  weightMatrix.toBreeze.dot(dataMatrix.toBreeze) + intercept
}
You can use mllib.linalg.BLAS.dot() now, rather than converting to breeze.
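In other words, the body above could be written roughly as follows (assuming the code stays inside spark.mllib, where the package-private BLAS helper is visible):

```scala
import org.apache.spark.mllib.linalg.{BLAS, Vector}

// Drop-in for the computeScore body above; avoids converting both vectors to Breeze.
protected def computeScore(
    dataMatrix: Vector,
    weightMatrix: Vector,
    intercept: Double): Double = {
  BLAS.dot(weightMatrix, dataMatrix) + intercept
}
```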
I think this has gone stale and timed out. Do you mind closing this PR?