
[SPARK-27577][MLlib] Correct thresholds downsampled in BinaryClassificationMetrics #24470

Closed
wants to merge 4 commits

Conversation

shishaochen commented Apr 26, 2019

What changes were proposed in this pull request?

Choose the last record in chunks when calculating metrics with downsampling in BinaryClassificationMetrics.

How was this patch tested?

A new unit test is added to verify thresholds from downsampled records.

shishaochen (Author)

@srowen Could you please have a look at this pull request? Thanks a lot!

srowen (Member) commented May 4, 2019

This doesn't look like a bug. I can't understand the argument in the JIRA why the last vs first element of a bin is more representative. Both are approximations.

shishaochen (Author) commented May 4, 2019

@srowen Yes, both are approximations. But the error is smaller if we choose the last element in each chunk as the threshold.
The essential problem is that the so-called "downsampling" is not true sampling: the code calculates precision, recall, etc. from the aggregated statistics (TP, FP, TN, FN) of all elements.

counts.mapPartitions(_.grouped(grouping.toInt).map { pairs =>
  // The score of the combined point will be just the first one's score
  val firstScore = pairs.head._1
  // The point will contain all counts in this chunk
  val agg = new BinaryLabelCounter()
  pairs.foreach(pair => agg += pair._2)
  (firstScore, agg)
})

As you can see, the counters (BinaryLabelCounter) of all elements are merged into one rather than only the first element being returned.
Thus, by the definition of a threshold, the score of the last element (which is the minimum in the chunk) is the right threshold to use at inference time.
In online systems, we need to choose the right threshold to predict whether an instance is positive (score >= threshold) or negative (score < threshold).
For example, in a high-risk-detection model for videos where recall is extremely important, we choose a threshold from what BinaryClassificationMetrics prints. When numBins is set to 200, each chunk holds about 0.5% of the instances. A wrong threshold taken from the first element's score will miss many videos (out of 1 million per day in total) that should be flagged as dangerous.
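To make the mismatch concrete, here is a toy illustration in plain Scala (not Spark code; the data and chunk size are invented for this example):

```scala
// (score, label) pairs already sorted by score descending, chunked in
// groups of 2 the same way the downsampling code does.
val counts = Seq((0.9, 1), (0.8, 1), (0.6, 0), (0.5, 1))
val chunks = counts.grouped(2).toSeq

val firstScores = chunks.map(_.head._1) // current behavior: 0.9, 0.6
val lastScores  = chunks.map(_.last._1) // proposed fix:     0.8, 0.5

// Predicting positive iff score >= threshold. The first chunk's aggregated
// counter covers 2 instances, but threshold 0.9 marks only 1 instance
// positive; threshold 0.8 marks exactly those 2 instances positive.
def positives(t: Double): Int = counts.count(_._1 >= t)
println(firstScores.map(positives)) // List(1, 3)
println(lastScores.map(positives))  // List(2, 4)
```

Only the last (minimal) score of each chunk reproduces the aggregated counts (2 and 4 cumulative instances), so metrics computed at that threshold agree with the full data set.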

srowen (Member) commented May 4, 2019

I still don't see the argument that the first or last is better. They are simply the endpoints of the range of scores within the bin. As the number of bins increases, the range is smaller. If you are worried about this difference, you need more bins. Your argument cuts two ways: having a slightly higher threshold than desired can cause as many problems as slightly smaller.

What would be possibly better here is to compute the score of a bin as a weighted average of its elements. That would be OK though you'd have to change many tests. I think the current implementation is designed to match scikit (?)
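For reference, the weighted-average idea could look roughly like the following sketch (hypothetical, plain Scala, not what was merged; the (score, count) pair layout is a simplification of the real (score, BinaryLabelCounter) pairs):

```scala
// Score a bin by the count-weighted mean of its elements' scores instead of
// an endpoint of the range. Toy (score, count) pairs for one chunk:
val pairs = Seq((0.9, 3L), (0.8, 1L))
val total = pairs.map(_._2).sum
val binScore = pairs.map { case (s, c) => s * c }.sum / total
println(binScore) // (0.9*3 + 0.8*1) / 4 = 0.875
```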

shishaochen (Author)

@srowen I get your point!
Actually, if we choose the score of the last element in each chunk as the threshold, the calculated recall, precision, and F-measure at each threshold are exactly the same as those without sampling (numBins = 0).
In other words, they are exact metrics; the only difference from no sampling is the number of thresholds printed for the precision/recall/F1 curve.
So why return approximate values instead of the exact metrics of the full data set?

srowen (Member) commented May 5, 2019

Ah OK I agree with you, I see the argument now. It comes from the fact that the scores are sorted descending. The score of each bin is currently its maximum, not minimum. The precision / recall for each bin is calculated as if all of the instances in the bin were classified as positive. This only makes sense if the score is the minimum.

You might mention something to this effect in the comment in the code.
Also I think this may change some test results; let's see.

SparkQA commented May 5, 2019

Test build #4776 has finished for PR 24470 at commit fa2eae6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

shishaochen (Author) commented May 6, 2019

@srowen Many thanks for your patience!
I have added an explanation in the code comments in BinaryClassificationMetrics.scala. Do the words below match your expectations?

counts.mapPartitions(_.grouped(grouping.toInt).map { pairs =>
  // The score of the combined point will be just the last one's score, which is
  // also the minimum in each chunk since all scores are already sorted in
  // descending order.
  val lastScore = pairs.last._1
  // The combined point will contain all counts in this chunk. Thus, metrics
  // calculated at its score (or so-called threshold), like precision and
  // recall, are the same as those computed without sampling.
  val agg = new BinaryLabelCounter()
  pairs.foreach(pair => agg += pair._2)
  (lastScore, agg)
})

Besides, I have scanned all unit tests and class references in the Spark code repository. None of them uses numBins except one unit test in BinaryClassificationMetricsSuite, which only tests the ROC curve without thresholds. Thus, it is safe to merge this pull request.
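As a sanity check of that claim, here is a self-contained sketch (plain Scala, no Spark; the data and the Counter helper are invented) that mimics the fixed downsampling and then computes cumulative precision per threshold:

```scala
// Mimic the patched grouping: per chunk, keep (last score, aggregated counts),
// then accumulate counts across bins the way the metrics computation does.
final case class Counter(pos: Long, neg: Long)

val sorted = Seq((0.9, 1), (0.8, 0), (0.7, 1), (0.6, 1)) // sorted descending
val bins = sorted.grouped(2).map { pairs =>
  val pos = pairs.count(_._2 == 1).toLong
  (pairs.last._1, Counter(pos, pairs.length - pos))
}.toSeq

var (tp, fp) = (0L, 0L)
val precisionByThreshold = bins.map { case (t, c) =>
  tp += c.pos; fp += c.neg
  (t, tp.toDouble / (tp + fp))
}
println(precisionByThreshold) // List((0.8,0.5), (0.6,0.75))
```

At threshold 0.8, the full data set has 2 instances with score >= 0.8, of which 1 is positive (precision 0.5); at 0.6 it is 3 of 4 (precision 0.75) — identical to the downsampled values, matching the argument above.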

shishaochen (Author)

@srowen Excuse me, is there anything I should do before this pull request is merged? Thanks a lot!

srowen (Member) commented May 6, 2019

No, I leave these open for a day or two to make sure there aren't more comments.

srowen pushed a commit that referenced this pull request May 7, 2019
…cationMetrics

## What changes were proposed in this pull request?

Choose the last record in chunks when calculating metrics with downsampling in `BinaryClassificationMetrics`.

## How was this patch tested?

A new unit test is added to verify thresholds from downsampled records.

Closes #24470 from shishaochen/spark-mllib-binary-metrics.

Authored-by: Shaochen Shi <shishaochen@bytedance.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
(cherry picked from commit d5308cd)
Signed-off-by: Sean Owen <sean.owen@databricks.com>
srowen closed this in d5308cd May 7, 2019
srowen (Member) commented May 7, 2019

Merged to master/2.4/2.3

srowen pushed a commit that referenced this pull request May 7, 2019
…cationMetrics

rluta pushed a commit to rluta/spark that referenced this pull request Sep 17, 2019
…cationMetrics

kai-chi pushed a commit to kai-chi/spark that referenced this pull request Sep 26, 2019
…cationMetrics
