[SPARK-27577][MLlib] Correct thresholds downsampled in BinaryClassificationMetrics #24470
Conversation
Change-Id: Ic7d10b6844b480c77940707db9a722fd6927bd67
Change-Id: Ifd4cd7b1181957c99be33e55995aa4c62d963d5b
Change-Id: Ib815be7c34b913b6a005fb8f7f53f182be8c9e21
@srowen Could you please have a look at this pull request? Thanks a lot!
This doesn't look like a bug. I can't understand the argument in the JIRA for why the last vs. first element of a bin is more representative. Both are approximations.
@srowen Yes, both are approximations, but there is less error if we choose the last element in each chunk as the threshold. The current downsampling code is:

counts.mapPartitions(_.grouped(grouping.toInt).map { pairs =>
  // The score of the combined point will be just the first one's score
  val firstScore = pairs.head._1
  // The point will contain all counts in this chunk
  val agg = new BinaryLabelCounter()
  pairs.foreach(pair => agg += pair._2)
  (firstScore, agg)
})

You can see that the counter (agg) accumulates the counts of every pair in the chunk, while the chosen score is only the first, i.e. the largest, one in the chunk.
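A standalone sketch (plain Scala, with made-up scores and plain Int counts standing in for the real RDD and BinaryLabelCounter; not code from this PR) illustrates the current behaviour: each downsampled bin is labelled with the maximum score of its chunk, while its counts cover the whole chunk.

object FirstScoreDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical (score, count) pairs, already sorted by score in descending order.
    val sorted = Seq(0.9 -> 1, 0.8 -> 1, 0.7 -> 1, 0.6 -> 1, 0.5 -> 1, 0.4 -> 1)
    val grouping = 3
    val downsampled = sorted.grouped(grouping).map { pairs =>
      val firstScore = pairs.head._1    // the LARGEST score in the chunk
      val agg = pairs.map(_._2).sum     // counts of ALL pairs in the chunk
      (firstScore, agg)
    }.toList
    // Prints List((0.9,3), (0.6,3)): each bin carries three instances,
    // but is labelled with the largest score it contains.
    println(downsampled)
  }
}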
I still don't see the argument that the first or last is better. They are simply the endpoints of the range of scores within the bin. As the number of bins increases, the range gets smaller; if you are worried about this difference, you need more bins. Your argument cuts both ways: having a slightly higher threshold than desired can cause as many problems as a slightly lower one. What would possibly be better here is to compute the score of a bin as a weighted average of its elements. That would be OK, though you'd have to change many tests. I think the current implementation is designed to match scikit (?)
@srowen I get your point!
Ah OK, I agree with you; I see the argument now. It comes from the fact that the scores are sorted in descending order. The score of each bin is currently its maximum, not its minimum. The precision/recall for each bin is calculated as if all of the instances in the bin were classified as positive, which only makes sense if the score is the minimum. You might mention something to this effect in the comment in the code.
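To make that point concrete, here is a small worksheet-style example (plain Scala, invented scores and labels, not code from this PR). For a bin containing the scores 0.9, 0.7 and 0.5, the aggregated counts treat all three instances as predicted positive; the only single threshold t that reproduces that prediction set via score >= t is the bin's minimum score.

// Hypothetical bin: three (score, label) pairs, scores sorted in descending order.
val bin = Seq((0.9, 1.0), (0.7, 0.0), (0.5, 1.0))

// The bin's aggregated counts treat every instance in it as a positive prediction.
val predictedPositives = bin.size                 // 3

// Number of instances predicted positive when classifying with threshold t.
def positivesAt(t: Double): Int = bin.count(_._1 >= t)

assert(positivesAt(0.5) == predictedPositives)    // minimum score: consistent
assert(positivesAt(0.9) != predictedPositives)    // maximum score: only 1 positive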
Test build #4776 has finished for PR 24470 at commit
Change-Id: I270aa88bc06c7aac9e4fcf4978f4bb6b9dcac93b
@srowen Many thanks for your patience! The updated code is:

counts.mapPartitions(_.grouped(grouping.toInt).map { pairs =>
  // The score of the combined point will be just the last one's score, which is
  // also the minimal one in each chunk since all scores are already sorted in
  // descending order.
  val lastScore = pairs.last._1
  // The combined point will contain all counts in this chunk. Thus, calculated
  // metrics (like precision, recall, etc.) on its score (or so-called threshold)
  // are the same as those without sampling.
  val agg = new BinaryLabelCounter()
  pairs.foreach(pair => agg += pair._2)
  (lastScore, agg)
})

Besides, I have scanned all unit tests and class references in the Spark code repository. None of them uses downsampling, so no existing tests need to change.
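For reference, a minimal end-to-end sketch of how the change is observable through the public API (illustrative data, a local SparkContext, and a single partition are assumed; this snippet is not part of the PR):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("thresholds-demo"))

// Six distinct scores in one partition, so numBins = 2 groups them into chunks of three.
val scoreAndLabels = sc.parallelize(Seq(
  (0.9, 1.0), (0.8, 0.0), (0.7, 1.0), (0.6, 0.0), (0.5, 1.0), (0.4, 0.0)), 1)

val metrics = new BinaryClassificationMetrics(scoreAndLabels, 2)  // numBins = 2

// With this change, each reported threshold is the minimum score of its chunk
// (0.7 and 0.4 for this toy data), so precision/recall computed at those thresholds
// match the values computed without downsampling.
metrics.thresholds().collect().foreach(println)

sc.stop()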
@srowen Excuse me, is there anything I should do before this pull request can be merged? Thanks a lot!
No, I leave these open for a day or two to make sure there aren't more comments.
[SPARK-27577][MLlib] Correct thresholds downsampled in BinaryClassificationMetrics

## What changes were proposed in this pull request?
Choose the last record in chunks when calculating metrics with downsampling in `BinaryClassificationMetrics`.

## How was this patch tested?
A new unit test is added to verify thresholds from downsampled records.

Closes #24470 from shishaochen/spark-mllib-binary-metrics.
Authored-by: Shaochen Shi <shishaochen@bytedance.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
(cherry picked from commit d5308cd)
Signed-off-by: Sean Owen <sean.owen@databricks.com>
Merged to master/2.4/2.3
What changes were proposed in this pull request?
Choose the last record in chunks when calculating metrics with downsampling in `BinaryClassificationMetrics`.

How was this patch tested?
A new unit test is added to verify thresholds from downsampled records.
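The added test itself is not reproduced here, but a hedged sketch of such a check (ScalaTest style, hypothetical data, assuming a suite that provides a test SparkContext named sc) could compare downsampled metrics against the exact ones:

// Not the actual test from this PR; an illustrative check of the same property.
test("binary evaluation metrics with downsampling") {
  val scoreAndLabels = sc.parallelize(
    Seq((0.1, 0.0), (0.2, 0.0), (0.4, 1.0), (0.6, 1.0), (0.8, 1.0), (0.9, 1.0)), 1)
  val original = new BinaryClassificationMetrics(scoreAndLabels)
  val downsampled = new BinaryClassificationMetrics(scoreAndLabels, 3)  // numBins = 3

  val exactPrecision = original.precisionByThreshold().collect().toMap
  // Every downsampled threshold must be one of the real scores, and precision at
  // that threshold must agree with the computation done without downsampling.
  downsampled.precisionByThreshold().collect().foreach { case (t, p) =>
    assert(exactPrecision.contains(t))
    assert(math.abs(p - exactPrecision(t)) < 1e-12)
  }
}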