11 - 3 - Error Metrics for Skewed Classes (12 min).srt
1
00:00:00,290 --> 00:00:01,690
In the previous video, I talked
(Subtitles compiled by Huang Haiguang, Ocean University of China, haiguang2000@qq.com)
2
00:00:02,060 --> 00:00:03,900
about error analysis and the
3
00:00:04,350 --> 00:00:06,070
importance of having error
4
00:00:06,330 --> 00:00:07,480
metrics, that is, of having
5
00:00:08,210 --> 00:00:10,200
a single real number evaluation metric
6
00:00:11,020 --> 00:00:13,290
for your learning algorithm to tell how well it's doing.
7
00:00:14,310 --> 00:00:15,670
In the context of evaluation
8
00:00:16,700 --> 00:00:18,320
and of error metrics, there is
9
00:00:18,430 --> 00:00:20,290
one important case, where it's
10
00:00:20,480 --> 00:00:22,180
particularly tricky to come
11
00:00:22,510 --> 00:00:24,430
up with an appropriate error metric,
12
00:00:24,930 --> 00:00:26,990
or evaluation metric, for your learning algorithm.
13
00:00:28,040 --> 00:00:29,140
That case is the case
14
00:00:29,610 --> 00:00:31,310
of what's called skewed classes.
15
00:00:32,610 --> 00:00:33,480
Let me tell you what that means.
16
00:00:36,170 --> 00:00:37,550
Consider the problem of cancer
17
00:00:38,180 --> 00:00:40,040
classification, where we have
18
00:00:40,300 --> 00:00:41,960
features of medical patients and
19
00:00:42,070 --> 00:00:44,050
we want to decide whether or not they have cancer.
20
00:00:44,630 --> 00:00:45,790
So this is like the malignant
21
00:00:46,350 --> 00:00:48,290
versus benign tumor classification
22
00:00:48,930 --> 00:00:50,070
example that we had earlier.
23
00:00:51,140 --> 00:00:52,360
So let's say y equals 1 if the
24
00:00:52,550 --> 00:00:53,780
patient has cancer and y equals 0
25
00:00:54,280 --> 00:00:56,530
if they do not.
26
00:00:56,810 --> 00:00:57,460
We have trained a logistic regression
27
00:00:57,940 --> 00:00:59,780
classifier and let's say
28
00:01:00,000 --> 00:01:01,520
we test our classifier on
29
00:01:01,660 --> 00:01:04,470
a test set and find that we get 1 percent error.
30
00:01:04,790 --> 00:01:05,720
So, we're making 99% correct diagnoses.
31
00:01:06,530 --> 00:01:09,610
Seems like a really impressive result, right?
32
00:01:09,910 --> 00:01:10,920
We're correct 99 percent of the time.
33
00:01:12,560 --> 00:01:13,630
But now, let's say we find
34
00:01:13,940 --> 00:01:15,660
out that only 0.5 percent
35
00:01:16,510 --> 00:01:17,950
of patients in our
36
00:01:18,160 --> 00:01:19,590
training and test sets actually have cancer.
37
00:01:20,400 --> 00:01:21,900
So only half a
38
00:01:21,950 --> 00:01:23,460
percent of the patients that
39
00:01:23,580 --> 00:01:25,500
come through our screening process have cancer.
40
00:01:26,560 --> 00:01:27,970
In this case, the 1%
41
00:01:28,270 --> 00:01:30,010
error no longer looks so impressive.
42
00:01:31,130 --> 00:01:32,510
And in particular, here's a piece
43
00:01:32,670 --> 00:01:33,730
of code, here's actually a piece
44
00:01:33,850 --> 00:01:35,730
of non-learning code that takes
45
00:01:36,080 --> 00:01:38,260
this input of features x and ignores it.
46
00:01:38,480 --> 00:01:39,820
It just sets y equals 0
47
00:01:39,950 --> 00:01:41,640
and always predicts, you
48
00:01:41,720 --> 00:01:43,920
know, nobody has cancer, and this
49
00:01:44,170 --> 00:01:45,720
algorithm would actually get
50
00:01:46,000 --> 00:01:47,840
0.5 percent error.
51
00:01:48,830 --> 00:01:50,280
So this is even better than
52
00:01:50,400 --> 00:01:51,140
the 1% error that we were getting just now,
53
00:01:51,240 --> 00:01:52,960
and this is a non-learning
54
00:01:53,160 --> 00:01:54,600
algorithm that, you know, is just
55
00:01:54,800 --> 00:01:56,950
predicting y equals 0 all the time.
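The degenerate baseline described here is easy to reproduce. Below is a minimal sketch, not from the course, that uses a made-up label array for illustration: a "classifier" that ignores the features x and always predicts y = 0, which on a test set with only 0.5% positive examples gets 99.5% accuracy, i.e. 0.5% error.

import numpy as np

# Hypothetical test set: 1000 patients, only 5 of whom (0.5%) have cancer (y = 1).
y_test = np.zeros(1000, dtype=int)
y_test[:5] = 1

def predict_always_negative(X):
    """Non-learning 'classifier': ignore the features and predict y = 0 for everyone."""
    return np.zeros(len(X), dtype=int)

X_test = np.random.randn(1000, 3)     # placeholder features; they are never used
y_pred = predict_always_negative(X_test)

accuracy = np.mean(y_pred == y_test)  # 0.995 -> 99.5% accuracy
error = 1.0 - accuracy                # 0.005 -> 0.5% classification error
print(f"accuracy = {accuracy:.3f}, error = {error:.3f}")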
56
00:01:57,990 --> 00:01:59,430
So this setting of when
57
00:02:00,060 --> 00:02:01,980
the ratio of positive to
58
00:02:02,180 --> 00:02:04,130
negative examples is very close
59
00:02:04,810 --> 00:02:06,480
to one of two extremes, where,
60
00:02:07,040 --> 00:02:08,620
in this case, the number of
61
00:02:08,710 --> 00:02:10,050
positive examples is much,
62
00:02:10,350 --> 00:02:11,310
much smaller than the number
63
00:02:11,620 --> 00:02:13,180
of negative examples because y
64
00:02:13,480 --> 00:02:15,500
equals 1 so rarely, this
65
00:02:15,730 --> 00:02:16,850
is what we call the
66
00:02:17,000 --> 00:02:18,600
case of skewed classes.
67
00:02:20,790 --> 00:02:21,710
We just have a lot more
68
00:02:22,000 --> 00:02:23,140
examples from one class
69
00:02:23,570 --> 00:02:25,040
than from the other class.
70
00:02:25,220 --> 00:02:26,560
And by just predicting y equals
71
00:02:26,920 --> 00:02:28,270
0 all the time, or maybe
72
00:02:28,650 --> 00:02:29,650
predicting y equals 1
73
00:02:29,790 --> 00:02:32,080
all the time, an algorithm can do pretty well.
74
00:02:32,980 --> 00:02:34,050
So the problem with using
75
00:02:34,670 --> 00:02:36,210
classification error or classification
76
00:02:36,590 --> 00:02:39,240
accuracy as our evaluation metric is the following.
77
00:02:40,430 --> 00:02:41,360
Let's say you have one learning
78
00:02:41,700 --> 00:02:43,570
algorithm that's getting 99.2% accuracy.
79
00:02:46,530 --> 00:02:47,200
So, that's a 0.8% error.
80
00:02:47,330 --> 00:02:50,850
Let's say you
81
00:02:51,000 --> 00:02:52,000
make a change to your algorithm
82
00:02:52,810 --> 00:02:53,890
and you now are getting
83
00:02:54,280 --> 00:02:56,080
99.5% accuracy.
84
00:02:59,280 --> 00:03:02,110
That is 0.5% error.
85
00:03:04,230 --> 00:03:06,460
So, is this an improvement to the algorithm or not?
86
00:03:06,770 --> 00:03:07,930
One of the nice things
87
00:03:08,300 --> 00:03:09,990
about having a single real
88
00:03:10,120 --> 00:03:11,480
number evaluation metric is this
89
00:03:11,650 --> 00:03:13,080
helps us to quickly decide if
90
00:03:13,240 --> 00:03:15,530
we just made a good change to the algorithm or not.
91
00:03:16,370 --> 00:03:20,160
By going from 99.2% accuracy to 99.5% accuracy,
92
00:03:21,430 --> 00:03:22,490
you know, did we just do something
93
00:03:22,780 --> 00:03:23,640
useful, or did we
94
00:03:23,770 --> 00:03:25,150
just replace our code with
95
00:03:25,320 --> 00:03:26,690
something that just predicts y equals
96
00:03:27,000 --> 00:03:28,830
zero more often?
97
00:03:29,300 --> 00:03:30,430
So, if you have very skewed classes,
98
00:03:31,340 --> 00:03:33,280
it becomes much harder to use
99
00:03:33,640 --> 00:03:36,000
just classification accuracy, because you
100
00:03:36,120 --> 00:03:37,730
can get very high classification accuracies
101
00:03:38,420 --> 00:03:40,950
or very low errors, and
102
00:03:41,110 --> 00:03:42,880
it's not always clear if
103
00:03:43,070 --> 00:03:44,190
doing so is really improving
104
00:03:44,770 --> 00:03:45,780
the quality of your classifier,
105
00:03:46,400 --> 00:03:48,320
because predicting y equals 0 all the
106
00:03:48,380 --> 00:03:50,710
time doesn't seem like
107
00:03:51,570 --> 00:03:52,570
a particularly good classifier.
108
00:03:53,900 --> 00:03:55,500
But just predicting y equals 0 more
109
00:03:55,720 --> 00:03:57,300
often can bring your error
110
00:03:57,830 --> 00:03:59,460
down to, you know, maybe as
111
00:03:59,650 --> 00:04:01,120
low as 0.5%.
112
00:04:01,490 --> 00:04:02,590
When we're faced with such
113
00:04:02,770 --> 00:04:04,990
skewed classes, therefore, we
114
00:04:05,250 --> 00:04:06,350
would want to come up
115
00:04:06,470 --> 00:04:07,920
with a different error metric
116
00:04:08,320 --> 00:04:09,500
or a different evaluation metric.
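A quick worked check makes the ambiguity concrete (this assumes the 0.5% positive rate carried over from the earlier cancer example). The always-negative predictor h(x) = 0 for every x already reaches

accuracy = 1 - P(y = 1) = 1 - 0.005 = 0.995 = 99.5%

so moving from 99.2% to 99.5% accuracy is exactly what you would also see if the classifier had simply drifted toward predicting y = 0 all the time.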
117
00:04:10,290 --> 00:04:12,360
One such evaluation metric is
118
00:04:12,870 --> 00:04:14,240
what's called precision and recall.
119
00:04:15,440 --> 00:04:16,410
Let me explain what that is.
120
00:04:17,520 --> 00:04:19,890
Let's say we are evaluating a classifier on the test set.
121
00:04:20,750 --> 00:04:21,800
For the examples in the
122
00:04:21,890 --> 00:04:23,890
test set, the actual
123
00:04:25,450 --> 00:04:26,880
class of that example
124
00:04:27,320 --> 00:04:28,440
in the test set is going to
125
00:04:28,550 --> 00:04:29,810
be either one or zero, right,
126
00:04:30,440 --> 00:04:32,520
if this is a binary classification problem.
127
00:04:33,870 --> 00:04:34,960
And what our learning algorithm
128
00:04:35,360 --> 00:04:37,070
will do is it will, you know,
129
00:04:37,930 --> 00:04:39,270
predict some value for the
130
00:04:39,450 --> 00:04:41,160
class, and our learning
131
00:04:41,560 --> 00:04:43,300
algorithm will predict the value
132
00:04:43,760 --> 00:04:44,830
for each example in my
133
00:04:44,910 --> 00:04:46,520
test set, and the predicted value
134
00:04:46,920 --> 00:04:48,560
will also be either one or zero.
135
00:04:50,050 --> 00:04:52,060
So let me draw a two
136
00:04:52,270 --> 00:04:53,340
by two table as follows,
137
00:04:53,910 --> 00:04:55,870
where each of these entries depends
138
00:04:56,320 --> 00:04:57,800
on what was the
139
00:04:57,960 --> 00:04:59,350
actual class and what was the predicted class.
140
00:05:00,220 --> 00:05:01,270
If we have an
141
00:05:01,570 --> 00:05:02,890
example where the actual class is
142
00:05:02,970 --> 00:05:03,950
one and the predicted class
143
00:05:04,240 --> 00:05:06,140
is one, then that's called
144
00:05:07,620 --> 00:05:08,640
an example that's a true
145
00:05:08,940 --> 00:05:10,300
positive, meaning our algorithm
146
00:05:10,730 --> 00:05:11,700
predicted that it's positive
147
00:05:12,400 --> 00:05:15,780
and in reality the example is positive.
148
00:05:16,240 --> 00:05:17,300
If our learning algorithm predicted that
149
00:05:17,490 --> 00:05:19,010
something is negative, class zero,
150
00:05:19,570 --> 00:05:20,620
and the actual class is also
151
00:05:20,970 --> 00:05:23,650
class zero, then that's what's called a true negative.
152
00:05:24,070 --> 00:05:26,370
We predicted zero and it actually is zero.
153
00:05:27,880 --> 00:05:28,740
To fill in the other two boxes,
154
00:05:29,470 --> 00:05:31,120
if our learning algorithm predicts that
155
00:05:31,360 --> 00:05:33,210
the class is one but the
156
00:05:34,340 --> 00:05:36,370
actual class is zero, then
157
00:05:36,670 --> 00:05:37,910
that's called a false positive.
158
00:05:39,350 --> 00:05:40,630
So that means our algorithm predicted
159
00:05:40,830 --> 00:05:41,970
that the patient has cancer when in
160
00:05:42,190 --> 00:05:43,520
reality the patient does not.
161
00:05:44,730 --> 00:05:47,340
And finally, the last box is a zero, one.
162
00:05:48,200 --> 00:05:50,330
That's called a false negative,
163
00:05:51,180 --> 00:05:52,690
because our algorithm predicted
164
00:05:53,450 --> 00:05:56,170
zero, but the actual class was one.
165
00:05:57,230 --> 00:05:59,020
And so, we
166
00:05:59,150 --> 00:06:00,830
have this little sort of two by
167
00:06:00,990 --> 00:06:02,720
two table based on
168
00:06:03,250 --> 00:06:05,500
what was the actual class and what was the predicted class.
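The two-by-two table can also be written as a few lines of code. The following is a minimal sketch, not from the course materials, that counts true positives, true negatives, false positives, and false negatives for arrays of actual and predicted binary labels; confusion_counts and the sample labels are illustrative names.

import numpy as np

def confusion_counts(y_actual, y_pred):
    """Return the four cells of the 2x2 table for binary labels in {0, 1}."""
    y_actual = np.asarray(y_actual)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_actual == 1))  # true positive: predicted 1, actual 1
    tn = np.sum((y_pred == 0) & (y_actual == 0))  # true negative: predicted 0, actual 0
    fp = np.sum((y_pred == 1) & (y_actual == 0))  # false positive: predicted 1, actual 0
    fn = np.sum((y_pred == 0) & (y_actual == 1))  # false negative: predicted 0, actual 1
    return tp, tn, fp, fn

# Tiny made-up example:
tp, tn, fp, fn = confusion_counts([1, 0, 0, 1, 0], [1, 0, 1, 0, 0])
print(tp, tn, fp, fn)  # 1 2 1 1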
169
00:06:07,080 --> 00:06:08,380
So here's a different way
170
00:06:08,690 --> 00:06:10,310
of evaluating the performance of
171
00:06:10,420 --> 00:06:11,940
our algorithm. We're
172
00:06:12,550 --> 00:06:12,870
going to compute two numbers.
173
00:06:13,310 --> 00:06:14,780
The first is called precision -
174
00:06:14,940 --> 00:06:16,100
and what that says is:
175
00:06:17,170 --> 00:06:18,330
of all the patients where we've
176
00:06:18,580 --> 00:06:19,580
predicted that they have cancer,
177
00:06:20,640 --> 00:06:23,140
what fraction of them actually have cancer?
178
00:06:24,560 --> 00:06:25,310
So let me write this down:
179
00:06:26,020 --> 00:06:27,300
the precision of a classifier
180
00:06:27,680 --> 00:06:29,070
is the number of true
181
00:06:29,310 --> 00:06:31,880
positives divided by
182
00:06:32,940 --> 00:06:35,190
the number that we predicted
183
00:06:37,140 --> 00:06:37,370
as positive, right?
184
00:06:39,150 --> 00:06:40,660
So of all the patients that
185
00:06:41,090 --> 00:06:43,590
we went to and told them, "We think you have cancer."
186
00:06:43,890 --> 00:06:45,730
Of all those patients, what
187
00:06:45,890 --> 00:06:47,410
fraction of them actually have cancer?
188
00:06:47,500 --> 00:06:48,920
So that's called precision.
189
00:06:49,800 --> 00:06:50,680
And another way to write this
190
00:06:50,950 --> 00:06:54,920
would be true positives, and
191
00:06:55,010 --> 00:06:56,430
then in the denominator is the
192
00:06:56,670 --> 00:06:59,050
number of predicted positives, and
193
00:06:59,210 --> 00:07:00,160
so that would be the
194
00:07:00,240 --> 00:07:01,730
sum of the, you know, entries
195
00:07:02,410 --> 00:07:04,510
in this first row of the table.
196
00:07:04,720 --> 00:07:07,760
So it would be true positives divided by true positives -
197
00:07:08,670 --> 00:07:10,470
I'm going to abbreviate positive
198
00:07:11,220 --> 00:07:12,980
as POS - and then
199
00:07:13,130 --> 00:07:15,470
plus false positives, again
200
00:07:15,890 --> 00:07:18,550
abbreviating positive using POS.
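Putting the definition just given into code: precision = true positives / (true positives + false positives), that is, TP divided by the sum of the first row of the table. The sketch below is illustrative, not course code; it reuses the counts from the hypothetical confusion_counts helper above, and the return value of 0.0 when nothing is predicted positive is an assumed convention.

def precision(tp, fp):
    """Precision = true positives / number of predicted positives (TP + FP)."""
    predicted_positives = tp + fp
    if predicted_positives == 0:
        return 0.0  # assumed convention when the classifier never predicts y = 1
    return tp / predicted_positives

# Example with the counts from the earlier sketch: tp = 1, fp = 1 -> precision = 0.5
print(precision(1, 1))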