15 - 4 - Developing and Evaluating an Anomaly Detection System (13 min).srt
1
00:00:00,120 --> 00:00:01,220
In the last video, we developed
在上一段视频中
(字幕整理:中国海洋大学 黄海广,haiguang2000@qq.com )
2
00:00:01,850 --> 00:00:03,200
an anomaly detection algorithm.
我们推导了异常检测算法
3
00:00:04,150 --> 00:00:05,240
In this video, I'd like to
在这段视频中
4
00:00:05,300 --> 00:00:06,870
talk about the process of how
我想介绍一下
5
00:00:07,090 --> 00:00:08,750
to go about developing a specific
如何开发一个
6
00:00:09,060 --> 00:00:10,790
application of anomaly detection
关于异常检测的应用
7
00:00:11,410 --> 00:00:12,810
to a problem and in particular
来解决一个实际问题
8
00:00:13,470 --> 00:00:14,500
this will focus on the problem
具体来说
9
00:00:15,090 --> 00:00:18,700
of how to evaluate an anomaly detection algorithm. In
我们将重点关注如何评价一个异常检测算法
10
00:00:18,880 --> 00:00:20,490
previous videos, we've already talked
在前面的视频中 我们已经提到了
11
00:00:20,800 --> 00:00:22,380
about the importance of real
使用实数评价法的重要性
12
00:00:22,570 --> 00:00:24,770
number evaluation and this captures the idea that
这样做的想法是
13
00:00:25,170 --> 00:00:26,810
when you're trying to develop
当你在用某个学习算法
14
00:00:27,270 --> 00:00:28,460
a learning algorithm for a
来开发一个具体的
15
00:00:28,690 --> 00:00:30,300
specific application, you need to
机器学习应用时
16
00:00:30,560 --> 00:00:31,540
often make a lot of choices
你常常需要做出很多决定
17
00:00:31,710 --> 00:00:34,410
like, you know, choosing what features to use and then so on.
比如说 选择用什么样的特征 等等
18
00:00:35,010 --> 00:00:36,800
And making decisions about all
而如果你找到某种
19
00:00:36,880 --> 00:00:38,540
of these choices is often much
评价算法的方式
20
00:00:38,780 --> 00:00:39,890
easier if you have
直接返回一个数字
21
00:00:40,040 --> 00:00:41,330
a way to evaluate your learning
来告诉你算法的好坏
22
00:00:41,410 --> 00:00:43,190
algorithm that just gives you back a number.
那么你做这些决定就显得更容易了
23
00:00:44,200 --> 00:00:44,950
So if you're trying to decide,
所以比如你要决定
24
00:00:45,980 --> 00:00:47,130
you know, I have an idea for
现在有一个额外的特征
25
00:00:47,220 --> 00:00:49,730
one extra feature, do I include this feature or not.
我要不要把这个特征考虑进来?
26
00:00:50,560 --> 00:00:51,560
If you can run the algorithm
如果你带上这个特征
27
00:00:51,760 --> 00:00:52,830
with the feature, and run the
运行你的算法
28
00:00:52,960 --> 00:00:54,420
algorithm without the feature, and
再去掉这个特征运行你的算法
29
00:00:54,570 --> 00:00:55,960
just get back a number that
然后得到某个返回的数字
30
00:00:56,100 --> 00:00:57,350
tells you, you know, did
这个数字就直接告诉你
31
00:00:57,460 --> 00:01:00,070
it improve or worsen performance to add this feature?
这个特征到底是让算法表现变好了还是变差了
32
00:01:00,670 --> 00:01:01,480
Then it gives you a much better
这样 你就有了一种更好
33
00:01:01,670 --> 00:01:04,370
way, a much simpler way, with which
更简单的方法
34
00:01:04,590 --> 00:01:06,110
to decide whether or not to include that feature.
来确定是不是应该加上这个特征
35
00:01:07,570 --> 00:01:09,010
So in order to be
为了更快地
36
00:01:09,200 --> 00:01:10,850
able to develop an anomaly
开发出一个
37
00:01:11,410 --> 00:01:13,880
detection system quickly, it would
异常检测系统
38
00:01:14,080 --> 00:01:14,960
be really helpful to have
那么最好能找到某种
39
00:01:15,150 --> 00:01:17,820
a way of evaluating an anomaly detection system.
评价异常检测系统的方法
40
00:01:19,260 --> 00:01:20,420
In order to do this,
为了做到这一点
41
00:01:20,790 --> 00:01:22,380
in order to evaluate an anomaly
为了能评价一个
42
00:01:22,730 --> 00:01:24,080
detection system, we're
异常检测系统
43
00:01:24,310 --> 00:01:26,380
actually going to assume we have some labeled data.
我们先假定已有了一些带标签的数据
44
00:01:27,270 --> 00:01:28,270
So, so far, we'll be treating
所以 我们要考虑的
45
00:01:28,420 --> 00:01:29,870
anomaly detection as an
异常检测问题
46
00:01:30,310 --> 00:01:31,770
unsupervised learning problem, using
是一个非监督问题
47
00:01:32,210 --> 00:01:33,560
unlabeled data.
使用的是无标签数据
48
00:01:34,010 --> 00:01:35,190
But if you have some labeled
但如果你有一些
49
00:01:35,560 --> 00:01:37,390
data that specifies what
带标签的数据
50
00:01:37,700 --> 00:01:39,570
are some anomalous examples, and
能够指明哪些是异常样本
51
00:01:39,670 --> 00:01:42,030
what are some non-anomalous examples, then
哪些是非异常样本
52
00:01:42,470 --> 00:01:43,350
this is what we actually
那么这就是我们要找的
53
00:01:43,630 --> 00:01:45,670
think of as the standard way of evaluating an anomaly detection algorithm.
能够评价异常检测算法的标准方法
54
00:01:45,820 --> 00:01:49,020
So taking the
还是以
55
00:01:49,300 --> 00:01:50,580
aircraft engine example again.
飞机发动机为例
56
00:01:51,010 --> 00:01:52,680
Let's say that, you know, we have some
现在假如你有了一些
57
00:01:53,070 --> 00:01:55,840
labeled data of just a few anomalous
带标签数据
58
00:01:56,330 --> 00:01:57,890
examples of some aircraft engines
也就是有异常的飞机引擎的样本
59
00:01:58,400 --> 00:02:00,780
that were manufactured in the past that turn out to be anomalous.
这批制造的飞机发动机是有问题的
60
00:02:01,520 --> 00:02:02,360
Turned out to be flawed or strange in some way.
可能有瑕疵 或者别的什么问题
61
00:02:02,400 --> 00:02:04,130
Let's say we
同时我们还有
62
00:02:04,360 --> 00:02:05,750
also have some non-anomalous
一些无异常的样本
63
00:02:06,100 --> 00:02:07,810
examples, so some
也就是一些
64
00:02:08,050 --> 00:02:10,200
perfectly okay examples.
完全没问题的样本
65
00:02:10,940 --> 00:02:12,050
I'm going to use y equals 0
我用y=0来表示那些
66
00:02:12,110 --> 00:02:13,600
to denote the normal or the
完全正常
67
00:02:13,790 --> 00:02:15,470
non-anomalous example and
没有问题的样本
68
00:02:15,530 --> 00:02:21,450
y equals 1 to denote the anomalous examples.
用y=1来代表那些异常样本
69
00:02:22,450 --> 00:02:24,670
The process of developing and evaluating an anomaly
那么异常检测算法的推导和评价方法
70
00:02:25,130 --> 00:02:26,450
detection algorithm is as follows.
如下所示
71
00:02:27,500 --> 00:02:28,300
We're going to think of it as
我们先考虑
72
00:02:28,560 --> 00:02:29,830
a training set and talk
训练样本
73
00:02:30,000 --> 00:02:31,310
about the cross validation and test
交叉验证和测试集等下考虑
74
00:02:31,440 --> 00:02:32,580
sets later, but the training set we usually
对于训练集
75
00:02:33,280 --> 00:02:34,000
think of as still the unlabeled
我们还是看成无标签的
76
00:02:35,040 --> 00:02:36,180
training set.
训练集
77
00:02:36,510 --> 00:02:37,250
And so this is our large
所以这些就是
78
00:02:37,560 --> 00:02:39,580
collection of normal, non-anomalous
所有正常的
79
00:02:40,190 --> 00:02:41,130
or not anomalous examples.
或者说无异常样本的集合
80
00:02:42,400 --> 00:02:43,530
And usually we think
通常来讲
81
00:02:43,690 --> 00:02:44,750
of this as being non-anomalous,
我们把这些都看成无异常的
82
00:02:45,010 --> 00:02:46,490
but it's actually okay even
但可能有一些异常的
83
00:02:46,740 --> 00:02:48,660
if a few anomalies slip into
也被分到你的训练集里
84
00:02:48,660 --> 00:02:51,240
your unlabeled training set.
这也没关系
85
00:02:51,420 --> 00:02:52,100
And next we are going to
接下来我们要
86
00:02:52,310 --> 00:02:53,830
define a cross validation set
定义交叉验证集
87
00:02:54,100 --> 00:02:55,510
and a test set, with which
和测试集
88
00:02:55,750 --> 00:02:58,360
to evaluate a particular anomaly detection algorithm.
通过这两个集合来评价一个异常检测算法
89
00:02:59,230 --> 00:03:00,850
So, specifically, for both the
具体来说
90
00:03:01,000 --> 00:03:01,960
cross validation and test sets, we're
对交叉验证集和测试集
91
00:03:02,080 --> 00:03:03,590
going to assume that, you know, we
我们将假设
92
00:03:03,800 --> 00:03:05,030
can include a few examples
我们的交叉验证集
93
00:03:05,670 --> 00:03:06,720
in the cross validation set and
和测试集中
94
00:03:06,900 --> 00:03:08,150
the test set that contain examples
有一些样本
95
00:03:08,910 --> 00:03:09,660
that are known to be anomalous.
这些样本都是异常的
96
00:03:10,200 --> 00:03:11,410
So the test set, say
所以比如测试集
97
00:03:11,950 --> 00:03:13,270
we have a few examples with
里面的样本就是
98
00:03:13,340 --> 00:03:14,770
y equals 1 that
带标签y=1的
99
00:03:15,040 --> 00:03:17,470
correspond to anomalous aircraft engines.
这表示有异常的飞机引擎
100
00:03:18,640 --> 00:03:19,800
So here's a specific example.
这是一个具体的例子
101
00:03:20,930 --> 00:03:23,120
Let's say that, altogether, this
假如说
102
00:03:23,280 --> 00:03:24,990
is the data that we have.
这是我们总的数据
103
00:03:25,260 --> 00:03:27,410
We have manufactured 10,000 examples
我们制造了10000个引擎
104
00:03:28,130 --> 00:03:29,140
of engines that, as far
作为样本
105
00:03:29,450 --> 00:03:30,740
as we know, were perfectly normal,
就我们所知 这些样本
106
00:03:31,220 --> 00:03:33,110
perfectly good aircraft engines.
都是正常没有问题的飞机引擎
107
00:03:34,060 --> 00:03:35,240
And again, it turns out to be okay even
同样地 如果有一小部分
108
00:03:35,560 --> 00:03:37,310
if a few flawed engines
有问题的引擎
109
00:03:37,740 --> 00:03:39,400
slip into the set of
也被混入了这10000个样本
110
00:03:39,550 --> 00:03:40,860
10,000, that's actually okay, but
别担心 没有关系
111
00:03:40,970 --> 00:03:41,970
we kind of assumed that the vast
我们假设
112
00:03:42,410 --> 00:03:44,300
majority of these
这10000个样本中
113
00:03:44,500 --> 00:03:47,660
10,000 examples are, you know, good and normal non-anomalous engines.
大多数都是好的 没有问题的引擎
114
00:03:48,480 --> 00:03:50,940
And let's say that, you know, historically, however
而且实际上 从过去的经验来看
115
00:03:51,200 --> 00:03:52,120
long we've been running our manufacturing
无论是制造了多少年
116
00:03:52,650 --> 00:03:54,130
plant, let's say that
引擎的工厂
117
00:03:54,480 --> 00:03:55,930
we end up getting features,
我们都会得到这些数据
118
00:03:56,440 --> 00:03:57,970
getting 24 to 28
都会得到大概24到28个
119
00:03:58,240 --> 00:04:00,180
anomalous engines as well.
有问题的引擎
120
00:04:01,120 --> 00:04:03,030
And for a pretty typical application of
对于异常检测的典型应用来说
121
00:04:03,310 --> 00:04:05,490
anomaly detection, you know, the number of anomalous
异常样本的个数
122
00:04:06,740 --> 00:04:08,090
examples, that is with y equals
也就是y=1的样本
123
00:04:08,760 --> 00:04:10,650
1, we may have anywhere from, you know, 20 to 50.
基本上很多都是20到50个
124
00:04:10,820 --> 00:04:12,920
It would be a pretty typical
通常这个范围
125
00:04:13,360 --> 00:04:14,570
range of examples, number of
对y=1的样本数量
126
00:04:14,830 --> 00:04:16,710
examples that we have with y equals 1.
还是很常见的
127
00:04:16,910 --> 00:04:17,730
And usually we will have a
并且通常我们的
128
00:04:17,860 --> 00:04:20,000
much larger number of good examples.
正常样本的数量要大得多
129
00:04:21,810 --> 00:04:23,150
So, given this data set,
有了这组数据
130
00:04:24,180 --> 00:04:25,410
a fairly typical way to split
把数据分为训练集
131
00:04:25,850 --> 00:04:27,150
it into the training set,
交叉验证集和测试集
132
00:04:27,430 --> 00:04:29,210
cross validation set and test set would be as follows.
一种典型的分法如下
133
00:04:30,390 --> 00:04:31,880
Let's take 10,000 good aircraft
我们把这10000个正常的引擎
134
00:04:32,360 --> 00:04:34,060
engines and put 6,000
放6000个到
135
00:04:34,260 --> 00:04:37,100
of that into the unlabeled training set.
无标签的训练集中
136
00:04:37,620 --> 00:04:38,800
So, I'm calling this an unlabeled training
我叫它“无标签训练集”
137
00:04:39,130 --> 00:04:40,050
set but all of these examples
但其实所有这些样本
138
00:04:40,640 --> 00:04:42,510
are really ones that correspond to
实际上都对应
139
00:04:42,810 --> 00:04:44,380
y equals 0, as far as we know.
y=0的情况 至少据我们所知是这样
140
00:04:45,300 --> 00:04:46,350
And so, we will use this to
所以 我们要用它们
141
00:04:46,520 --> 00:04:48,840
fit p of x, right.
来拟合p(x)
142
00:04:49,150 --> 00:04:49,850
So, we will use these 6000 engines
也就是我们用这6000个引擎
143
00:04:50,350 --> 00:04:51,180
to fit p of x, which
来拟合p(x)
144
00:04:51,360 --> 00:04:52,190
is that p of x
也就是p 括号
145
00:04:52,420 --> 00:04:53,930
one parametrized by Mu
x1 参数是μ1
146
00:04:54,330 --> 00:04:56,380
1, sigma squared 1, up
σ1的平方
147
00:04:56,540 --> 00:04:57,700
to p of Xn parametrized
一直到p(xn; μn, σn^2)
148
00:04:58,370 --> 00:04:59,570
by Mu N sigma squared
参数是μn σn的平方
149
00:05:00,790 --> 00:05:02,300
n. And so it would be these
因此我们就是要用这
150
00:05:02,500 --> 00:05:03,930
6,000 examples that we would
6000个样本
151
00:05:04,110 --> 00:05:05,370
use to estimate the parameters
来估计参数
152
00:05:05,590 --> 00:05:06,760
Mu 1, sigma squared 1,
μ1, σ1的平方
153
00:05:07,140 --> 00:05:08,960
up to Mu N, sigma
一直到
154
00:05:09,200 --> 00:05:10,280
squared N. And so that's our training
μn, σn的平方
155
00:05:10,500 --> 00:05:11,960
set of all, you know,
这就是训练集中的好的样本
156
00:05:12,150 --> 00:05:13,980
good, or the vast majority of good examples.
或者说大多数好的样本
157
00:05:15,430 --> 00:05:16,950
Next we will take our good
然后 我们取一些
158
00:05:17,140 --> 00:05:18,380
aircraft engines and put some
好的飞机引擎样本
159
00:05:18,660 --> 00:05:19,470
number of them in a cross
放一些到交叉验证集
160
00:05:19,580 --> 00:05:21,320
validation set plus some number
再放一些到
161
00:05:21,570 --> 00:05:22,970
of them in the test sets.
测试集中
162
00:05:23,280 --> 00:05:24,300
So 6,000 plus 2,000 plus 2,000,
正好6000加2000加2000
163
00:05:24,480 --> 00:05:25,470
that's how we split up our
这10000个样本
164
00:05:25,740 --> 00:05:28,820
10,000 good aircraft engines.
就这样进行分割了
165
00:05:29,260 --> 00:05:31,460
And then we also have 20
同时 我们还有20个
166
00:05:31,930 --> 00:05:33,380
flawed aircraft engines, and we'll
异常的发动机样本
167
00:05:33,490 --> 00:05:34,890
take that and maybe split it
同样也把它们进行一个分割
168
00:05:35,160 --> 00:05:36,100
up, you know, put ten of them in
放10个到验证集中
169
00:05:36,200 --> 00:05:37,230
the cross validation set and put
剩下10个
170
00:05:37,370 --> 00:05:39,560
ten of them in the test sets.
放入测试集中
171
00:05:39,850 --> 00:05:41,320
And in the next slide
在下一张幻灯片中
172
00:05:41,660 --> 00:05:42,460
we will talk about how to
我们将看到如何用
173
00:05:42,750 --> 00:05:43,800
actually use this to evaluate
这些分好的数据
174
00:05:44,520 --> 00:05:46,330
the anomaly detection algorithm.
来评价异常检测算法
175
00:05:48,130 --> 00:05:49,140
So what I have
好的
176
00:05:49,220 --> 00:05:50,610
just described here is, you
刚才我介绍的这些内容
177
00:05:50,790 --> 00:05:52,300
know, probably a recommended
可能是一种
178
00:05:52,440 --> 00:05:55,290
good way of splitting the labeled and unlabeled examples.
比较推荐的方法来划分带标签和无标签的数据
179
00:05:55,820 --> 00:05:57,970
The good and the flawed aircraft engines.
来划分好的和坏的飞机引擎样本
180
00:05:58,480 --> 00:06:00,380
Where we use like
我们使用了
181
00:06:00,730 --> 00:06:01,650
a 60, 20, 20% split for
6:2:2的比例
182
00:06:01,800 --> 00:06:03,350
the good engines and we take
来分配好的引擎样本
183
00:06:03,570 --> 00:06:04,780
the flawed engines, and we
而坏的引擎样本
184
00:06:04,910 --> 00:06:05,750
put them just in the cross
我们只把它们放到
185
00:06:05,870 --> 00:06:06,940
validation set, and just in
交叉验证集和测试集中
186
00:06:07,030 --> 00:06:09,200
the test set, then we'll see in the next slide why that's the case.
在下一页中我们将讲解这样分的理由
187
00:06:10,370 --> 00:06:12,080
Just as an aside, if you
顺便说一下
188
00:06:12,360 --> 00:06:13,360
look at how people apply anomaly
如果你看到别人应用
189
00:06:13,750 --> 00:06:15,400
detection algorithms, sometimes you see
异常检测的算法时
190
00:06:15,510 --> 00:06:16,980
other people split the data differently as well.
有时候也可能会有不同的分配方法
191
00:06:17,460 --> 00:06:19,400
So, another alternative, this is really
另一种分配数据的方法是这样的
192
00:06:19,660 --> 00:06:21,290
not a recommended alternative, but
其实我真的不推荐这么分
193
00:06:21,470 --> 00:06:23,650
some people want to
但就有人喜欢这么分
194
00:06:23,790 --> 00:06:24,770
take your 10,000 good engines and maybe put 6,000
也就是把10000个好的引擎分出6000个
195
00:06:24,820 --> 00:06:26,020
of them in your training set
放到训练集中
196
00:06:26,320 --> 00:06:27,130
and then put the same
然后把剩下的4000个样本
197
00:06:27,650 --> 00:06:28,800
4000 in the cross validation
既用作交叉验证集
198
00:06:30,380 --> 00:06:31,020
set and the test set.
也用作测试集
199
00:06:31,170 --> 00:06:32,030
And so, you know, we like to think of the cross
通常来说我们要交叉验证集
200
00:06:32,360 --> 00:06:33,340
validation set and the
和测试集当作是