srt/18 - 3 - Getting Lots of Data and Artificial Data (16 min).srt

﻿1
00:00:00,090 --> 00:00:01,270
I've seen over and over that
我多次看到(字幕翻译：中国海洋大学，刘竞)

2
00:00:01,570 --> 00:00:03,160
one of the most reliable ways to
一个最可靠的得到

3
00:00:03,300 --> 00:00:04,800
get a high performance machine learning
一个高性能的机器学习

4
00:00:05,040 --> 00:00:06,170
system is to take
系统的方法是

5
00:00:06,550 --> 00:00:07,860
a low bias learning algorithm
采取一个低偏差的机器学习算法

6
00:00:08,750 --> 00:00:10,220
and to train it on a massive training set.
并且在用大量的数据集去训练它。

7
00:00:11,230 --> 00:00:12,830
But where did you get so much training data from?
但是你该从哪里去获得这么多的训练数据呢？

8
00:00:13,510 --> 00:00:14,440
Turns out that the machine earnings
机器收益证明了

9
00:00:14,820 --> 00:00:16,520
there's a fascinating idea called artificial
有一个非常吸引人的思想叫做

10
00:00:17,220 --> 00:00:19,000
data synthesis, this doesn't
人工数据合成，这种思想并不

11
00:00:19,370 --> 00:00:20,740
apply to every single problem, and
适用于每个单独的问题，当把它

12
00:00:20,980 --> 00:00:22,120
to apply to a specific
运用于一个具体的特定

13
00:00:22,360 --> 00:00:25,060
problem, often takes some thought and innovation and insight.
问题时，需要经过一些思考，改进和洞察力。

14
00:00:25,780 --> 00:00:27,170
But if this idea applies
但是假如这个思想应用在

15
00:00:27,580 --> 00:00:29,120
to your machine, only problem, it
在你的机器上，唯一的问题是，

16
00:00:29,230 --> 00:00:30,270
can sometimes be a an
有时，为你的学习算法

17
00:00:30,510 --> 00:00:31,600
easy way to get a
获得一个巨大的

18
00:00:31,680 --> 00:00:33,470
huge training set to give to your learning algorithm.
训练集将是很容易的。

19
00:00:33,900 --> 00:00:35,520
The idea of artificial
人工数据合成

20
00:00:36,230 --> 00:00:38,410
data synthesis comprises of two
包含两个

21
00:00:38,590 --> 00:00:40,210
variations, main the first
变化形式，第一个最主要的

22
00:00:40,650 --> 00:00:41,940
is if we are essentially creating
是我们是否有必要

23
00:00:42,520 --> 00:00:44,940
data from [xx], creating new data from scratch.
从[xx]中去生成数据，也就是从头开始去创新数据。

24
00:00:45,380 --> 00:00:46,700
And the second is if
第二种是我们

25
00:00:46,930 --> 00:00:48,350
we already have it's small
是否已经有了算法的一小

26
00:00:48,590 --> 00:00:49,970
label training set and we
部分标签训练集

27
00:00:50,210 --> 00:00:51,490
somehow have amplify that training
并且以某种方式扩充了训练集

28
00:00:51,840 --> 00:00:52,680
set or use a small training
或者是把一小部分训练集

29
00:00:52,980 --> 00:00:54,390
set to turn that into
转换成了

30
00:00:54,660 --> 00:00:56,290
a larger training set and in
一个较大的数据集

31
00:00:56,450 --> 00:00:58,120
this video we'll go over both those ideas.
在这一个视频中，我们将仔细学习这两种思路。

32
00:01:00,350 --> 00:01:02,220
To talk about the artificial data
在讲到人工数据

33
00:01:02,440 --> 00:01:04,030
synthesis idea, let's use
合成思想时，让我们

34
00:01:04,330 --> 00:01:06,930
the character portion of
借用一下成像光学字符识别

35
00:01:07,090 --> 00:01:08,470
the photo OCR pipeline, we
管道中的字符部分，

36
00:01:08,690 --> 00:01:09,710
want to take it's input image
我们想采用它的输入图像

37
00:01:10,060 --> 00:01:11,370
and recognize what character it is.
去识别出它是什么字符。

38
00:01:13,330 --> 00:01:14,690
If we go out and collect
假如我们出去收集到

39
00:01:14,880 --> 00:01:16,270
a large label data set,
一个大的标签数据集，

40
00:01:16,890 --> 00:01:17,980
here's what it is and what it look like.
也就是它是什么和它像什么。

41
00:01:18,580 --> 00:01:21,770
For this particular example, I've chosen a square aspect ratio.
在这一个特殊的例子中，我选择了一个正方形的长宽比。

42
00:01:22,130 --> 00:01:23,250
So we're taking square image patches.
所以我们采用正方形的图像块。

43
00:01:24,180 --> 00:01:25,110
And the goal is to take
我们的目标是获得

44
00:01:25,770 --> 00:01:27,420
an image patch and recognize the
一个图像块并且识别出

45
00:01:27,530 --> 00:01:29,270
character in the middle of that image patch.
图像块中央的字符。

46
00:01:31,090 --> 00:01:31,990
And for the sake of simplicity,
为了简单，

47
00:01:32,660 --> 00:01:33,740
I'm going to treat these images
我打算把这些图像当做

48
00:01:34,240 --> 00:01:36,380
as grey scale images, rather than color images.
灰度图像来处理，不是把它们当做彩色图像。

49
00:01:36,870 --> 00:01:38,550
It turns out that using color
事实证明把它们当做彩色图像来处理

50
00:01:38,930 --> 00:01:41,180
doesn't seem to help that much for this particular problem.
对于这个具体的问题而言看起来帮助不大。

51
00:01:42,190 --> 00:01:43,530
So given this image patch, we'd
对于这个给定的图像块，

52
00:01:43,660 --> 00:01:44,890
like to recognize that that's a
我们会把它识别为"T"。

53
00:01:45,010 --> 00:01:46,230
T. Given this image patch,
对于这个图像块，

54
00:01:46,550 --> 00:01:47,920
we'd like to recognize that it's an 'S'.
我们会把它识别为一个"S".

55
00:01:49,540 --> 00:01:50,740
Given that image patch we
对于这个图像块，我们

56
00:01:50,850 --> 00:01:52,950
would like to recognize that as an 'I' and so on.
会把它识别为一个“I”，等等。

57
00:01:54,110 --> 00:01:55,310
So all of these, our
因此，对于所有

58
00:01:55,450 --> 00:01:57,240
examples of row images, how
这些行图像的例子，

59
00:01:57,380 --> 00:01:59,460
can we come up with a much larger training set?
我们该如何得到一个更大的训练集呢？

60
00:02:00,000 --> 00:02:01,580
Modern computers often have a
现代计算机通常有一个

61
00:02:01,640 --> 00:02:03,700
huge font library and
庞大的字体库，

62
00:02:03,890 --> 00:02:05,330
if you use a word processing
并且假如你使用一个字处理

63
00:02:05,950 --> 00:02:07,090
software, depending on what word
软件，主要看你使用的

64
00:02:07,240 --> 00:02:08,580
processor you use, you might
是什么字处理软件，你可能

65
00:02:08,800 --> 00:02:09,980
have all of these fonts and
有所有这些字体，

66
00:02:10,120 --> 00:02:12,490
many, many more Already stored inside.
并且还有更多的已经存储在里面了。

67
00:02:12,950 --> 00:02:14,350
And, in fact, if you go different websites, there
并且，事实上，假如你去不同的网站，

68
00:02:14,680 --> 00:02:16,280
are, again, huge, free font
网上还有其它的大的，

69
00:02:16,690 --> 00:02:18,200
libraries on the internet we
免费的字体库，从那里

70
00:02:18,370 --> 00:02:19,960
can download many, many different
我们能下载许多许多不同

71
00:02:20,250 --> 00:02:22,580
types of fonts, hundreds or perhaps thousands of different fonts.
类型的字体，几百甚至是几千种不同的字体。

72
00:02:23,960 --> 00:02:25,180
So if you want more
所以，假如你想

73
00:02:25,570 --> 00:02:27,020
training examples, one thing you
得到更多训练实例，一件

74
00:02:27,100 --> 00:02:28,340
can do is just take
你可以做的事情正是

75
00:02:28,870 --> 00:02:30,220
characters from different fonts
得到不同的字体的字符，

76
00:02:31,240 --> 00:02:32,870
and paste these characters against
并且把这些字符粘贴到

77
00:02:33,290 --> 00:02:35,890
different random backgrounds.
任意不同的背景下。

78
00:02:36,730 --> 00:02:39,500
So you might take this ----  and paste that c against a random background.
所以，你可以得到这些字符C并且把它粘在任意的背景下。

79
00:02:40,680 --> 00:02:41,640
If you do that you now have
假如你做了这些，那么你现在

80
00:02:42,060 --> 00:02:43,830
a training example of an
就有了一个关于

81
00:02:44,080 --> 00:02:45,260
image of the character C.
字符C的图像的训练样例。

82
00:02:46,360 --> 00:02:47,500
So after some amount of
所以在完成一定数量的

83
00:02:47,570 --> 00:02:48,920
work, you know this,
工作之后，你就会发现

84
00:02:48,980 --> 00:02:49,710
and it is a little bit of
合成这些逼真的数据

85
00:02:49,830 --> 00:02:51,760
work to synthisize realistic looking data.
只有很少的工作要做。

86
00:02:52,180 --> 00:02:53,080
But after some amount of work,
在完成一定量的工作之后，

87
00:02:53,700 --> 00:02:56,130
you can get a synthetic training set like that.
你可以得到像那样的合成的训练集。

88
00:02:57,180 --> 00:02:59,910
Every image shown on the right was actually a synthesized image.
在右侧显示的图像实际上是一个合成的图像。

89
00:03:00,360 --> 00:03:02,080
Where you take a font,
当你采用一个字体时，

90
00:03:02,810 --> 00:03:04,240
maybe a random font downloaded off
可能是一个从网上下载的字体，

91
00:03:04,400 --> 00:03:05,680
the web and you paste
你把基于这种字体的

92
00:03:06,160 --> 00:03:07,320
an image of one character
一个字符的图像或者

93
00:03:07,800 --> 00:03:08,870
or a few characters from that font
是几个字符的图像

94
00:03:09,570 --> 00:03:11,440
against this other random background image.
粘贴到另一个任意的背景图像下。

95
00:03:12,140 --> 00:03:12,840
And then apply maybe a little
可以应用一点

96
00:03:13,540 --> 00:03:15,160
blurring operators  -----of app
模糊算子，比如

97
00:03:15,680 --> 00:03:17,380
finder, distortions that app
应用程序取景器，失真，

98
00:03:17,620 --> 00:03:18,650
finder, meaning just the sharing
应用程序取景器，意味着共享，

99
00:03:19,350 --> 00:03:20,740
and scaling and little rotation
缩放和轻微的

100
00:03:21,000 --> 00:03:22,260
operations and if you
旋转操作，假如你

101
00:03:22,370 --> 00:03:23,330
do that you get a synthetic
做了这些，你得到一个

102
00:03:23,580 --> 00:03:25,520
training set, on what the one shown here.
合成的训练集，就是这里显示的这个。

103
00:03:26,510 --> 00:03:28,050
And this is work,
这种工作，

104
00:03:28,530 --> 00:03:29,640
grade, it is, it takes
它也是有好有坏的，

105
00:03:29,970 --> 00:03:31,460
thought at work, in order to
为了使合成的数据更逼真，

106
00:03:31,700 --> 00:03:33,250
make the synthetic data look realistic,
在工作中是需要花费心思的，

107
00:03:34,020 --> 00:03:34,710
and if you do a sloppy
如果在生成合成数据的工作中

108
00:03:35,120 --> 00:03:36,200
job in terms of how
你没有认真去做，

109
00:03:36,250 --> 00:03:38,910
you create the synthetic data then it actually won't work well.
那么所生成的合成数据实际上是不能有效工作的。

110
00:03:39,620 --> 00:03:40,600
But if you look at
但是，假如你看到合成的

111
00:03:40,790 --> 00:03:43,940
the synthetic data looks remarkably similar to the real data.
数据与真实数据非常相似，

112
00:03:45,120 --> 00:03:46,850
And so by using synthetic data
那么使用合成的数据，

113
00:03:47,340 --> 00:03:48,550
you have essentially an unlimited
你就必然能为

114
00:03:48,990 --> 00:03:50,970
supply of training examples for
你的人工训练合成提供

115
00:03:51,310 --> 00:03:53,060
artificial training synthesis And
无限的训练数据样例。

116
00:03:53,150 --> 00:03:54,110
so, if you use this
因此，假如你使用

117
00:03:54,330 --> 00:03:55,820
source synthetic data, you have
这个合成数据源，你就必然

118
00:03:56,150 --> 00:03:58,100
essentially unlimited supply of
就可以利用这些无限的

119
00:03:58,560 --> 00:04:00,000
label data to create
标签数据为字符识别

120
00:04:00,140 --> 00:04:01,610
a improvised learning algorithm
问题生成一个

121
00:04:02,300 --> 00:04:03,990
for the character recognition problem.
学习算法。

122
00:04:05,120 --> 00:04:06,540
So this is an example of
所以这是一个

123
00:04:07,000 --> 00:04:08,500
artificial data synthesis where youre
数据合成的例子，

124
00:04:09,040 --> 00:04:10,870
basically creating new data from
你基本是从零开始

125
00:04:11,080 --> 00:04:13,780
scratch, you just generating brand new images from scratch.
产生数据，也就是你从零开始产生新的图像。

126
00:04:14,880 --> 00:04:16,450
The other main approach to artificial data
另一个主要的人工

127
00:04:16,710 --> 00:04:18,210
synthesis is where you
数据合成方式是

128
00:04:18,370 --> 00:04:19,610
take a examples that you
你使用一个当前

129
00:04:19,740 --> 00:04:20,780
currently have, that we take
已经有的样例，也就是说

130
00:04:21,020 --> 00:04:22,430
a real example, maybe from
我们有一个真实的样例，

131
00:04:22,700 --> 00:04:24,130
real image, and you create
可能是一个真实的图像，

132
00:04:24,770 --> 00:04:26,130
additional data, so as to
你产生附加的数据，

133
00:04:26,380 --> 00:04:27,900
amplify your training set.
以扩充你的训练集。

134
00:04:28,070 --> 00:04:28,810
So here is an image of a compared
这是一个真实图像

135
00:04:28,910 --> 00:04:30,490
to a from a real image,
的对比图像,

136
00:04:31,410 --> 00:04:32,550
not a synthesized image, and
不是一个合成的图像，

137
00:04:32,630 --> 00:04:33,790
I have overlayed this with
我在上面覆盖了网格线

138
00:04:33,880 --> 00:04:35,750
the grid lines just for the purpose of illustration.
只是为了说明问题。

139
00:04:36,430 --> 00:04:36,880
Actually have these ----.
实际上有这许多。

140
00:04:36,970 --> 00:04:39,030
So what you
所以你能做

141
00:04:39,100 --> 00:04:40,110
can do is then take this
的是把字母放在

142
00:04:40,480 --> 00:04:41,500
alphabet here, take this image
这里，向这图像中

143
00:04:42,240 --> 00:04:43,760
and introduce artificial warpings[sp?]
引入一些人工的拉伸

144
00:04:44,290 --> 00:04:45,810
or artificial distortions into the
或者是一些

145
00:04:46,040 --> 00:04:47,030
image so they can
人工的失真，

146
00:04:47,220 --> 00:04:48,240
take the image a and turn
经过这些操作，可以把

147
00:04:48,430 --> 00:04:50,060
that into 16 new examples.
字母A变成这16个新的样例。

148
00:04:51,110 --> 00:04:52,000
So in this way you can
所以采用这种办法，

149
00:04:52,450 --> 00:04:53,740
take a small label training set
你可以得到一个小的标签训练集

150
00:04:54,090 --> 00:04:55,360
and amplify your training set
并且你扩充你的训练集，

151
00:04:56,180 --> 00:04:57,190
to suddenly get a lot
突然得到

152
00:04:57,300 --> 00:05:00,020
more examples, all of it.
更多样例，所有的这些图像。

153
00:05:01,210 --> 00:05:02,360
Again, in order to do
此外，在这一应用中

154
00:05:02,560 --> 00:05:03,940
this for application, it does
所做的这些，

155
00:05:04,120 --> 00:05:05,060
take thought and it does
需要花费心思，

156
00:05:05,140 --> 00:05:06,270
take insight to figure out
需要洞察力去

157
00:05:06,430 --> 00:05:07,840
what our reasonable sets of
判断出合理的失真

158
00:05:08,420 --> 00:05:09,460
distortions, or whether these
操作集，或者是这些操作

159
00:05:09,720 --> 00:05:11,000
are ways that amplify and multiply
是否是扩充和增加

160
00:05:11,470 --> 00:05:12,760
your training set, and for
训练集的方法，

161
00:05:13,070 --> 00:05:15,130
the specific example of
对于字符识别这一

162
00:05:15,260 --> 00:05:17,170
character recognition, introducing these
具体的例子，引入这些

163
00:05:17,480 --> 00:05:18,310
warping seems like a natural
拉伸看起来是一个

164
00:05:18,780 --> 00:05:19,910
choice, but for a
很自然的选择，但是对于不同

165
00:05:20,090 --> 00:05:21,970
different learning machine application, there may
机器学习应用来说，可能

166
00:05:22,080 --> 00:05:24,180
be different the distortions that might make more sense.
另外一些不同的失真将会更合理。

167
00:05:24,860 --> 00:05:25,600
Let me just show one example
让我给大家展示一个

168
00:05:26,180 --> 00:05:28,750
from the totally different domain of speech recognition.
完全不同的语音识别领域的问题。

169
00:05:30,230 --> 00:05:31,480
So the speech recognition, let's say
对于语音识别，假如说

170
00:05:31,580 --> 00:05:33,450
you have audio clips and you
你有音频片段，

171
00:05:33,600 --> 00:05:35,010
want to learn from the audio
你想从中

172
00:05:35,350 --> 00:05:37,240
clip to recognize what were
识别出哪些单词出现在了

173
00:05:37,460 --> 00:05:38,780
the words spoken in that clip.
语音片段中。

174
00:05:39,510 --> 00:05:41,340
So let's see how one labeled training example.
因此，让我们来看看是如何给训练样例加标签的。

175
00:05:42,290 --> 00:05:43,190
So let's say you have one
那么，让我们说假定你已经有了一个

176
00:05:43,400 --> 00:05:45,000
labeled training example, of someone
加了标签的训练样例，就是

177
00:05:45,330 --> 00:05:46,660
saying a few specific words.
某个人在说一些特定的单词。

178
00:05:46,860 --> 00:05:48,720
So let's play that audio clip here.
因此，让我们播放一下这个语音片段。

179
00:05:49,150 --> 00:05:51,230
0 -1-2-3-4-5.
0，1，2，3，4，5.

180
00:05:51,570 --> 00:05:53,810
Alright, so someone
好吧，有人在

181
00:05:54,220 --> 00:05:55,110
counting from 0 to 5,
从0数到5.

182
00:05:55,450 --> 00:05:57,180
and so you want to
然后你想要应用

183
00:05:57,290 --> 00:05:58,460
try to apply a learning algorithm
一个学习算法去试图

184
00:05:59,380 --> 00:06:01,320
to try to recognize the words said in that.
识别出那个人说了哪些单词。

185
00:06:02,040 --> 00:06:04,030
So, how can we amplify the data set?
那么，我们该如何扩充数据集呢？

186
00:06:04,390 --> 00:06:05,340
Well, one thing we do is
很好，我们可以做的一件事情就是

187
00:06:06,020 --> 00:06:09,180
introduce additional audio distortions into the data set.
引入附加的语音失真到数据集中。

188
00:06:09,970 --> 00:06:10,960
So here I'm going to
所以，我将加入

189
00:06:11,640 --> 00:06:14,700
add background sounds to simulate a bad cell phone connection.
加入一些背景声音去模拟一个较差的手机通话连接。

190
00:06:15,360 --> 00:06:16,800
When you hear beeping sounds, that's
当你听到蜂鸣声，实现上

191
00:06:16,980 --> 00:06:17,710
actually part of the audio
这是音频记录的一部分，

192
00:06:17,740 --> 00:06:20,350
track, that's nothing wrong with the speakers, I'm going to play this now.
不是说话者的错误，现在我开始播放了。

193
00:06:20,580 --> 00:06:21,379
0-1-2-3-4-5.
0，1，2，3，4，5.

194
00:06:21,380 --> 00:06:22,260
Right, so you can listen
好了，你只可以听到

195
00:06:22,640 --> 00:06:24,890
to that sort of audio clip and
那种音频片段， 并且

196
00:06:25,720 --> 00:06:28,600
recognize the sounds,
识别出声音，

197
00:06:28,960 --> 00:06:30,800
that seems like another useful training
这看起来像是另外一种

198
00:06:31,370 --> 00:06:33,230
example to have, here's another example, noisy background.
值得拥有的训练样例。这是另外一种例子，吵杂的背景。

199
00:06:34,890 --> 00:06:36,870
Zero, one, two, three
0，1，2，3

200
00:06:37,560 --> 00:06:39,060
four five you know
4，5，在背景中还有

201
00:06:39,090 --> 00:06:40,280
of cars driving past, people walking
行使的汽车经过，人在走路。

202
00:06:40,580 --> 00:06:42,200
in the background, here's another
这是另外

203
00:06:42,450 --> 00:06:43,880
one, so taking the original
一个，使用原来的干净的音频片段，

204
00:06:44,430 --> 00:06:45,980
clean audio clip so
原来那个音频片段

205
00:06:46,090 --> 00:06:47,810
taking the clean audio of
有人在说

206
00:06:47,990 --> 00:06:48,960
someone saying 0 1 2 3
0,1,2,3

207
00:06:49,090 --> 00:06:50,490
4 5 we can then automatically
4,5,然后，我们能自动

208
00:06:51,790 --> 00:06:54,090
synthesize these additional training
合成这些附加的

209
00:06:54,470 --> 00:06:55,850
examples and thus amplify
训练集，把单个的训练集

210
00:06:56,410 --> 00:06:57,860
one training example into maybe four different training examples.
扩展到可能四种不同的训练样例。

211
00:07:00,110 --> 00:07:00,940
So let me play this final
所以让我也播放一下最后

212
00:07:01,300 --> 00:07:03,180
example, as well.
这个样例。

213
00:07:03,340 --> 00:07:07,180
0-1 3-4-5 So by
0，1，3，4，5. 因此，通过

214
00:07:07,530 --> 00:07:08,510
taking just one labelled example,
使用一个已经加了标签的样例，

215
00:07:09,000 --> 00:07:10,260
we have to go through the effort
我们不得不通过努力

216
00:07:10,360 --> 00:07:11,760
to collect just one labelled example
去以收集另一个带

217
00:07:11,950 --> 00:07:13,270
fall of the 01205, and by
标签的样例01205，通过

218
00:07:14,140 --> 00:07:16,520
synthesizing additional distortions,
合成附加的失真，

219
00:07:17,290 --> 00:07:18,560
by introducing different background sounds,
通过引入不同的背景声音，

220
00:07:19,000 --> 00:07:20,240
we've now multiplied this one
我们把一个样例扩充

221
00:07:20,370 --> 00:07:21,810
example into many more examples.
成许多更多的样例。

222
00:07:23,420 --> 00:07:24,480
Much work by just automatically
最的做多的工作就是自动

223
00:07:25,270 --> 00:07:27,090
adding these different background sounds
添加许多不同的背景声音

224
00:07:27,680 --> 00:07:30,510
to the clean audio Just
到干净的音频中。对于引入失真

225
00:07:30,740 --> 00:07:31,980
one word of warning about synthesizing
去合成数据，

226
00:07:33,190 --> 00:07:35,220
data by introducing distortions: if
只有一点要提醒的是

227
00:07:35,310 --> 00:07:36,630
you try to do this
假如你尝试

228
00:07:36,810 --> 00:07:38,580
yourself, the distortions you
自己去完成，你引入的失真

229
00:07:39,020 --> 00:07:40,300
introduce should be representative the source
应该对于背景声音

230
00:07:40,660 --> 00:07:42,010
of noises, or distortions, that
或者失真而言具有代表性，

231
00:07:42,110 --> 00:07:43,680
you might see in the test set.
这一点你会在测试集中发现。

232
00:07:44,010 --> 00:07:45,350
So, for the character recognition example,
所以，对于字符识别的例子，

233
00:07:45,930 --> 00:07:47,230
you know, the working things
你知道，开始引入的

234
00:07:47,440 --> 00:07:48,620
begin introduced are actually kind
可工作的事物实际上是

235
00:07:48,770 --> 00:07:49,980
of reasonable, because an image
某种合理的事物，因为

236
00:07:50,340 --> 00:07:51,510
A that looks like that, that's,
一个A的图像就是像那个样子，

237
00:07:52,000 --> 00:07:53,020
could be an image that
也就是说，是一个我们实际上可以在

238
00:07:53,210 --> 00:07:55,170
we could actually see in a test set.Reflect
一个测试集中看到的那样的图像。

239
00:07:55,370 --> 00:07:57,180
a fact And, you know, that
回想一下这一事实。 你知道，

240
00:07:57,380 --> 00:08:00,200
image on the upper-right, that
右上边的那个图像是一个

241
00:08:00,350 --> 00:08:01,800
could be an image that we could imagine seeing.
在我们想象中能看到的图像。

242
00:08:03,280 --> 00:08:04,570
And for audio, well, we do
对于音频而言，

243
00:08:04,740 --> 00:08:06,560
wanna recognize speech, even against
我们想要识别出的讲话，

244
00:08:06,970 --> 00:08:07,990
a bad self internal connection, against
即使是在一个差的在各种不同类型的

245
00:08:08,480 --> 00:08:09,440
different types of background noise, and
背景噪音中的自我内部连接，

246
00:08:09,590 --> 00:08:10,920
so for the audio, we're again
因此，对于音频，我们

247
00:08:11,230 --> 00:08:12,800
synthesizing examples are actually
再次合成样例，它们能够

248
00:08:13,530 --> 00:08:14,770
representative of the sorts of
代表不同类型的

249
00:08:14,850 --> 00:08:15,830
examples that we want to
我们想要区分开的样例，

250
00:08:15,990 --> 00:08:17,360
classify, that we want to recognize correctly.
而且我们也想要能够正确的识别。

251
00:08:18,770 --> 00:08:20,660
In contrast, usually it does
相反，通常当你

252
00:08:20,770 --> 00:08:21,940
not help perhaps you actually
把一个有意义的数据当作噪音

253
00:08:22,170 --> 00:08:23,760
a meaning as noise to your data.
加入到你的数据中，对工作没有多大帮助。

254
00:08:24,420 --> 00:08:25,170
I'm not sure you can see
我不确定你能明白这个，

255
00:08:25,440 --> 00:08:26,400
this, but what we've done
但是我们这里已经做的

256
00:08:26,620 --> 00:08:28,050
here is taken the image, and
就是拍摄了照片，

257
00:08:28,210 --> 00:08:29,540
for each pixel, in each
并且对于4幅

258
00:08:29,720 --> 00:08:30,710
of these 4 images, has just
图中的每一个像素，都加入

259
00:08:30,990 --> 00:08:32,970
added some random Gaussian noise to each pixel.
一些随机的高斯噪音。

260
00:08:33,240 --> 00:08:34,690
To each pixel, is the
对于每一个像素，

261
00:08:35,060 --> 00:08:36,370
pixel brightness, it would
也就是像素亮度，

262
00:08:36,500 --> 00:08:38,880
just add some, you know, maybe Gaussian random noise to each pixel.
加入一些可能是高斯随机噪音。

263
00:08:39,360 --> 00:08:40,940
So it's just a totally meaningless noise, right?
所以，它只是一个完全没有意义的噪音，对吧？

264
00:08:41,650 --> 00:08:43,280
And so, unless you're expecting
因此，除非你期望在你

265
00:08:43,800 --> 00:08:45,510
to see these sorts of pixel
的测试集中看到这种

266
00:08:45,910 --> 00:08:46,830
wise noise in your test
像素间的对比噪音，

267
00:08:46,910 --> 00:08:48,190
set, this sort of
那么这种纯随机的无意义

268
00:08:48,660 --> 00:08:51,540
purely random meaningless noise is less likely to be useful.
的噪音将很可能就是无用的。

269
00:08:52,880 --> 00:08:53,750
But the process of artificial
但是这种人工

270
00:08:54,250 --> 00:08:55,570
data synthesis it is you
数据合成的过程是你

271
00:08:55,640 --> 00:08:56,660
know a little bit of
知道一点技巧，

272
00:08:56,710 --> 00:08:57,850
an art as well and sometimes
并且有时候你

273
00:08:58,140 --> 00:09:00,250
you just have to try it and see if it works.
仅仅是尝试一下，看它是否起作用。

274
00:09:01,280 --> 00:09:02,060
But if you're trying to
但是当你在决定

275
00:09:02,140 --> 00:09:03,170
decide what sorts of distortions
该添加哪种

276
00:09:03,870 --> 00:09:04,720
to add, you know, do
失真时，

277
00:09:04,820 --> 00:09:06,260
think about what other meaningful
考虑其他你

278
00:09:06,670 --> 00:09:08,180
distortions you might add that
可能添加的有意义的失真，

279
00:09:08,660 --> 00:09:09,720
will cause you to generate additional
可能会导致你生成一些附加

280
00:09:10,110 --> 00:09:11,370
training examples that are at
训练样例，它们相对于你希望在

281
00:09:11,880 --> 00:09:13,410
least somewhat representative of the
测试集中见到的图像而言，不具有

282
00:09:13,480 --> 00:09:15,830
sorts of images you expect to see in your test sets.
代表性.

283
00:09:18,100 --> 00:09:19,000
Finally, to wrap up this
最后，在完成这一视频时，

284
00:09:19,150 --> 00:09:19,920
video, I just wanna say
我想说

285
00:09:20,140 --> 00:09:21,420
a couple of words, more about
几句话，更多的是关于

286
00:09:21,790 --> 00:09:23,360
this idea of getting loss
通过数据合成去解决

287
00:09:23,600 --> 00:09:25,610
of data via artificial data synthesis.
数据亏损问题。

288
00:09:26,920 --> 00:09:28,780
As always, before expending a lot
像以前一样，在作出

289
00:09:29,170 --> 00:09:30,280
of effort, you know, figuring out
努力之前，应该想到

290
00:09:30,450 --> 00:09:32,020
how to create artificial training
如何生成人工训练

291
00:09:33,060 --> 00:09:34,140
examples, it's often a good
样例，这是一个很好的

292
00:09:34,220 --> 00:09:35,310
practice is to make sure
做法，它可以确保

293
00:09:35,650 --> 00:09:36,540
that you really have a low biased
你真正有一个非常低偏差的

294
00:09:36,920 --> 00:09:38,350
crossfire, and having a
问题，有更多的

295
00:09:38,460 --> 00:09:40,320
lot more training data will be of help.
训练数据将会更有帮助。

296
00:09:41,010 --> 00:09:41,840
And standard way to do
一个标准物方法是

297
00:09:41,970 --> 00:09:42,810
this is to plot the learning
绘制一个学习

298
00:09:43,030 --> 00:09:43,970
curves, and make sure that
曲线，确定你只有

299
00:09:44,130 --> 00:09:44,920
you only have a low
一个低偏差

300
00:09:45,000 --> 00:09:47,470
as well, high variance falsifier.
同时也是高方差的伪造器。

301
00:09:47,760 --> 00:09:48,650
Or if you don't have a low
或者如果你没有一个低

302
00:09:48,720 --> 00:09:50,090
bias falsifier, you know,
偏差的伪造器，那么，

303
00:09:50,160 --> 00:09:51,040
one other thing that's worth trying
另外一件值得尝试的事情

304
00:09:51,450 --> 00:09:53,270
is to keep increasing the number
就是持续增加

305
00:09:53,540 --> 00:09:54,440
of features that your classifier
你的分类器的特征数量，

306
00:09:54,600 --> 00:09:55,650
has, increasing the number of
增加你的

307
00:09:55,740 --> 00:09:56,710
hidden units in your network,
网络的隐藏单元的数量，

308
00:09:57,180 --> 00:09:58,470
saying, until you actually have a
也就是说，直到你实际上有了

309
00:09:58,540 --> 00:10:00,000
low bias falsifier, and only
一个低偏差的伪造器，

310
00:10:00,310 --> 00:10:01,820
then, should you put
并且是直到那时，你才应该

311
00:10:02,040 --> 00:10:04,020
the effort into creating a
花费力气去生成

312
00:10:04,260 --> 00:10:05,760
large, artificial training set, so
大量的，人工的训练集，所以

313
00:10:05,860 --> 00:10:06,660
what you really want to avoid
你真正想要避免的

314
00:10:06,870 --> 00:10:07,930
is to, you know, spend
就是，花费一

315
00:10:08,110 --> 00:10:08,890
a whole week or spend a few
整个星期或者是花费

316
00:10:09,090 --> 00:10:10,370
months figuring out how
几个月去想该

317
00:10:10,540 --> 00:10:11,720
to get a great artificially
如何去得到一个大的

318
00:10:12,450 --> 00:10:13,260
synthesized data set.
人工合成的数据集。

319
00:10:13,820 --> 00:10:15,520
Only to realize afterward, that,
只实现后面的工作，

320
00:10:15,760 --> 00:10:17,410
you know, your learning algorithm, performance
你知道的，即使当你有大量的训练集，

321
00:10:18,030 --> 00:10:20,730
doesn't improve that much, even when you're given a huge training set.
你的学习算法的性能也不可能改进太多。

322
00:10:22,190 --> 00:10:23,060
So that's about my usual advice
所以，这也是我通常的

323
00:10:23,420 --> 00:10:24,690
about of a testing that
关于测试的一个建议，

324
00:10:25,030 --> 00:10:26,290
you really can make use
在你花费很大的力气去

325
00:10:26,530 --> 00:10:27,760
of a large training set before
寻找大的训练集以前，

326
00:10:28,080 --> 00:10:30,530
spending a lot of effort going out to get that large training set.
你可以真正利用这样一个大的训练集。

327
00:10:31,960 --> 00:10:33,280
Second is, when i'm working
第二，就是当我

328
00:10:33,590 --> 00:10:35,250
on machine learning problems, one question
在研究机器学习问题时，我经常问

329
00:10:35,690 --> 00:10:37,520
I often ask the team
和我一起工作的小组，

330
00:10:37,880 --> 00:10:39,210
I'm working with, often ask my
我也经常问我的学生

331
00:10:39,430 --> 00:10:40,550
students, which is, how much work
的一个问题，就是获得我当前拥有的数据

332
00:10:40,620 --> 00:10:42,810
would it be to get 10 times as much date as we currently had.
的10倍的数据将会花费多少工作。

333
00:10:46,720 --> 00:10:47,850
When I face a new machine
当我面临一个新的

334
00:10:48,200 --> 00:10:49,760
learning application very often I
机器学习应用时，通常

335
00:10:49,980 --> 00:10:50,940
will sit down with a team
我会和整个小组一起坐下来，

336
00:10:51,210 --> 00:10:52,440
and ask exactly this question,
仔细询问这个问题，

337
00:10:52,920 --> 00:10:53,870
I've asked this question over and
我已经反复的问过这个问题，

338
00:10:53,970 --> 00:10:55,870
over and over and I've
并且，我一直也很

339
00:10:56,000 --> 00:10:57,540
been very surprised how often
惊讶问题的答案

340
00:10:58,390 --> 00:10:59,660
this answer has been that.
通常都是那样。

341
00:11:00,010 --> 00:11:01,070
You know, it's really not that hard,
你知道，它其实没有那么难，

342
00:11:01,680 --> 00:11:02,670
maybe a few days of work
可能至多是

343
00:11:02,930 --> 00:11:03,930
at most, to get ten times
几天的工作，就可以获得

344
00:11:04,250 --> 00:11:05,300
as much data as we currently
10倍于当前

345
00:11:05,450 --> 00:11:06,650
have for a machine
我们的机器

346
00:11:06,810 --> 00:11:08,820
running application and very
运行应用的数据,并且

347
00:11:09,080 --> 00:11:09,830
often if you can get
通常是假如你有当前

348
00:11:09,950 --> 00:11:11,030
ten times as much data there
的十倍的数据，那么

349
00:11:11,270 --> 00:11:13,680
will be a way to make your algorithm do much better.
将会有办法使你的算法工作得更好。

350
00:11:14,060 --> 00:11:15,040
So, you know, if you
因此，假如

351
00:11:15,260 --> 00:11:16,510
ever join the product team
你曾经参加了

352
00:11:17,820 --> 00:11:18,880
working on some machine learning
过专门从事研发一些

353
00:11:19,110 --> 00:11:20,430
application product this is
机器学习应用产品的小组，

354
00:11:20,550 --> 00:11:21,710
a very good questions ask yourself
这是一个很好的你应该问自己、

355
00:11:22,290 --> 00:11:23,500
ask the team don't be
问小组的问题，不要太惊讶

356
00:11:23,650 --> 00:11:25,120
too surprised if after a
假如经过几分钟

357
00:11:25,240 --> 00:11:26,530
few minutes of brainstorming if your
的讨论后，你的小组就想到

358
00:11:26,650 --> 00:11:27,520
team comes up with a
了可以得到

359
00:11:27,660 --> 00:11:28,950
way to get literally ten
毫不夸张的十倍于

360
00:11:29,200 --> 00:11:30,250
times this much data, in
现在拥有的数据的办法，

361
00:11:30,380 --> 00:11:31,320
which case, I think you would
在这种情况下，

362
00:11:31,430 --> 00:11:32,330
be a hero to that team,
你将成为小组的英雄，

363
00:11:32,940 --> 00:11:34,000
because with 10 times as
因为如果有了十倍多的数据，

364
00:11:34,240 --> 00:11:35,360
much data, I think you'll really
并且从这么多的数据中学习，

365
00:11:35,450 --> 00:11:38,460
get much better performance, just from learning from so much data.
我认为你的算法将会有更好的性能。

366
00:11:39,650 --> 00:11:44,500
So there are several waysand
因此，有几个方法

367
00:11:47,450 --> 00:11:48,510
that comprised both the ideas
可以包括利用

368
00:11:48,970 --> 00:11:50,440
of generating data from
随机字体

369
00:11:50,640 --> 00:11:53,050
scratch using random fonts and so on.
来从头产生数据等思想。

370
00:11:53,570 --> 00:11:54,430
As well as the second idea
同样的，第二个思想

371
00:11:54,840 --> 00:11:56,600
of taking an existing example and
是利用现有的例子，

372
00:11:56,670 --> 00:11:58,100
and introducing distortions that amplify
引入一些失真

373
00:11:58,280 --> 00:12:00,910
to enlarge the training set A
来增加、扩大训练集。

374
00:12:01,090 --> 00:12:02,150
couple of other examples of
一些其它获得得更多

375
00:12:02,280 --> 00:12:03,130
ways to get a lot more
数据的方法

376
00:12:03,270 --> 00:12:04,610
data are to collect the
的例子就是就是收集数据

377
00:12:04,670 --> 00:12:06,600
data or to label them yourself.
或者自己给他们加标签。

378
00:12:07,600 --> 00:12:09,090
So one useful calculation that
因此一个我经常

379
00:12:09,210 --> 00:12:11,580
I often do is, you know,
做的有用的计算就是

380
00:12:11,780 --> 00:12:13,320
how many minutes, how many
获得一定数量

381
00:12:13,520 --> 00:12:15,140
hours does it take to
的样例需要

382
00:12:15,350 --> 00:12:16,420
get a certain number of
多少分钟、多少小时，

383
00:12:16,610 --> 00:12:17,780
examples, so actually sit down and
所以实际上是

384
00:12:17,900 --> 00:12:19,410
figure out, you know, suppose it
坐下来估计一下，假如

385
00:12:19,550 --> 00:12:21,830
takes me ten seconds to
给一个样例加标签

386
00:12:22,060 --> 00:12:23,990
label one example then
需要十分秒钟，然后，再假设

387
00:12:24,120 --> 00:12:25,820
and, suppose that, for
在我们的应用中，

388
00:12:26,190 --> 00:12:29,050
our application, currently we
我们当前

389
00:12:29,190 --> 00:12:31,500
have 1000 labeled examples examples
有1000个标记过的样例。

390
00:12:31,620 --> 00:12:32,730
so ten times as
所以得到

391
00:12:32,860 --> 00:12:34,090
much of that would be
十倍数据的时间

392
00:12:34,200 --> 00:12:35,940
if n were equal to ten thousand.
n就是一万秒。

393
00:12:37,440 --> 00:12:40,260
A second way to
第二个获得

394
00:12:40,400 --> 00:12:41,530
get a lot of data is
更多数据的办法就是

395
00:12:41,800 --> 00:12:43,540
to just collect the data and you label it yourself.
收集数据，然后自己标记它。

396
00:12:44,510 --> 00:12:45,380
So what I mean by this is
所以，我这么说的意思是

397
00:12:45,690 --> 00:12:46,970
I will often set down and
我经常会坐下来

398
00:12:47,240 --> 00:12:48,570
do a calculation to figure
并且计算出

399
00:12:48,950 --> 00:12:50,190
out how much time, you
需要多少时间，

400
00:12:50,350 --> 00:12:51,140
know just like how many hours
比如你应该知道

401
00:12:52,640 --> 00:12:54,000
will it take, how many
这需要多少小时，也就是

402
00:12:54,200 --> 00:12:55,130
hours or how many days will
你或者其它人

403
00:12:55,230 --> 00:12:56,890
it take for me or
坐下来并且

404
00:12:57,020 --> 00:12:58,400
for someone else to just sit
收集十倍于

405
00:12:58,640 --> 00:12:59,870
down and collect ten times
当前所拥有的

406
00:13:00,190 --> 00:13:01,490
as much data, as we have
数据，并给他们

407
00:13:01,800 --> 00:13:03,560
currently, by collecting the data ourselves and labeling them ourselves.
加标签需要多少小时，多少天。

408
00:13:05,260 --> 00:13:06,550
So, for example, that, for
例如，

409
00:13:06,630 --> 00:13:08,200
our machine learning application, currently
我们的机器学习应用，当前

410
00:13:08,690 --> 00:13:10,180
we have 1,000 examples, so M 1,000.
我们有1000个样例，所以M等于1000.

411
00:13:12,010 --> 00:13:12,750
That what we do is sit
所以我们做的就是

412
00:13:12,870 --> 00:13:14,500
down and ask, how long does
就下来，问一下，实际收集

413
00:13:14,720 --> 00:13:16,930
it take me really to collect and label one example.
并给一个样例加标签需要多少时间。

414
00:13:17,340 --> 00:13:18,480
And sometimes maybe it will
有时它可能会

415
00:13:18,600 --> 00:13:19,510
take you, you know ten
花费你十秒

416
00:13:19,790 --> 00:13:22,100
seconds to label
去给一个新样例

417
00:13:23,310 --> 00:13:25,120
one new example, and so
加标签，因此，如果我需要

418
00:13:25,520 --> 00:13:27,720
if I want 10 X as many examples, I'd do a calculation.
十倍多的样例，我就会做一个计算。

419
00:13:28,360 --> 00:13:30,400
If it takes me 10 seconds to get one training example.
假如我需要10秒去得到一个训练样例，

420
00:13:31,370 --> 00:13:32,340
If I wanted to get 10
如果我需要10倍多数据，

421
00:13:32,580 --> 00:13:35,320
times as much data, then I need 10,000 examples.
那么我需要10000个样例。

422
00:13:35,830 --> 00:13:38,470
So I do the calculation, how long
所以我做一下计算，看一下

423
00:13:38,770 --> 00:13:40,380
is it gonna take to label,
加标签需要多少时间，也就是手工

424
00:13:40,840 --> 00:13:42,640
to manually label 10,000 examples,
给10000个样例加标签需要多少时间。

425
00:13:43,340 --> 00:13:45,280
if it takes me 10 seconds to label 1 example.
假如给一个样例加标签需要10秒。

426
00:13:47,070 --> 00:13:47,940
So when you do this calculation,
所以当你计算的时候，

427
00:13:48,840 --> 00:13:49,920
often I've seen many you
我经常看到许多人

428
00:13:50,390 --> 00:13:51,780
would be surprised, you know,
会很吃惊，时间多么短，

429
00:13:51,870 --> 00:13:53,140
how little, or sometimes a
有时只是

430
00:13:53,240 --> 00:13:54,730
few days at work, sometimes a
几天的工作，

431
00:13:54,880 --> 00:13:55,560
small number of days of work,
有时只是少数的几天，

432
00:13:55,780 --> 00:13:57,180
well I've seen many teams be very
我已经看过许多小组

433
00:13:57,500 --> 00:13:59,160
surprised that sometimes how
感到很吃惊，

434
00:13:59,340 --> 00:14:00,280
little work it could be,
获得更多的数据

435
00:14:00,410 --> 00:14:01,200
to just get a lot more
需要的工作是多么少，

436
00:14:01,370 --> 00:14:02,510
data, and let that be
把这个

437
00:14:02,580 --> 00:14:03,470
a way to give your learning
做为大力提高

438
00:14:03,580 --> 00:14:04,310
app to give you a huge boost
提高你机器学习应用

439
00:14:04,640 --> 00:14:06,350
in performance, and necessarily, you
性能的方法，你知道这是必然的，

440
00:14:06,450 --> 00:14:07,550
know, sometimes when you've just
有时当你实际

441
00:14:07,790 --> 00:14:08,900
managed to do this, you
做到这一点了，

442
00:14:09,190 --> 00:14:10,780
will be a hero and whatever product
你将是一个英雄，无论是

443
00:14:11,360 --> 00:14:12,520
development, whatever team you're working
什么产品开发，

444
00:14:12,910 --> 00:14:14,150
on, because this can
无论你在哪个小组工作，

445
00:14:14,320 --> 00:14:15,760
be a great way to get much better performance.
因为这是一个非常好提高性能的方法。

446
00:14:17,650 --> 00:14:19,490
Third and finally, one sometimes
第三个也是最后一个，

447
00:14:20,020 --> 00:14:21,230
good way to get a
有时一个获得更多

448
00:14:21,450 --> 00:14:22,650
lot of data is to use
数据的方法

449
00:14:23,080 --> 00:14:24,350
what's now called crowd sourcing.
叫做众包。

450
00:14:25,280 --> 00:14:26,350
So today, there are a
现在，有一些

451
00:14:26,520 --> 00:14:27,270
few websites or a few
网站，或者是一些服务，

452
00:14:27,460 --> 00:14:29,520
services that allow you
允许你

453
00:14:29,920 --> 00:14:32,210
to hire people on
在网上雇一些人，

454
00:14:32,350 --> 00:14:33,410
the web to, you know, fairly
你知道的，可以非常便宜的

455
00:14:33,730 --> 00:14:36,140
inexpensively label large training sets for you.
帮你标记大量的训练集。

456
00:14:36,810 --> 00:14:37,870
So this idea of crowd
所以众包的思想，

457
00:14:38,190 --> 00:14:39,460
sourcing, or crowd sourced
或者叫众包

458
00:14:39,950 --> 00:14:41,390
data labeling, is something
标记数据，

459
00:14:41,810 --> 00:14:43,180
that has, is obviously, like
很明显有点像

460
00:14:43,340 --> 00:14:45,200
an entire academic literature,
一个完整的学术文化，

461
00:14:45,660 --> 00:14:47,040
has some of it's own complications and
有它自己的诸多问题，

462
00:14:47,210 --> 00:14:49,390
so on, pertaining to labeler reliability.
它很依赖于加标记的人的可靠性。

463
00:14:50,440 --> 00:14:51,470
Maybe, you know, hundreds of thousands
可能，你知道，世界上

464
00:14:51,860 --> 00:14:53,420
of labelers, around the
有成千上万的

465
00:14:53,580 --> 00:14:55,530
world, working fairly inexpensively to
标记人员在廉价地工作，

466
00:14:55,630 --> 00:14:56,810
help label data for you,
帮你给数据加标签，

467
00:14:57,030 --> 00:14:58,580
and that I've just had mentioned,
正如我刚才所说的，

468
00:14:58,930 --> 00:15:00,120
there's this one alternative as well.
这也是一个可供选择的办法。

469
00:15:00,390 --> 00:15:02,170
And probably Amazon Mechanical Turk
亚马逊土耳其机器人

470
00:15:02,510 --> 00:15:03,750
systems is probably the most
系统可能是当前

471
00:15:03,900 --> 00:15:05,860
popular crowd sourcing option right now.
最流行的众包选择。

472
00:15:06,860 --> 00:15:08,070
This is often quite a
假如你想

473
00:15:08,220 --> 00:15:10,040
bit of work to
得到高质量的标签，

474
00:15:10,190 --> 00:15:10,940
get to work, if you want
这个通常是

475
00:15:11,150 --> 00:15:12,520
to get very high quality labels,
很容易完成的工作，

476
00:15:12,780 --> 00:15:14,160
but is sometimes an
但是这有时也是一个

477
00:15:14,240 --> 00:15:15,760
option worth considering as well.
很值得考虑的选择。

478
00:15:17,330 --> 00:15:18,870
If you want to try to
如果你想

479
00:15:19,320 --> 00:15:21,000
hire many people, fairly inexpensively
在网上花费不多

480
00:15:21,810 --> 00:15:24,220
on the web, our labels launch miles of data for you.
雇佣很多人，我们的标记可以做为数据投放到市场。

481
00:15:26,320 --> 00:15:27,570
So this video, we
所以，这一个视频，

482
00:15:27,660 --> 00:15:28,840
talked about the idea of
我们讨论了

483
00:15:29,100 --> 00:15:30,870
artificial data synthesis of
利用人工数据合成的方法

484
00:15:31,120 --> 00:15:32,440
either creating new data
从头开始

485
00:15:32,750 --> 00:15:34,400
from scratch, looking, using
生成新的数据，察看夯基金

486
00:15:34,640 --> 00:15:35,400
the ramming funds as an example,
并以它为例，

487
00:15:35,830 --> 00:15:37,710
or by amplifying an
或者是使用

488
00:15:37,790 --> 00:15:38,980
existing training set, by taking
现有的训练集，引入

489
00:15:39,420 --> 00:15:41,340
existing label examples and

490
00:15:41,560 --> 00:15:42,980
introducing distortions to it,
一些失真，以生成另外

491
00:15:43,240 --> 00:15:44,880
to sort of create extra label examples.
以生成额外的标签样例。

492
00:15:46,010 --> 00:15:47,450
And finally, one thing that
最后一件

493
00:15:47,630 --> 00:15:48,810
I hope you remember from this
我希望你通过这个视频

494
00:15:49,120 --> 00:15:49,970
video this idea of if
能记住的事情就是

495
00:15:50,540 --> 00:15:51,540
you are facing a machine learning
假如你正面对

496
00:15:51,830 --> 00:15:54,350
problem, it is often worth doing two things.
一个机器学习问题，有两件事情是值得做的。

497
00:15:54,660 --> 00:15:55,830
One just a sanity check,
一个就是通过学习曲线做一个完整的检查，

498
00:15:56,160 --> 00:15:58,600
with learning curves, that having more data would help.
有更多的数据将会更有力。

499
00:15:59,520 --> 00:16:00,340
And second, assuming that that's the case,
第二个就是，假如情况是这样，

500
00:16:00,730 --> 00:16:01,780
I will often seat down and
我通常会坐下来，

501
00:16:01,850 --> 00:16:03,670
ask yourself seriously: what would
认真的问自己，

502
00:16:04,050 --> 00:16:05,150
it take to get ten times as
获取到十倍于

503
00:16:05,260 --> 00:16:06,510
much creative data as you
当前的创造

504
00:16:06,630 --> 00:16:08,450
currently have, and not always,
出来的数据，不是经常，

505
00:16:08,960 --> 00:16:10,440
but sometimes, you may be
但是有时，你会

506
00:16:10,640 --> 00:16:12,310
surprised by how easy that
吃惊的发现，

507
00:16:12,580 --> 00:16:13,990
turns out to be, maybe
事实上这很简单，可能需要几天，

508
00:16:14,060 --> 00:16:15,020
a few days, a few weeks at
可能是几个周的工作，

509
00:16:15,150 --> 00:16:16,160
work, and that can be
并且这是一个很好的方法，

510
00:16:16,260 --> 00:16:18,700
a great way to give your learning algorithm a huge boost in performance
它可以大幅提高你的机器学习方法的性能。