srt/18 - 2 - Sliding Windows (15 min).srt

﻿1
00:00:00,370 --> 00:00:01,590
In the previous video, we talked
在上一个视频中，我们讨论了（字幕翻译：中国海洋大学，王玺）

2
00:00:01,890 --> 00:00:04,570
about the photo OCR pipeline and how that worked.
OCR管道及其工作原理

3
00:00:05,480 --> 00:00:06,370
In which we would take an image
在OCR管道中我们可以取一张图

4
00:00:07,050 --> 00:00:08,070
and pass the image  Through a
（pass the image，不懂）通过

5
00:00:08,130 --> 00:00:10,010
sequence of machine learning
一系列的机器学习

6
00:00:10,280 --> 00:00:11,680
components in order to
组件来

7
00:00:11,890 --> 00:00:13,820
try to read the text that appears in an image.
尝试读取图片上的文字

8
00:00:14,590 --> 00:00:15,820
In this video I like to tell
今天的视频里，我打算讲

9
00:00:16,210 --> 00:00:17,360
A little bit more about how the
多一点关于流水线

10
00:00:17,780 --> 00:00:20,310
individual components of the pipeline works.
的每个组件的工作原理

11
00:00:21,270 --> 00:00:24,070
In particular most of this video will center around the discussion.
特别的，本视频将会把重点放在讨论

12
00:00:24,680 --> 00:00:25,950
of whats called a sliding windows.
滑动窗口上

13
00:00:26,750 --> 00:00:31,570
The first stage
滤波的

14
00:00:32,000 --> 00:00:33,390
of the filter was the
第一步是

15
00:00:33,730 --> 00:00:35,090
Text detection where we look
确定文字位置，例如我们现在

16
00:00:35,330 --> 00:00:36,640
at an image like this and try
看到的这一幅图片，尝试去

17
00:00:37,020 --> 00:00:39,320
to find the regions of text that appear in this image.
找到图片中文字出现的区域

18
00:00:39,850 --> 00:00:42,490
Text detection is an unusual problem in computer vision.
文字识别对于计算机来说，是一个不寻常的问题

19
00:00:43,220 --> 00:00:44,820
Because depending on the length
因为根据你要尝试

20
00:00:45,140 --> 00:00:46,150
of the text you're trying to
找到的文字的长度

21
00:00:46,290 --> 00:00:47,870
find, these rectangles that you're
这些你要寻找的矩形

22
00:00:47,970 --> 00:00:49,600
trying to find can have different aspect ratios
具有不同的长宽比例

23
00:00:51,100 --> 00:00:52,060
So in order to talk
所以为了讲述如何

24
00:00:52,220 --> 00:00:53,550
about detecting things in images
在图片中发现事物

25
00:00:54,300 --> 00:00:55,860
let's start with a simpler example
我们首先从一个简单点的例子开始

26
00:00:56,550 --> 00:01:00,080
of pedestrian detection and we'll then later go back to apply
即行人检测，然后我们讲如何将

27
00:01:00,460 --> 00:01:02,300
Ideas that were developed
行人检测中的思路用到

28
00:01:02,570 --> 00:01:04,840
in pedestrian detection and apply them to text detection.
文字识别中去

29
00:01:06,280 --> 00:01:08,010
So in pedestrian detection you want
在行人检测中

30
00:01:08,360 --> 00:01:09,440
to take an image that looks
你取一张类似这样的图片

31
00:01:09,600 --> 00:01:11,010
like this and find the
目的就是寻找

32
00:01:11,160 --> 00:01:12,920
individual pedestrians that appear in the image.
图片中的行人

33
00:01:13,260 --> 00:01:14,440
So there's one pedestrian that we
我们找到一个人，

34
00:01:14,520 --> 00:01:15,550
found, there's a second
两个，

35
00:01:15,780 --> 00:01:17,920
one, a third one a fourth one, a fifth one.
三个，四个，五个

36
00:01:18,290 --> 00:01:19,390
And a sixth one.
六个

37
00:01:19,560 --> 00:01:20,990
This problem is maybe slightly
这个问题与文字识别相比，

38
00:01:21,320 --> 00:01:22,770
simpler than text detection just
简单的地方在于：

39
00:01:23,100 --> 00:01:24,200
for the reason that the aspect
你要识别的东西

40
00:01:24,560 --> 00:01:27,490
ratio of most pedestrians are pretty similar.
具有相似的长宽比

41
00:01:28,170 --> 00:01:29,280
Just using a fixed aspect
仅仅使用一个固定

42
00:01:29,630 --> 00:01:31,960
ratio for these rectangles that we're trying to find.
的长宽比就基本可以了

43
00:01:32,420 --> 00:01:33,610
So by aspect ratio I mean
aspect ratio意思是

44
00:01:33,920 --> 00:01:36,420
the ratio between the height and the width of these rectangles.
矩形的高度和宽度之比

45
00:01:37,820 --> 00:01:38,190
They're all the same.
它们都是一样的

46
00:01:38,650 --> 00:01:40,120
for different pedestrians but for
对于不同的行人来说，但是

47
00:01:40,490 --> 00:01:42,650
text detection the height
对于文字来说

48
00:01:43,030 --> 00:01:44,560
and width ratio is different
不同行的文字

49
00:01:44,960 --> 00:01:45,830
for different lines of text
具有不同的比例

50
00:01:46,460 --> 00:01:47,940
Although for pedestrian detection, the
对于行人检测，尽管行人距离

51
00:01:48,020 --> 00:01:49,250
pedestrians can be different distances
摄像头的距离可能

52
00:01:49,810 --> 00:01:51,250
away from the camera and
不同，因此

53
00:01:51,390 --> 00:01:52,730
so the height of these rectangles
矩形的高度

54
00:01:53,380 --> 00:01:55,600
can be different depending on how far away they are.
不一致

55
00:01:55,990 --> 00:01:57,090
but the aspect ratio is the same.
但比例还是维持不变的

56
00:01:57,720 --> 00:01:58,880
In order to build a pedestrian
为了建立一个行人检测系统

57
00:01:59,440 --> 00:02:02,460
detection system here's how you can go about it.
你需要这么做

58
00:02:02,520 --> 00:02:03,650
Let's say that we decide to
例如我们决定要

59
00:02:03,970 --> 00:02:06,100
standardize on this aspect
使用82*36的比例

60
00:02:06,690 --> 00:02:08,010
ratio of 82 by 36
来进行标准化

61
00:02:08,180 --> 00:02:10,040
and we could
当然我们也可以

62
00:02:10,330 --> 00:02:11,510
have chosen some rounded number
选择一些近似的数字

63
00:02:12,020 --> 00:02:14,000
like 80 by 40 or something, but 82 by 36 seems alright.
比如80*40，但82*36看上去是可行的

64
00:02:16,110 --> 00:02:17,280
What we would do is then go
我们将要做的是

65
00:02:17,650 --> 00:02:20,420
out and collect large training sets of positive and negative examples.
出去搜集一些正例和反例

66
00:02:21,240 --> 00:02:22,790
Here are examples of 82
这里有一些

67
00:02:22,900 --> 00:02:24,230
X 36 image patches that do
符合比例的图片

68
00:02:24,360 --> 00:02:26,230
contain pedestrians and here are
以及一些不符合比例

69
00:02:26,550 --> 00:02:28,360
examples of images that do not.
的图片

70
00:02:29,470 --> 00:02:30,710
On this slide I show 12
在这个幻灯片里我展示了12个

71
00:02:31,050 --> 00:02:33,170
positive examples of y=1
正例，用y=1表示

72
00:02:33,730 --> 00:02:34,990
and 12 examples of y=0.
12个反例用y=0表示

73
00:02:36,410 --> 00:02:37,790
In a more typical pedestrian detection
在一个更典型的行人检测应用中，

74
00:02:38,180 --> 00:02:39,200
application, we may have
我们可以会有

75
00:02:39,500 --> 00:02:40,880
anywhere from a 1,000 training
从1000

76
00:02:41,230 --> 00:02:42,210
examples up to maybe
到10000

77
00:02:42,300 --> 00:02:44,410
10,000 training examples, or
个数目的例子，或者

78
00:02:44,460 --> 00:02:45,360
even more if you can
更多，如果你能够

79
00:02:45,510 --> 00:02:47,180
get even larger training sets.
获取到更大的训练集合

80
00:02:47,460 --> 00:02:48,590
And what you can do, is then train
然后，你可以

81
00:02:48,910 --> 00:02:50,160
in your network or some
在你的网络中训练，或者使用

82
00:02:50,510 --> 00:02:52,420
other learning algorithm to
其他学习算法

83
00:02:52,610 --> 00:02:54,570
take this input, an image
来接收这个输入，一个82*36

84
00:02:54,970 --> 00:02:56,710
patch of dimension 82 by
的小图块

85
00:02:56,850 --> 00:02:59,180
36, and to classify  'y'
来划分y

86
00:02:59,710 --> 00:03:01,070
and to classify that image patch
来划分每个图块是否

87
00:03:01,510 --> 00:03:03,850
as either containing a pedestrian or not.
包含一个行人

88
00:03:05,250 --> 00:03:06,250
So this gives you a way
So 这给了你一个

89
00:03:06,470 --> 00:03:08,050
of applying supervised learning in
应用监督学习的方法

90
00:03:08,210 --> 00:03:09,290
order to take an image
来对一个图块进行处理

91
00:03:09,530 --> 00:03:12,420
patch can determine whether or not a pedestrian appears in that image capture.
判断其是否包含有行人

92
00:03:14,310 --> 00:03:15,190
Now, lets say we get
现在，假设我们得到

93
00:03:15,400 --> 00:03:16,520
a new image, a test set
一个新的图片，一个测试集合

94
00:03:16,850 --> 00:03:17,920
image like this and we
图片（类似这个）

95
00:03:18,030 --> 00:03:20,240
want to try to find a pedestrian's picture image.
我们尝试寻找一个行人的图片

96
00:03:21,520 --> 00:03:22,340
What we would do is start
我们首先

97
00:03:22,670 --> 00:03:25,140
by taking a rectangular patch of this image.
在图片中选取一个矩形块

98
00:03:25,580 --> 00:03:26,800
Like that shown up here, so
像这里标注的，

99
00:03:26,900 --> 00:03:27,930
that's maybe a 82 X
例如这是图片中的一个

100
00:03:28,010 --> 00:03:29,440
36 patch of this image,
82*36的图块

101
00:03:30,270 --> 00:03:31,530
and run that image patch through
在我们的分类器里

102
00:03:31,830 --> 00:03:33,660
our classifier to determine whether
运行这个图块，验证

103
00:03:33,840 --> 00:03:34,900
or not there is a
是否图块中

104
00:03:34,980 --> 00:03:36,310
pedestrian in that image patch,
是否有行人

105
00:03:36,620 --> 00:03:38,100
and hopefully our classifier will
期望我们的分类器返回

106
00:03:38,260 --> 00:03:40,600
return y equals 0 for that patch, since there is no pedestrian.
0或者1，对应是否有行人

107
00:03:42,020 --> 00:03:42,900
Next, we then take that green
接下来，我们将

108
00:03:43,140 --> 00:03:44,380
rectangle and we slide it
绿色矩形

109
00:03:44,490 --> 00:03:45,680
over a bit and then
滑动一点

110
00:03:45,940 --> 00:03:47,180
run that new image patch
然后通过

111
00:03:47,560 --> 00:03:49,700
through our classifier to decide if there's a pedestrian there.
我们的分类器来决定是否有行人。

112
00:03:50,760 --> 00:03:51,740
And having done that, we then
完成后，我们

113
00:03:51,920 --> 00:03:53,070
slide the window further to the
滑动窗口向右

114
00:03:53,160 --> 00:03:54,160
right and run that patch
再次

115
00:03:54,420 --> 00:03:56,690
through the classifier again.
运行分类器

116
00:03:56,970 --> 00:03:57,850
The amount by which you shift
每次矩形

117
00:03:58,280 --> 00:03:59,770
the rectangle over each time
移动距离

118
00:04:00,260 --> 00:04:01,720
is a parameter, that's sometimes
是一个参数，有时

119
00:04:02,190 --> 00:04:04,000
called the step size of the
称之为步长

120
00:04:04,070 --> 00:04:06,020
parameter, sometimes also called
有时也被称为

121
00:04:06,380 --> 00:04:08,970
the slide parameter, and if
滑动参数，如果

122
00:04:09,120 --> 00:04:11,050
you step this one pixel at a time.
你一次移动一个像素

123
00:04:11,210 --> 00:04:12,020
So you can use the step size
所以你可以使用步长

124
00:04:12,360 --> 00:04:14,020
or stride of 1, that usually
为1，通常

125
00:04:14,340 --> 00:04:15,560
performs best, but is
表现最好，但

126
00:04:15,700 --> 00:04:16,960
more computational  expensive, and
计算成本较高，如果

127
00:04:17,430 --> 00:04:18,940
so using a step size of
使用步长

128
00:04:19,090 --> 00:04:20,010
maybe 4 pixels at a
为4像素

129
00:04:20,210 --> 00:04:20,970
time, or eight pixels at a
或8像素

130
00:04:21,250 --> 00:04:22,350
time or some large number of
或一些更大的数

131
00:04:22,550 --> 00:04:23,600
pixels might be more common,
可能更常见

132
00:04:24,010 --> 00:04:25,320
since you're then moving the
因为你每次

133
00:04:25,430 --> 00:04:26,570
rectangle a little bit
你移动的距离

134
00:04:26,700 --> 00:04:28,570
more each time.
可以更大

135
00:04:28,870 --> 00:04:30,090
So, using this process, you continue
所以，使用这个程序，你继续

136
00:04:30,870 --> 00:04:32,310
stepping the rectangle over to
向右移动矩形

137
00:04:32,340 --> 00:04:33,160
the right a bit at a
每次一点点距离

138
00:04:33,370 --> 00:04:34,450
time and running each of
然后运行分类器

139
00:04:34,520 --> 00:04:35,780
these patches through a classifier,
对图块进行分类

140
00:04:36,620 --> 00:04:38,220
until eventually, as you
直到最后，随着

141
00:04:38,900 --> 00:04:42,080
slide this window over the
你在图片的不同位置

142
00:04:42,150 --> 00:04:43,340
different locations in the image,
滑动这个矩形

143
00:04:43,550 --> 00:04:44,680
first starting with the first
首先从第一行

144
00:04:44,850 --> 00:04:46,080
row and then we
然后我们

145
00:04:46,160 --> 00:04:47,580
go further rows in
滑动到下一行

146
00:04:47,710 --> 00:04:49,100
the image, you would
你使用某个某个步长

147
00:04:49,290 --> 00:04:50,490
then run all of
对这些不同的图块

148
00:04:50,550 --> 00:04:52,070
these different image patches at
应用某个步长

149
00:04:52,240 --> 00:04:53,330
some step size or some
通过分类器

150
00:04:53,430 --> 00:04:54,990
stride through your classifier.
进行分类

151
00:04:56,990 --> 00:04:57,870
Now, that was a pretty
现在，这是一个相当

152
00:04:57,970 --> 00:04:59,870
small rectangle, that would only
小的矩形，这只会

153
00:05:00,310 --> 00:05:02,310
detect pedestrians of one specific size.
检测一个特定大小的行人。

154
00:05:02,780 --> 00:05:04,210
What we do next is
接下来我们做什么

155
00:05:04,470 --> 00:05:05,990
start to look at larger image patches.
开始使用更大的图块

156
00:05:06,730 --> 00:05:08,270
So now let's take larger images
现在让我们以更大的图片

157
00:05:08,610 --> 00:05:09,700
patches, like those shown here
块，如图

158
00:05:10,310 --> 00:05:11,960
and run those through the classifier as well.
然后也通过分类器运行

159
00:05:13,540 --> 00:05:14,320
And by the way when I say
当我说

160
00:05:14,600 --> 00:05:15,830
take a larger image patch, what
以较大的图块，

161
00:05:16,080 --> 00:05:17,780
I really mean is when you
我的意思是当你

162
00:05:17,860 --> 00:05:18,850
take an image patch like this,
选取这样的图块，

163
00:05:19,490 --> 00:05:20,720
what you're really doing is taking
你真正做的是

164
00:05:20,880 --> 00:05:22,110
that image patch, and resizing
选择图像块，并调整大小

165
00:05:22,800 --> 00:05:24,750
it down to 82 X 36, say.
下降到82×36，

166
00:05:25,000 --> 00:05:26,260
So you take this larger
所以你拿这个更大的

167
00:05:26,550 --> 00:05:28,180
patch and re-size it to
块和调整其大小

168
00:05:28,300 --> 00:05:29,800
be a smaller image and then
成为更小的图，然后

169
00:05:29,970 --> 00:05:31,260
the smaller re-sized image
用这个图块

170
00:05:31,600 --> 00:05:32,620
that is what you
在分类器中

171
00:05:32,990 --> 00:05:35,340
would pass through your classifier to try and decide if there is a pedestrian in that patch.
运行，然后决定是否有行人。

172
00:05:37,230 --> 00:05:38,310
And finally you can do
最后你可以

173
00:05:38,470 --> 00:05:39,530
this at an even larger
在一个更大

174
00:05:39,930 --> 00:05:41,870
scales and run
规模做这一步

175
00:05:42,080 --> 00:05:43,830
that side of Windows to
运行滑动窗口知直到

176
00:05:43,980 --> 00:05:45,920
the end And after
结束，经过

177
00:05:45,980 --> 00:05:47,480
this whole process hopefully your algorithm
这个过程，希望你的算法

178
00:05:48,040 --> 00:05:49,670
will detect whether theres pedestrian
将检测到是否有行人

179
00:05:50,140 --> 00:05:52,070
appears in the image, so
在图中出现，所以

180
00:05:52,470 --> 00:05:53,850
thats how you train a
这就是你如何训练一个

181
00:05:54,290 --> 00:05:55,630
the classifier, and then
分类器，然后

182
00:05:55,890 --> 00:05:57,360
use a sliding windows classifier,
使用滑动窗口分类，

183
00:05:57,920 --> 00:05:59,820
or use a sliding windows detector in
或使用一个滑动窗口检测器

184
00:05:59,970 --> 00:06:01,740
order to find pedestrians in the image.
去寻找图像中的行人。

185
00:06:03,070 --> 00:06:04,050
Let's have a turn to the
让我们转向

186
00:06:04,150 --> 00:06:05,910
text detection example and talk
文本检测的例子，讨论

187
00:06:06,100 --> 00:06:07,490
about that stage in our
那个阶段，在我们

188
00:06:07,790 --> 00:06:09,330
photo OCR pipeline, where our
的照片OCR管道，我们

189
00:06:09,570 --> 00:06:11,340
goal is to find the text regions in unit.
的目标是找到一个个的文本区域。

190
00:06:13,250 --> 00:06:15,010
similar to pedestrian detection you
与行人检测类似，你

191
00:06:15,250 --> 00:06:16,730
can come up with a label
能拿到具有标签的

192
00:06:17,030 --> 00:06:18,410
training set with positive examples
的正例集合

193
00:06:19,060 --> 00:06:20,930
and negative examples with examples
和负例集合

194
00:06:21,530 --> 00:06:23,810
corresponding to regions where text appears.
对应文字出现的区域

195
00:06:24,300 --> 00:06:27,290
So instead of trying to detect pedestrians, we're now trying to detect texts.
所以不再进行行人检测，我们现在尝试检测文本。

196
00:06:28,130 --> 00:06:29,670
And so positive examples are going
正面的样本是

197
00:06:29,770 --> 00:06:31,640
to be patches of images where there is text.
具有文字的图块

198
00:06:31,970 --> 00:06:33,330
And negative examples is going
负面的样本是

199
00:06:33,380 --> 00:06:36,000
to be patches of images where there isn't text.
没有文字的

200
00:06:36,330 --> 00:06:37,530
Having trained this we can
训练完之后我们

201
00:06:38,030 --> 00:06:39,450
now apply it to a
可以将其应用到

202
00:06:39,870 --> 00:06:41,190
new image, into a test
新的图片上，一个测试集

203
00:06:42,460 --> 00:06:42,910
set image.
图片

204
00:06:43,310 --> 00:06:44,900
So here's the image that we've been using as example.
这是我们作为例子的图片

205
00:06:46,040 --> 00:06:47,300
Now, last time we run,
现在，上次我们运行

206
00:06:47,440 --> 00:06:48,400
for this example we are going
这个样本，我们将

207
00:06:48,560 --> 00:06:50,300
to run a sliding windows at
运行一个滑动窗口

208
00:06:50,640 --> 00:06:52,030
just one fixed scale just
在一个固定的比例

209
00:06:52,370 --> 00:06:54,360
for purpose of illustration, meaning that
只是为了进行说明，意味着

210
00:06:54,450 --> 00:06:56,000
I'm going to use just one rectangle size.
我将使用一个矩形的大小。

211
00:06:56,790 --> 00:06:58,110
But lets say I run my little
但如果我

212
00:06:58,350 --> 00:07:00,070
sliding windows classifier on lots
使用这个小的滑动窗口分类器在

213
00:07:00,170 --> 00:07:01,570
of little image patches like
在大量类似这样的小图块上

214
00:07:01,630 --> 00:07:04,340
this if I
如果我

215
00:07:04,430 --> 00:07:05,430
do that, what Ill end
这样做，我将

216
00:07:05,530 --> 00:07:06,670
up with is a result
得到一个结果

217
00:07:07,040 --> 00:07:08,530
like this where the white
像这样，白色

218
00:07:08,900 --> 00:07:10,700
region show where my
区域显示我的

219
00:07:10,940 --> 00:07:12,190
text detection system has found
文本检测系统发现了

220
00:07:12,210 --> 00:07:15,960
text and so the axis' of these two figures are the same.
文本，这两个图的坐标是一样的

221
00:07:16,390 --> 00:07:17,700
So there is a region
有一个这样的区域

222
00:07:18,110 --> 00:07:19,200
up here, of course also
在这里，当然也

223
00:07:19,230 --> 00:07:20,710
a region up here, so the
有一个区域在这里，所以

224
00:07:20,840 --> 00:07:22,040
fact that this black up here
黑色区域

225
00:07:22,850 --> 00:07:24,390
represents that the classifier
表示分类器

226
00:07:24,840 --> 00:07:25,940
does not think it's found any
并不认为它发现

227
00:07:26,170 --> 00:07:28,100
texts up there, whereas the
了文字，而

228
00:07:28,170 --> 00:07:29,630
fact that there's a lot
事实上有很多

229
00:07:29,810 --> 00:07:31,300
of white stuff here, that reflects that
白色，这反映出

230
00:07:31,540 --> 00:07:33,260
classifier thinks that it's found a bunch of texts.
分类器认为发现了一串文本。

231
00:07:33,520 --> 00:07:34,310
over there on the image.
在图上的这个位置

232
00:07:35,040 --> 00:07:35,700
What i have done on this
我在图片左下角

233
00:07:35,780 --> 00:07:36,870
image on the lower left is
做的操作是

234
00:07:37,070 --> 00:07:38,820
actually use white to
其实用白色

235
00:07:38,970 --> 00:07:41,050
show where the classifier thinks it has found text.
显示分类器认为它找到了文字

236
00:07:41,810 --> 00:07:43,280
And different shades of grey
深浅不同的灰色

237
00:07:43,880 --> 00:07:45,560
correspond to the probability that
对应分类器输出

238
00:07:45,670 --> 00:07:46,750
was output by the classifier,
的概率

239
00:07:47,110 --> 00:07:48,000
so like the shades of grey
所以类似灰色阴影

240
00:07:48,520 --> 00:07:49,860
corresponds to where it
对应它

241
00:07:49,930 --> 00:07:50,750
thinks it might have found text
认为它可能找到了文本

242
00:07:51,210 --> 00:07:53,900
but has lower confidence the bright
但是不够确定

243
00:07:54,260 --> 00:07:55,980
white response to whether the classifier,
亮的白色对应

244
00:07:57,440 --> 00:07:58,400
up with a very high
一个很高的

245
00:07:58,660 --> 00:08:00,470
probability, estimated probability of
概率，在那个位置有行人存在

246
00:08:00,630 --> 00:08:03,110
there being pedestrians in that location.
的概率估计

247
00:08:04,110 --> 00:08:05,270
We aren't quite done yet because
我们还没完全结束因为

248
00:08:05,690 --> 00:08:06,580
what we actually want to do
我们真正想做的

249
00:08:06,830 --> 00:08:08,620
is draw rectangles around all
是在所有文字周围

250
00:08:08,850 --> 00:08:09,780
the region where this text
绘制

251
00:08:10,490 --> 00:08:12,540
in the image, so were
矩形，所以

252
00:08:12,650 --> 00:08:13,540
going to take one more step
我们进一步

253
00:08:13,840 --> 00:08:14,990
which is we take the output
输出分类器的

254
00:08:15,230 --> 00:08:16,880
of the classifier and apply
输出和应用

255
00:08:17,290 --> 00:08:19,280
to it what is called an expansion operator.
所谓的膨胀算子

256
00:08:20,750 --> 00:08:22,250
So what that does is, it
那么它所做的是，它

257
00:08:22,430 --> 00:08:24,270
take the image here,
取这里的图，

258
00:08:25,450 --> 00:08:26,700
and it takes each of
它把每个白色块

259
00:08:26,800 --> 00:08:28,200
the white blobs, it takes each
它把每一个

260
00:08:28,270 --> 00:08:30,590
of the white regions and it expands that white region.
白色区域扩大。

261
00:08:31,460 --> 00:08:32,460
Mathematically, the way you
从数学上，你实现它的方式

262
00:08:32,610 --> 00:08:34,110
implement that is, if you
是，如果

263
00:08:34,270 --> 00:08:35,280
look at the image on the right, what
看右边的图像，

264
00:08:35,690 --> 00:08:36,780
we're doing to create the
我们正在创造的

265
00:08:36,930 --> 00:08:38,110
image on the right is, for every
右边的这个图像，对于每一个像素

266
00:08:38,370 --> 00:08:39,510
pixel we are going
我们将要

267
00:08:39,610 --> 00:08:40,790
to ask, is it withing
问，它是否和

268
00:08:41,370 --> 00:08:42,960
some distance of a
左边图中的白色像素

269
00:08:43,100 --> 00:08:44,650
white pixel in the left image.
有一段距离

270
00:08:45,430 --> 00:08:46,800
And so, if a specific pixel
如果是的话，如果一个特定的像素

271
00:08:47,220 --> 00:08:48,420
is within, say, five pixels
距离左边中的

272
00:08:48,950 --> 00:08:50,280
or ten pixels of a white
白色像素，比如

273
00:08:50,610 --> 00:08:52,310
pixel in the leftmost image, then
5个或者10个像素距离，然后

274
00:08:52,540 --> 00:08:55,020
we'll also color that pixel white in the rightmost image.
我们会在右边的图像中，把这个像素也染成白色

275
00:08:56,190 --> 00:08:57,010
And so, the effect of this
这样做的效果是

276
00:08:57,300 --> 00:08:58,350
is, we'll take each of the
把左边图中

277
00:08:58,730 --> 00:08:59,630
white blobs in the leftmost
每个白色区块

278
00:09:00,030 --> 00:09:01,370
image and expand them a
扩大一点

279
00:09:01,500 --> 00:09:02,200
bit, grow them a little
让其增长一点

280
00:09:02,670 --> 00:09:04,110
bit, by seeing whether the
通过检查

281
00:09:04,170 --> 00:09:05,420
nearby pixels, the white pixels,
邻近像素，白色像素

282
00:09:05,900 --> 00:09:07,980
and then coloring those nearby pixels in white as well.
然后把它们也染成白色

283
00:09:08,430 --> 00:09:09,900
Finally, we are just about done.
最后，我们就要完成了

284
00:09:10,180 --> 00:09:11,210
We can now look at this
现在我们可以看看这个

285
00:09:11,480 --> 00:09:12,900
right most image and just
最右边的图像就

286
00:09:13,210 --> 00:09:14,650
look at the connecting components
看连接组件

287
00:09:15,320 --> 00:09:16,700
and look at the as white
看白色

288
00:09:16,990 --> 00:09:19,350
regions and draw bounding boxes around them.
区域，在其周围绘制边框。

289
00:09:20,260 --> 00:09:20,990
And in particular, if we look at
特别的，如果我们看一看

290
00:09:21,390 --> 00:09:22,850
all the white regions, like
所有的白色区域，比如

291
00:09:23,080 --> 00:09:24,750
this one, this one, this
这个，这个，这个

292
00:09:24,990 --> 00:09:26,670
one, and so on, and
，等等

293
00:09:27,030 --> 00:09:27,810
if we use a simple heuristic
如果我们用一个简单的启发式方法

294
00:09:28,390 --> 00:09:30,240
to rule out rectangles whose aspect
来排除那些比例

295
00:09:30,660 --> 00:09:32,760
ratios look funny because we
滑稽的矩形，因为我们

296
00:09:32,870 --> 00:09:34,460
know that boxes around text
知道文本周围的框

297
00:09:34,730 --> 00:09:36,130
should be much wider than they are tall.
应该是宽远大于高

298
00:09:37,110 --> 00:09:38,310
And so if we ignore the
所以如果我们忽略

299
00:09:38,410 --> 00:09:39,990
thin, tall blobs like this one
窄的，高的图块，像这个，

300
00:09:40,230 --> 00:09:42,120
and this one, and
和这一个

301
00:09:42,190 --> 00:09:43,390
we discard these ones because
我们丢弃这些

302
00:09:43,880 --> 00:09:45,490
they are too tall and thin, and
因为它们太高，太窄

303
00:09:45,660 --> 00:09:46,780
we then draw a the rectangles
我们在那些

304
00:09:47,470 --> 00:09:48,440
around the ones whose aspect
比例符合文字的

305
00:09:48,840 --> 00:09:50,420
ratio thats a height
图块周围

306
00:09:50,610 --> 00:09:51,800
to what ratio looks like for
绘制矩形框

307
00:09:51,950 --> 00:09:53,310
text regions, then we
然后我们

308
00:09:53,380 --> 00:09:55,070
can draw rectangles, the bounding
可以绘制矩形，包围

309
00:09:55,450 --> 00:09:56,660
boxes around this text
在这些

310
00:09:56,970 --> 00:09:58,500
region, this text region, and
文字区域，

311
00:09:58,610 --> 00:10:00,550
that text region, corresponding to
分别对应

312
00:10:01,060 --> 00:10:02,180
the Lula B's antique mall logo,
Lula Bs 商场标志，

313
00:10:02,650 --> 00:10:04,690
the Lula B's, and this little open sign.
Lula B，以及这个小“Open”的标志

314
00:10:05,840 --> 00:10:06,000
Of over there.
那边

315
00:10:07,100 --> 00:10:09,550
This example by the actually misses one piece of text.
这个例子中其实遗漏了一段文字

316
00:10:09,860 --> 00:10:12,550
This is very hard to read, but there is actually one piece of text there.
这个不好读取，但事实上那边是有一块文字的

317
00:10:13,080 --> 00:10:14,710
That says [xx] are corresponding
这个[ XX ]对应

318
00:10:14,950 --> 00:10:16,180
to this but the aspect ratio
这个区域，但是纵横比

319
00:10:16,530 --> 00:10:17,960
looks wrong so we discarded that one.
看起来是错误，所以我们丢弃掉

320
00:10:19,100 --> 00:10:20,240
So you know it's ok
所以你知道它是好的

321
00:10:20,530 --> 00:10:21,460
on this image, but in
在这个图中，但在

322
00:10:21,660 --> 00:10:22,760
this particular example the classifier
这个特殊的例子中，分类器

323
00:10:23,290 --> 00:10:24,400
actually missed one piece of text.
遗漏了一段文字。

324
00:10:24,760 --> 00:10:25,780
It's very hard to read because
这是很难读的，因为

325
00:10:25,960 --> 00:10:26,900
there's a piece of text
有一段文字

326
00:10:27,240 --> 00:10:28,700
written against a transparent window.
写在了透明的窗户上

327
00:10:29,750 --> 00:10:31,200
So that's text detection
那么这就是文本检测

328
00:10:32,430 --> 00:10:33,120
using sliding windows.
使用滑动窗口

329
00:10:33,800 --> 00:10:35,300
And having found these rectangles
发现包含文字

330
00:10:36,100 --> 00:10:37,010
with the text in it, we
的矩形，我们

331
00:10:37,110 --> 00:10:38,240
can now just cut out
现在可以切出

332
00:10:38,450 --> 00:10:39,890
these image regions and then
这些图像区域，然后

333
00:10:40,070 --> 00:10:42,100
use later stages of pipeline to try to meet the texts.
使用后期的管道来满足文本。

334
00:10:45,390 --> 00:10:46,820
Now, you recall that the
现在，你回忆一下

335
00:10:46,880 --> 00:10:48,360
second stage of pipeline was
管道的第二步是

336
00:10:48,570 --> 00:10:50,620
character segmentation, so given an
字符分割，所以给一个

337
00:10:50,890 --> 00:10:52,530
image like that shown on top,
如上部所示的图片

338
00:10:52,790 --> 00:10:55,660
how do we segment out the individual characters in this image?
我们如何分割出此图像中的单个字符？

339
00:10:56,580 --> 00:10:57,460
So what we can do is
所以我们所能做的就是

340
00:10:57,910 --> 00:10:59,590
again use a supervised learning
再次使用有监督的学习

341
00:11:00,010 --> 00:11:01,020
algorithm with some set of
算法，以及一些

342
00:11:01,100 --> 00:11:01,990
positive and some set of
正例和反例

343
00:11:02,100 --> 00:11:03,810
negative examples, what were
我们将要做的是

344
00:11:03,880 --> 00:11:04,840
going to do is look in
检查这些

345
00:11:04,900 --> 00:11:06,160
the image patch and try
图像块并且尝试

346
00:11:06,390 --> 00:11:08,110
to decide if there
决定在图块的

347
00:11:08,370 --> 00:11:09,690
is split between two characters
的中间是否存在

348
00:11:10,700 --> 00:11:12,070
right in the middle of that image match.
两个字符的分割

349
00:11:13,030 --> 00:11:14,100
So for initial positive examples.
所以对于初始的正例

350
00:11:14,960 --> 00:11:17,040
This first cross example, this image
第一跨的样本，这个图块

351
00:11:17,290 --> 00:11:18,590
patch looks like the
看起来

352
00:11:18,650 --> 00:11:20,050
middle of it is indeed
中间的确

353
00:11:21,320 --> 00:11:22,890
the middle has splits between two
存在两个字符的分割

354
00:11:23,110 --> 00:11:24,120
characters and the second example
第二例

355
00:11:24,680 --> 00:11:25,770
again this looks like a
仍然看起来像一个

356
00:11:25,950 --> 00:11:27,370
positive example, because if I split
正面的样本，因为如果我

357
00:11:27,840 --> 00:11:29,020
two characters by putting a
在正中间放一条线

358
00:11:29,160 --> 00:11:31,190
line right down the middle, that's the right thing to do.
来分割字符，这样做是正确的。

359
00:11:31,350 --> 00:11:33,310
So, these are positive examples, where
等，这些都是正面的样本，

360
00:11:33,510 --> 00:11:35,370
the middle of the image represents
图像的中间表示

361
00:11:35,970 --> 00:11:36,930
a gap or a split
两个字符之间的

362
00:11:37,960 --> 00:11:40,320
between two distinct characters, whereas
沟壑或分割，而

363
00:11:40,560 --> 00:11:41,870
the negative examples, well, you
负面的样本，很好，你

364
00:11:42,010 --> 00:11:43,160
know, you don't want to split
知道，你不想分裂

365
00:11:43,690 --> 00:11:44,810
two characters right in the
把两个字从中间

366
00:11:44,900 --> 00:11:46,610
middle, and so
分割，所以

367
00:11:46,820 --> 00:11:48,160
these are negative examples because
这些都是负面的样本，因为

368
00:11:48,460 --> 00:11:50,660
they don't represent the midpoint between two characters.
不代表两个字符之间的中点。

369
00:11:51,760 --> 00:11:52,490
So what we will do
所以我们要做的

370
00:11:52,650 --> 00:11:53,940
is, we will train a classifier,
是：我们会训练一个分类器，

371
00:11:54,500 --> 00:11:55,910
maybe using new network, maybe
也许利用新的网络，也许

372
00:11:56,180 --> 00:11:58,000
using a different learning algorithm, to
使用不同的学习算法，

373
00:11:58,120 --> 00:12:01,420
try to classify between the positive and negative examples.
尝试将正面和负面的样本进行区分。

374
00:12:02,770 --> 00:12:03,980
Having trained such a classifier,
有这样一个分类器的训练，

375
00:12:04,320 --> 00:12:06,030
we can then run this on
我们可以在

376
00:12:06,690 --> 00:12:07,830
this sort of text that our
我们的文字检测系统中

377
00:12:07,940 --> 00:12:09,410
text detection system has pulled out.
分离出的文字上运行这个分类器

378
00:12:09,590 --> 00:12:10,970
As we start by looking at
我们先来看看

379
00:12:11,130 --> 00:12:12,080
that rectangle, and we ask,
那个矩形，我们问，

380
00:12:12,230 --> 00:12:13,280
"Gee, does it look
“哎呀，它看起来

381
00:12:13,510 --> 00:12:15,000
like the middle of
像那个绿色矩形的

382
00:12:15,100 --> 00:12:16,600
that green rectangle, does it
中间？，它

383
00:12:16,680 --> 00:12:18,470
look like the midpoint between two characters?".
看起来像两个字符之间的中点？”。

384
00:12:18,980 --> 00:12:20,220
And hopefully, the classifier will
希望，分类器

385
00:12:20,320 --> 00:12:21,760
say no, then we slide
说不，然后我们滑动

386
00:12:22,170 --> 00:12:23,280
the window over and this
窗户，这是

387
00:12:23,410 --> 00:12:24,850
is a one dimensional sliding
一维的滑动

388
00:12:25,200 --> 00:12:26,410
window classifier, because were
窗口分类器，因为

389
00:12:26,500 --> 00:12:27,820
going to slide the window only
将滑动窗口

390
00:12:28,470 --> 00:12:29,560
in one straight line from
在一条直线上

391
00:12:29,780 --> 00:12:32,070
left to right, theres no different rows here.
从左到右，这时没有不同的行

392
00:12:32,270 --> 00:12:34,420
There's only one row here.
只有一行

393
00:12:34,520 --> 00:12:36,160
But now, with the classifier in
但是，在这个位置时，

394
00:12:36,240 --> 00:12:37,250
this position, we ask, well,
我们使用分类器，我们问，好，

395
00:12:37,490 --> 00:12:38,700
should we split those two characters
我们应该分开这两个字符

396
00:12:39,570 --> 00:12:41,580
or should we put a split right down the middle of this rectangle.
或者我们应该在这个矩形的中间进行分割

397
00:12:41,950 --> 00:12:43,040
And hopefully, the classifier will
希望，分类器

398
00:12:43,190 --> 00:12:44,720
output y equals one, in
输出Y=1，在

399
00:12:44,780 --> 00:12:46,460
which case we will decide to
这种情况下我们会决定

400
00:12:46,630 --> 00:12:49,690
draw a line down there, to try to split two characters.
画一条线在那里，试图分割这两个字符

401
00:12:50,710 --> 00:12:51,620
Then we slide the window over
然后再次滑动窗口

402
00:12:51,870 --> 00:12:53,440
again, classifier says, don't
分类器说，不

403
00:12:53,650 --> 00:12:55,020
close the gap, slide over again,
关闭间隙，再次滑过，

404
00:12:55,300 --> 00:12:56,580
classifier says yes, do split
分类器说yes，进行分割

405
00:12:57,230 --> 00:12:58,830
there and so
等等

406
00:12:59,200 --> 00:13:00,410
on, and we slowly slide the
然后慢慢地滑动

407
00:13:00,560 --> 00:13:01,770
classifier over to the
分类器向

408
00:13:01,920 --> 00:13:03,310
right and hopefully it will
右边，希望它可以将

409
00:13:03,380 --> 00:13:05,160
classify this as another positive example and
这个作为另一个正面的样本，

410
00:13:05,770 --> 00:13:07,470
so on.
等等

411
00:13:08,010 --> 00:13:09,180
And we will slide this window
我们向右滑动每一步

412
00:13:09,820 --> 00:13:10,990
over to the right, running the
运行分类器

413
00:13:11,160 --> 00:13:12,670
classifier at every step, and
希望它可以

414
00:13:12,800 --> 00:13:13,800
hopefully it will tell us,
告诉我们，

415
00:13:14,210 --> 00:13:15,070
you know, what are the right locations
你知道，应该在哪里

416
00:13:16,190 --> 00:13:17,820
to split these characters up into,
对字符进行分割

417
00:13:18,290 --> 00:13:20,410
just split this image up into individual characters.
将这个图像分成单个字符

418
00:13:21,090 --> 00:13:22,450
And so thats 1D sliding
这就是一维滑动

419
00:13:22,810 --> 00:13:24,190
windows for character segmentation.
窗口字符分割

420
00:13:25,520 --> 00:13:28,430
So, here's the overall photo OCR pipe line again.
所以，这里是OCR照片管道的全部

421
00:13:29,120 --> 00:13:30,280
In this video we've talked about
在这段视频中我们已经谈到了

422
00:13:30,780 --> 00:13:32,170
the text detection step, where
文本检测步骤，其中

423
00:13:32,360 --> 00:13:34,570
we use sliding windows to detect text.
我们使用滑动窗口检测文本。

424
00:13:35,200 --> 00:13:36,390
And we also use a one-dimensional
我们也使用一个一维的

425
00:13:37,070 --> 00:13:38,420
sliding windows to do character
滑动窗来做

426
00:13:38,790 --> 00:13:40,160
segmentation to segment out,
字符分割

427
00:13:40,730 --> 00:13:42,860
you know, this text image in division of characters.
你知道，这个文本图像的字符分割。

428
00:13:43,900 --> 00:13:44,770
The final step through the
通过管道的最后一步

429
00:13:44,810 --> 00:13:46,040
pipeline is the character classification
是字符分类

430
00:13:46,720 --> 00:13:48,150
step and that step you might
步骤和这一步你或许

431
00:13:48,370 --> 00:13:49,750
already be much more familiar
更加熟悉

432
00:13:50,020 --> 00:13:51,490
with the early videos
与早期的关于

433
00:13:52,080 --> 00:13:54,470
on supervised learning
监督学习的视频

434
00:13:55,170 --> 00:13:56,440
where you can apply a standard
你可以应用一个标准的

435
00:13:56,940 --> 00:13:58,150
supervised learning within maybe
监督学习也许

436
00:13:58,360 --> 00:13:59,250
on your network or maybe something
在你的网络上或者一些

437
00:13:59,570 --> 00:14:00,650
else in order to
其他的，为了

438
00:14:00,860 --> 00:14:02,100
take it's input, an image
把它的输入，类似

439
00:14:02,980 --> 00:14:05,030
like that and classify which alphabet
这样的图片，将其

440
00:14:05,480 --> 00:14:07,120
or which 26 characters A
分类到26个字母

441
00:14:07,230 --> 00:14:08,320
to Z, or maybe we should
或者36个，

442
00:14:08,570 --> 00:14:09,670
have 36 characters if you
如果包含

443
00:14:09,780 --> 00:14:11,140
have the numerical digits as
数字的话

444
00:14:11,270 --> 00:14:12,650
well, the multi class
多

445
00:14:13,080 --> 00:14:14,410
classification problem where you
分类器问题

446
00:14:14,510 --> 00:14:15,690
take it's input and image
它的输入

447
00:14:16,050 --> 00:14:17,390
contained a character and decide
一个包含文字的图片

448
00:14:18,140 --> 00:14:20,450
what is the character that appears in that image?
出现在图像中的字符是什么？

449
00:14:21,080 --> 00:14:22,460
So that was the photo OCR
那么这就是照片OCR

450
00:14:23,730 --> 00:14:24,750
pipeline and how you can
以及你如何

451
00:14:24,910 --> 00:14:26,140
use ideas like sliding windows
使用滑动窗口

452
00:14:26,520 --> 00:14:27,960
classifiers in order to
分类器来

453
00:14:28,100 --> 00:14:29,790
put these different components to
使用这些不同的组件

454
00:14:30,060 --> 00:14:31,570
develop a photo OCR system.
开发照片OCR系统

455
00:14:32,430 --> 00:14:33,570
In the next few videos we
在接下来的几个视频我们

456
00:14:33,680 --> 00:14:34,930
keep on using the problem of
继续使用照片OCR问题

457
00:14:35,150 --> 00:14:36,550
photo OCR to explore somewhat
探索一些

458
00:14:36,960 --> 00:14:39,070
interesting issues surrounding building an application like this.
有趣的问题，建立类似的应用。