18 - 2 - Sliding Windows (15 min).srt
1
00:00:00,370 --> 00:00:01,590
In the previous video, we talked
在上一个视频中,我们讨论了(字幕翻译:中国海洋大学,王玺)
2
00:00:01,890 --> 00:00:04,570
about the photo OCR pipeline and how that worked.
OCR管道及其工作原理
3
00:00:05,480 --> 00:00:06,370
In which we would take an image
在OCR管道中我们可以取一张图
4
00:00:07,050 --> 00:00:08,070
and pass the image through a
(pass the image,不懂)通过
5
00:00:08,130 --> 00:00:10,010
sequence of machine learning
一系列的机器学习
6
00:00:10,280 --> 00:00:11,680
components in order to
组件来
7
00:00:11,890 --> 00:00:13,820
try to read the text that appears in an image.
尝试读取图片上的文字
8
00:00:14,590 --> 00:00:15,820
In this video I'd like to tell you
今天的视频里,我打算讲
9
00:00:16,210 --> 00:00:17,360
a little bit more about how the
多一点关于流水线
10
00:00:17,780 --> 00:00:20,310
individual components of the pipeline work.
的每个组件的工作原理
11
00:00:21,270 --> 00:00:24,070
In particular, most of this video will center around the discussion
特别的,本视频将会把重点放在讨论
12
00:00:24,680 --> 00:00:25,950
of what's called sliding windows.
滑动窗口上
13
00:00:26,750 --> 00:00:31,570
The first stage
滤波的
14
00:00:32,000 --> 00:00:33,390
of the filter was the
第一步是
15
00:00:33,730 --> 00:00:35,090
text detection, where we look
确定文字位置,例如我们现在
16
00:00:35,330 --> 00:00:36,640
at an image like this and try
看到的这一幅图片,尝试去
17
00:00:37,020 --> 00:00:39,320
to find the regions of text that appear in this image.
找到图片中文字出现的区域
18
00:00:39,850 --> 00:00:42,490
Text detection is an unusual problem in computer vision.
文字识别对于计算机来说,是一个不寻常的问题
19
00:00:43,220 --> 00:00:44,820
Because depending on the length
因为根据你要尝试
20
00:00:45,140 --> 00:00:46,150
of the text you're trying to
找到的文字的长度
21
00:00:46,290 --> 00:00:47,870
find, these rectangles that you're
这些你要寻找的矩形
22
00:00:47,970 --> 00:00:49,600
trying to find can have different aspect ratios.
具有不同的长宽比例
23
00:00:51,100 --> 00:00:52,060
So in order to talk
所以为了讲述如何
24
00:00:52,220 --> 00:00:53,550
about detecting things in images
在图片中发现事物
25
00:00:54,300 --> 00:00:55,860
let's start with a simpler example
我们首先从一个简单点的例子开始
26
00:00:56,550 --> 00:01:00,080
of pedestrian detection and we'll then later go back to apply
即行人检测,然后我们讲如何将
27
00:01:00,460 --> 00:01:02,300
ideas that were developed
行人检测中的思路用到
28
00:01:02,570 --> 00:01:04,840
in pedestrian detection and apply them to text detection.
文字识别中去
29
00:01:06,280 --> 00:01:08,010
So in pedestrian detection you want
在行人检测中
30
00:01:08,360 --> 00:01:09,440
to take an image that looks
你取一张类似这样的图片
31
00:01:09,600 --> 00:01:11,010
like this and find the
目的就是寻找
32
00:01:11,160 --> 00:01:12,920
individual pedestrians that appear in the image.
图片中的行人
33
00:01:13,260 --> 00:01:14,440
So there's one pedestrian that we
我们找到一个人,
34
00:01:14,520 --> 00:01:15,550
found, there's a second
两个,
35
00:01:15,780 --> 00:01:17,920
one, a third one, a fourth one, a fifth one.
三个,四个,五个
36
00:01:18,290 --> 00:01:19,390
And a sixth one.
六个
37
00:01:19,560 --> 00:01:20,990
This problem is maybe slightly
这个问题与文字识别相比,
38
00:01:21,320 --> 00:01:22,770
simpler than text detection just
简单的地方在于:
39
00:01:23,100 --> 00:01:24,200
for the reason that the aspect
你要识别的东西
40
00:01:24,560 --> 00:01:27,490
ratio of most pedestrians is pretty similar.
具有相似的长宽比
41
00:01:28,170 --> 00:01:29,280
So we can just use a fixed aspect
仅仅使用一个固定
42
00:01:29,630 --> 00:01:31,960
ratio for these rectangles that we're trying to find.
的长宽比就基本可以了
43
00:01:32,420 --> 00:01:33,610
So by aspect ratio I mean
aspect ratio意思是
44
00:01:33,920 --> 00:01:36,420
the ratio between the height and the width of these rectangles.
矩形的高度和宽度之比
45
00:01:37,820 --> 00:01:38,190
They're all the same
它们都是一样的
46
00:01:38,650 --> 00:01:40,120
for different pedestrians, but for
对于不同的行人来说,但是
47
00:01:40,490 --> 00:01:42,650
text detection the height
对于文字来说
48
00:01:43,030 --> 00:01:44,560
and width ratio is different
不同行的文字
49
00:01:44,960 --> 00:01:45,830
for different lines of text.
具有不同的比例
50
00:01:46,460 --> 00:01:47,940
Although for pedestrian detection, the
对于行人检测,尽管行人距离
51
00:01:48,020 --> 00:01:49,250
pedestrians can be different distances
摄像头的距离可能
52
00:01:49,810 --> 00:01:51,250
away from the camera and
不同,因此
53
00:01:51,390 --> 00:01:52,730
so the height of these rectangles
矩形的高度
54
00:01:53,380 --> 00:01:55,600
can be different depending on how far away they are,
不一致
55
00:01:55,990 --> 00:01:57,090
but the aspect ratio is the same.
但比例还是维持不变的
56
00:01:57,720 --> 00:01:58,880
In order to build a pedestrian
为了建立一个行人检测系统
57
00:01:59,440 --> 00:02:02,460
detection system, here's how you can go about it.
你需要这么做
58
00:02:02,520 --> 00:02:03,650
Let's say that we decide to
例如我们决定要
59
00:02:03,970 --> 00:02:06,100
standardize on this aspect
使用82*36的比例
60
00:02:06,690 --> 00:02:08,010
ratio of 82 by 36
来进行标准化
61
00:02:08,180 --> 00:02:10,040
and we could
当然我们也可以
62
00:02:10,330 --> 00:02:11,510
have chosen some rounded number
选择一些近似的数字
63
00:02:12,020 --> 00:02:14,000
like 80 by 40 or something, but 82 by 36 seems alright.
比如80*40,但82*36看上去是可行的
64
00:02:16,110 --> 00:02:17,280
What we would do is then go
我们将要做的是
65
00:02:17,650 --> 00:02:20,420
out and collect large training sets of positive and negative examples.
出去搜集一些正例和反例
66
00:02:21,240 --> 00:02:22,790
Here are examples of 82
这里有一些
67
00:02:22,900 --> 00:02:24,230
by 36 image patches that do
符合比例的图片
68
00:02:24,360 --> 00:02:26,230
contain pedestrians and here are
以及一些不符合比例
69
00:02:26,550 --> 00:02:28,360
examples of images that do not.
的图片
70
00:02:29,470 --> 00:02:30,710
On this slide I show 12
在这个幻灯片里我展示了12个
71
00:02:31,050 --> 00:02:33,170
positive examples of y=1
正例,用y=1表示
72
00:02:33,730 --> 00:02:34,990
and 12 examples of y=0.
12个反例用y=0表示
73
00:02:36,410 --> 00:02:37,790
In a more typical pedestrian detection
在一个更典型的行人检测应用中,
74
00:02:38,180 --> 00:02:39,200
application, we may have
我们可以会有
75
00:02:39,500 --> 00:02:40,880
anywhere from 1,000 training
从1000
76
00:02:41,230 --> 00:02:42,210
examples up to maybe
到10000
77
00:02:42,300 --> 00:02:44,410
10,000 training examples, or
个数目的例子,或者
78
00:02:44,460 --> 00:02:45,360
even more if you can
更多,如果你能够
79
00:02:45,510 --> 00:02:47,180
get even larger training sets.
获取到更大的训练集合
80
00:02:47,460 --> 00:02:48,590
And what you can do is then train
然后,你可以
81
00:02:48,910 --> 00:02:50,160
a neural network or some
在你的网络中训练,或者使用
82
00:02:50,510 --> 00:02:52,420
other learning algorithm to
其他学习算法
83
00:02:52,610 --> 00:02:54,570
take this input, an image
来接收这个输入,一个82*36
84
00:02:54,970 --> 00:02:56,710
patch of dimension 82 by
的小图块
85
00:02:56,850 --> 00:02:59,180
36, and to classify 'y'
来划分y
86
00:02:59,710 --> 00:03:01,070
and to classify that image patch
来划分每个图块是否
87
00:03:01,510 --> 00:03:03,850
as either containing a pedestrian or not.
包含一个行人
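The training step described above can be sketched in a few lines. This is a minimal illustration, not the course's implementation: it swaps in plain logistic regression for the neural network, uses tiny 2 x 2 toy patches instead of real 82 x 36 pedestrian crops, and the names `flatten`, `train_logistic`, and `classify` are hypothetical.

```python
import math

def flatten(patch):
    """Turn a 2-D grayscale patch (rows of 0..255 ints) into a feature vector in [0, 1]."""
    return [pixel / 255.0 for row in patch for pixel in row]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(patches, labels, lr=0.5, epochs=200):
    """Fit weights w and bias b with stochastic gradient descent on the logistic loss."""
    X = [flatten(p) for p in patches]
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(X, labels):
            p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            err = p - y  # gradient of the loss with respect to the pre-activation
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def classify(patch, w, b):
    """Return 1 (object present) or 0 (absent) for a patch of the training size."""
    x = flatten(patch)
    return 1 if sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) >= 0.5 else 0

# Toy stand-in data (an assumption): "bright" patches are y=1, "dark" patches are y=0.
toy_patches = [[[255, 255], [255, 255]], [[0, 0], [0, 0]]]
toy_labels = [1, 0]
w, b = train_logistic(toy_patches, toy_labels)
print(classify([[200, 220], [210, 230]], w, b))
print(classify([[30, 20], [10, 40]], w, b))
```

With real data, each training example would be an 82 x 36 patch, so the feature vector would have 82 * 36 = 2,952 entries.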
88
00:03:05,250 --> 00:03:06,250
So this gives you a way
So 这给了你一个
89
00:03:06,470 --> 00:03:08,050
of applying supervised learning in
应用监督学习的方法
90
00:03:08,210 --> 00:03:09,290
order to take an image
来对一个图块进行处理
91
00:03:09,530 --> 00:03:12,420
patch and determine whether or not a pedestrian appears in that image patch.
判断其是否包含有行人
92
00:03:14,310 --> 00:03:15,190
Now, let's say we get
现在,假设我们得到
93
00:03:15,400 --> 00:03:16,520
a new image, a test set
一个新的图片,一个测试集合
94
00:03:16,850 --> 00:03:17,920
image like this and we
图片(类似这个)
95
00:03:18,030 --> 00:03:20,240
want to try to find the pedestrians pictured in this image.
我们尝试寻找一个行人的图片
96
00:03:21,520 --> 00:03:22,340
What we would do is start
我们首先
97
00:03:22,670 --> 00:03:25,140
by taking a rectangular patch of this image.
在图片中选取一个矩形块
98
00:03:25,580 --> 00:03:26,800
Like that shown up here, so
像这里标注的,
99
00:03:26,900 --> 00:03:27,930
that's maybe an 82 by
例如这是图片中的一个
100
00:03:28,010 --> 00:03:29,440
36 patch of this image,
82*36的图块
101
00:03:30,270 --> 00:03:31,530
and run that image patch through
在我们的分类器里
102
00:03:31,830 --> 00:03:33,660
our classifier to determine whether
运行这个图块,验证
103
00:03:33,840 --> 00:03:34,900
or not there is a
是否图块中
104
00:03:34,980 --> 00:03:36,310
pedestrian in that image patch,
是否有行人
105
00:03:36,620 --> 00:03:38,100
and hopefully our classifier will
期望我们的分类器返回
106
00:03:38,260 --> 00:03:40,600
return y equals 0 for that patch, since there is no pedestrian.
0或者1,对应是否有行人
107
00:03:42,020 --> 00:03:42,900
Next, we then take that green
接下来,我们将
108
00:03:43,140 --> 00:03:44,380
rectangle and we slide it
绿色矩形
109
00:03:44,490 --> 00:03:45,680
over a bit and then
滑动一点
110
00:03:45,940 --> 00:03:47,180
run that new image patch
然后通过
111
00:03:47,560 --> 00:03:49,700
through our classifier to decide if there's a pedestrian there.
我们的分类器来决定是否有行人。
112
00:03:50,760 --> 00:03:51,740
And having done that, we then
完成后,我们
113
00:03:51,920 --> 00:03:53,070
slide the window further to the
滑动窗口向右
114
00:03:53,160 --> 00:03:54,160
right and run that patch
再次
115
00:03:54,420 --> 00:03:56,690
through the classifier again.
运行分类器
116
00:03:56,970 --> 00:03:57,850
The amount by which you shift
每次矩形
117
00:03:58,280 --> 00:03:59,770
the rectangle over each time
移动距离
118
00:04:00,260 --> 00:04:01,720
is a parameter, that's sometimes
是一个参数,有时
119
00:04:02,190 --> 00:04:04,000
called the step size
称之为步长
120
00:04:04,070 --> 00:04:06,020
parameter, sometimes also called
有时也被称为
121
00:04:06,380 --> 00:04:08,970
the stride parameter, and if
滑动参数,如果
122
00:04:09,120 --> 00:04:11,050
you step this one pixel at a time,
你一次移动一个像素
123
00:04:11,210 --> 00:04:12,020
so you use a step size
所以你可以使用步长
124
00:04:12,360 --> 00:04:14,020
or stride of 1, that usually
为1,通常
125
00:04:14,340 --> 00:04:15,560
performs best, but is
表现最好,但
126
00:04:15,700 --> 00:04:16,960
more computationally expensive, and
计算成本较高,如果
127
00:04:17,430 --> 00:04:18,940
so using a step size of
使用步长
128
00:04:19,090 --> 00:04:20,010
maybe 4 pixels at a
为4像素
129
00:04:20,210 --> 00:04:20,970
time, or eight pixels at a
或8像素
130
00:04:21,250 --> 00:04:22,350
time, or some larger number of
或一些更大的数
131
00:04:22,550 --> 00:04:23,600
pixels might be more common,
可能更常见
132
00:04:24,010 --> 00:04:25,320
since you're then moving the
因为你每次
133
00:04:25,430 --> 00:04:26,570
rectangle a little bit
你移动的距离
134
00:04:26,700 --> 00:04:28,570
more each time.
可以更大
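To make the cost trade-off concrete, the sketch below counts how many window positions the classifier has to evaluate at different step sizes. The 480 x 640 image size and the helper name `count_windows` are assumptions chosen for illustration; only the 82 by 36 window comes from the lecture.

```python
def count_windows(image_h, image_w, win_h=82, win_w=36, stride=1):
    """Number of win_h x win_w window positions that fit fully inside the image."""
    if image_h < win_h or image_w < win_w:
        return 0
    rows = (image_h - win_h) // stride + 1
    cols = (image_w - win_w) // stride + 1
    return rows * cols

# Hypothetical 480 x 640 image, single window scale.
for stride in (1, 4, 8):
    print(stride, count_windows(480, 640, stride=stride))
# 1 241395
# 4 15200
# 8 3800
```

Going from a stride of 1 to a stride of 4 cuts the number of classifier evaluations by roughly a factor of 16, because the step is applied in both the horizontal and the vertical direction.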
135
00:04:28,870 --> 00:04:30,090
So, using this process, you continue
所以,使用这个程序,你继续
136
00:04:30,870 --> 00:04:32,310
stepping the rectangle over to
向右移动矩形
137
00:04:32,340 --> 00:04:33,160
the right a bit at a
每次一点点距离
138
00:04:33,370 --> 00:04:34,450
time and running each of
然后运行分类器
139
00:04:34,520 --> 00:04:35,780
these patches through a classifier,
对图块进行分类
140
00:04:36,620 --> 00:04:38,220
until eventually, as you
直到最后,随着
141
00:04:38,900 --> 00:04:42,080
slide this window over the
你在图片的不同位置
142
00:04:42,150 --> 00:04:43,340
different locations in the image,
滑动这个矩形
143
00:04:43,550 --> 00:04:44,680
first starting with the first
首先从第一行
144
00:04:44,850 --> 00:04:46,080
row and then we
然后我们
145
00:04:46,160 --> 00:04:47,580
go to further rows in
滑动到下一行
146
00:04:47,710 --> 00:04:49,100
the image, you would
你使用某个某个步长
147
00:04:49,290 --> 00:04:50,490
then run all of
对这些不同的图块
148
00:04:50,550 --> 00:04:52,070
these different image patches at
应用某个步长
149
00:04:52,240 --> 00:04:53,330
some step size or some
通过分类器
150
00:04:53,430 --> 00:04:54,990
stride through your classifier.
进行分类
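The row-by-row scan just described can be sketched as follows. This is an illustrative outline under stated assumptions: the image is a list of rows of grayscale values, and `classify` is a hypothetical stand-in for whatever trained classifier is available; it must return 1 or 0 for a patch of the window size.

```python
def sliding_windows(image_h, image_w, win_h=82, win_w=36, stride=4):
    """Yield (top, left) corners of every window, row by row, left to right."""
    for top in range(0, image_h - win_h + 1, stride):
        for left in range(0, image_w - win_w + 1, stride):
            yield top, left

def crop(image, top, left, win_h, win_w):
    """Cut a win_h x win_w patch out of an image stored as a list of rows."""
    return [row[left:left + win_w] for row in image[top:top + win_h]]

def detect(image, classify, win_h=82, win_w=36, stride=4):
    """Run the classifier on every window and collect the corners where it fires."""
    image_h, image_w = len(image), len(image[0])
    hits = []
    for top, left in sliding_windows(image_h, image_w, win_h, win_w, stride):
        if classify(crop(image, top, left, win_h, win_w)) == 1:
            hits.append((top, left))
    return hits

# Toy usage: a 6 x 6 image with one bright 2 x 2 block, and a stand-in
# "classifier" that fires only when every pixel in the patch is bright.
image = [[0] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (2, 3):
        image[r][c] = 255
all_bright = lambda patch: 1 if all(p == 255 for row in patch for p in row) else 0
print(detect(image, all_bright, win_h=2, win_w=2, stride=2))  # [(2, 2)]
```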
151
00:04:56,990 --> 00:04:57,870
Now, that was a pretty
现在,这是一个相当
152
00:04:57,970 --> 00:04:59,870
small rectangle, that would only
小的矩形,这只会
153
00:05:00,310 --> 00:05:02,310
detect pedestrians of one specific size.
检测一个特定大小的行人。
154
00:05:02,780 --> 00:05:04,210
What we do next is
接下来我们做什么
155
00:05:04,470 --> 00:05:05,990
start to look at larger image patches.
开始使用更大的图块
156
00:05:06,730 --> 00:05:08,270
So now let's take larger image
现在让我们以更大的图片
157
00:05:08,610 --> 00:05:09,700
patches, like those shown here
块,如图
158
00:05:10,310 --> 00:05:11,960
and run those through the classifier as well.
然后也通过分类器运行
159
00:05:13,540 --> 00:05:14,320
And by the way when I say
当我说
160
00:05:14,600 --> 00:05:15,830
take a larger image patch, what
以较大的图块,
161
00:05:16,080 --> 00:05:17,780
I really mean is when you
我的意思是当你
162
00:05:17,860 --> 00:05:18,850
take an image patch like this,
选取这样的图块,
163
00:05:19,490 --> 00:05:20,720
what you're really doing is taking
你真正做的是
164
00:05:20,880 --> 00:05:22,110
that image patch, and resizing
选择图像块,并调整大小
165
00:05:22,800 --> 00:05:24,750
it down to 82 by 36, say.
下降到82×36,
166
00:05:25,000 --> 00:05:26,260
So you take this larger
所以你拿这个更大的
167
00:05:26,550 --> 00:05:28,180
patch and re-size it to
块和调整其大小
168
00:05:28,300 --> 00:05:29,800
be a smaller image and then
成为更小的图,然后
169
00:05:29,970 --> 00:05:31,260
the smaller re-sized image
用这个图块
170
00:05:31,600 --> 00:05:32,620
that is what you
在分类器中
171
00:05:32,990 --> 00:05:35,340
would pass through your classifier to try and decide if there is a pedestrian in that patch.
运行,然后决定是否有行人。
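The resizing step can be illustrated with a simple nearest-neighbor downsample. This is only a sketch: a real system would more likely call an image library with proper interpolation, and the function name `resize_nearest` is hypothetical.

```python
def resize_nearest(patch, out_h=82, out_w=36):
    """Resize a patch (list of rows) to out_h x out_w by nearest-neighbor sampling."""
    in_h, in_w = len(patch), len(patch[0])
    return [
        [patch[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]

# A window twice the base size (164 x 72) shrinks back to 82 x 36,
# so the same classifier can be reused at every scale.
big_patch = [[(r, c) for c in range(72)] for r in range(164)]
small = resize_nearest(big_patch)
print(len(small), len(small[0]))  # 82 36
```

Because the larger window keeps the same 82:36 aspect ratio, the downsample shrinks both dimensions by the same factor and does not distort the pedestrian.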
172
00:05:37,230 --> 00:05:38,310
And finally you can do
最后你可以
173
00:05:38,470 --> 00:05:39,530
this at even larger
在一个更大
174
00:05:39,930 --> 00:05:41,870
scales and run
规模做这一步
175
00:05:42,080 --> 00:05:43,830
that sliding window to
运行滑动窗口知直到
176
00:05:43,980 --> 00:05:45,920
the end. And after
结束,经过
177
00:05:45,980 --> 00:05:47,480
this whole process hopefully your algorithm
这个过程,希望你的算法
178
00:05:48,040 --> 00:05:49,670
will detect whether there are pedestrians
将检测到是否有行人
179
00:05:50,140 --> 00:05:52,070
appearing in the image. So
在图中出现,所以
180
00:05:52,470 --> 00:05:53,850
that's how you train a
这就是你如何训练一个
181
00:05:54,290 --> 00:05:55,630
classifier, and then
分类器,然后
182
00:05:55,890 --> 00:05:57,360
use a sliding windows classifier,
使用滑动窗口分类,
183
00:05:57,920 --> 00:05:59,820
or use a sliding windows detector in
或使用一个滑动窗口检测器
184
00:05:59,970 --> 00:06:01,740
order to find pedestrians in the image.
去寻找图像中的行人。
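Putting the pieces together, a multi-scale sliding windows detector might look like the sketch below. It is a hedged outline rather than a definitive implementation: the integer `scales`, the nearest-neighbor resize, and the `classify` callback are all assumptions made for illustration.

```python
def resize_nearest(patch, out_h, out_w):
    """Nearest-neighbor resize of a patch stored as a list of rows."""
    in_h, in_w = len(patch), len(patch[0])
    return [
        [patch[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]

def detect_multiscale(image, classify, base_h=82, base_w=36,
                      scales=(1, 2, 3), stride=4):
    """Slide windows of several sizes over the image; return (top, left, h, w) hits."""
    image_h, image_w = len(image), len(image[0])
    detections = []
    for scale in scales:
        # Every window keeps the base aspect ratio; only its size changes.
        win_h, win_w = base_h * scale, base_w * scale
        for top in range(0, image_h - win_h + 1, stride):
            for left in range(0, image_w - win_w + 1, stride):
                patch = [row[left:left + win_w] for row in image[top:top + win_h]]
                # Shrink the patch to the size the classifier was trained on.
                small = resize_nearest(patch, base_h, base_w)
                if classify(small) == 1:
                    detections.append((top, left, win_h, win_w))
    return detections

# Toy usage with a 2 x 1 base window on a 4 x 4 image whose right half is "on".
image = [[0, 0, 1, 1] for _ in range(4)]
all_on = lambda patch: 1 if all(p == 1 for row in patch for p in row) else 0
hits = detect_multiscale(image, all_on, base_h=2, base_w=1, scales=(1, 2), stride=1)
print((0, 2, 4, 2) in hits)  # True: the larger 4 x 2 window also fires
```

In practice, overlapping hits on the same pedestrian at nearby positions and scales would still need to be merged; that step is outside what this part of the lecture covers.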
185
00:06:03,070 --> 00:06:04,050
Let's now turn to the
让我们转向
186
00:06:04,150 --> 00:06:05,910
text detection example and talk
文本检测的例子,讨论
187
00:06:06,100 --> 00:06:07,490
about that stage in our
那个阶段,在我们
188
00:06:07,790 --> 00:06:09,330
photo OCR pipeline, where our
的照片OCR管道,我们
189
00:06:09,570 --> 00:06:11,340
goal is to find the text regions in the image.
的目标是找到一个个的文本区域。
190
00:06:13,250 --> 00:06:15,010
Similar to pedestrian detection, you
与行人检测类似,你
191
00:06:15,250 --> 00:06:16,730
can come up with a labeled
能拿到具有标签的
192
00:06:17,030 --> 00:06:18,410
training set with positive examples
的正例集合
193
00:06:19,060 --> 00:06:20,930
and negative examples with examples
和负例集合
194
00:06:21,530 --> 00:06:23,810
corresponding to regions where text appears.
对应文字出现的区域
195
00:06:24,300 --> 00:06:27,290
So instead of trying to detect pedestrians, we're now trying to detect text.
所以不再进行行人检测,我们现在尝试检测文本。
196
00:06:28,130 --> 00:06:29,670
And so positive examples are going
正面的样本是
197
00:06:29,770 --> 00:06:31,640
to be patches of images where there is text.
具有文字的图块
198
00:06:31,970 --> 00:06:33,330
And negative examples are going
负面的样本是
199
00:06:33,380 --> 00:06:36,000
to be patches of images where there isn't text.
没有文字的
200
00:06:36,330 --> 00:06:37,530
Having trained this we can
训练完之后我们