15 - 4 - Developing and Evaluating an Anomaly Detection System (13 min).srt
1
00:00:00,120 --> 00:00:01,220
In the last video, we developed
在上一段视频中
(字幕整理:中国海洋大学 黄海广,haiguang2000@qq.com )
2
00:00:01,850 --> 00:00:03,200
an anomaly detection algorithm.
我们推导了异常检测算法
3
00:00:04,150 --> 00:00:05,240
In this video, I'd like to
在这段视频中
4
00:00:05,300 --> 00:00:06,870
talk about the process of how
我想介绍一下
5
00:00:07,090 --> 00:00:08,750
to go about developing a specific
如何开发一个
6
00:00:09,060 --> 00:00:10,790
application of anomaly detection
关于异常检测的应用
7
00:00:11,410 --> 00:00:12,810
to a problem and in particular
来解决一个实际问题
8
00:00:13,470 --> 00:00:14,500
this will focus on the problem
具体来说
9
00:00:15,090 --> 00:00:18,700
of how to evaluate an anomaly detection algorithm. In
我们将重点关注如何评价一个异常检测算法
10
00:00:18,880 --> 00:00:20,490
previous videos, we've already talked
在前面的视频中 我们已经提到了
11
00:00:20,800 --> 00:00:22,380
about the importance of real
使用实数评价法的重要性
12
00:00:22,570 --> 00:00:24,770
number evaluation and this captures the idea that
这样做的想法是
13
00:00:25,170 --> 00:00:26,810
when you're trying to develop
当你在用某个学习算法
14
00:00:27,270 --> 00:00:28,460
a learning algorithm for a
来开发一个具体的
15
00:00:28,690 --> 00:00:30,300
specific application, you need to
机器学习应用时
16
00:00:30,560 --> 00:00:31,540
often make a lot of choices
你常常需要做出很多决定
17
00:00:31,710 --> 00:00:34,410
like, you know, choosing what features to use and then so on.
比如说 选择用什么样的特征 等等
18
00:00:35,010 --> 00:00:36,800
And making decisions about all
而如果你找到某种
19
00:00:36,880 --> 00:00:38,540
of these choices is often much
评价算法的方式
20
00:00:38,780 --> 00:00:39,890
easier if you have
直接返回一个数字
21
00:00:40,040 --> 00:00:41,330
a way to evaluate your learning
来告诉你算法的好坏
22
00:00:41,410 --> 00:00:43,190
algorithm that just gives you back a number.
那么你做这些决定就显得更容易了
23
00:00:44,200 --> 00:00:44,950
So if you're trying to decide,
所以比如你要决定
24
00:00:45,980 --> 00:00:47,130
you know, I have an idea for
现在有一个额外的特征
25
00:00:47,220 --> 00:00:49,730
one extra feature, do I include this feature or not.
我要不要把这个特征考虑进来?
26
00:00:50,560 --> 00:00:51,560
If you can run the algorithm
如果你带上这个特征
27
00:00:51,760 --> 00:00:52,830
with the feature, and run the
运行你的算法
28
00:00:52,960 --> 00:00:54,420
algorithm without the feature, and
再去掉这个特征运行你的算法
29
00:00:54,570 --> 00:00:55,960
just get back a number that
然后得到某个返回的数字
30
00:00:56,100 --> 00:00:57,350
tells you, you know, did
这个数字就直接告诉你
31
00:00:57,460 --> 00:01:00,070
it improve or worsen performance to add this feature?
这个特征到底是让算法表现变好了还是变差了
32
00:01:00,670 --> 00:01:01,480
Then it gives you a much better
这样 你就有了一种更好
33
00:01:01,670 --> 00:01:04,370
way, a much simpler way, with which
更简单的方法
34
00:01:04,590 --> 00:01:06,110
to decide whether or not to include that feature.
来确定是不是应该加上这个特征
35
00:01:07,570 --> 00:01:09,010
So in order to be
为了更快地
36
00:01:09,200 --> 00:01:10,850
able to develop an anomaly
开发出一个
37
00:01:11,410 --> 00:01:13,880
detection system quickly, it would
异常检测系统
38
00:01:14,080 --> 00:01:14,960
be really helpful to have
那么最好能找到某种
39
00:01:15,150 --> 00:01:17,820
a way of evaluating an anomaly detection system.
评价异常检测系统的方法
40
00:01:19,260 --> 00:01:20,420
In order to do this,
为了做到这一点
41
00:01:20,790 --> 00:01:22,380
in order to evaluate an anomaly
为了能评价一个
42
00:01:22,730 --> 00:01:24,080
detection system, we're
异常检测系统
43
00:01:24,310 --> 00:01:26,380
actually going to assume we have some labeled data.
我们先假定已有了一些带标签的数据
44
00:01:27,270 --> 00:01:28,270
So, so far, we'll be treating
所以 我们要考虑的
45
00:01:28,420 --> 00:01:29,870
anomaly detection as an
异常检测问题
46
00:01:30,310 --> 00:01:31,770
unsupervised learning problem, using
是一个非监督问题
47
00:01:32,210 --> 00:01:33,560
unlabeled data.
使用的是无标签数据
48
00:01:34,010 --> 00:01:35,190
But if you have some labeled
但如果你有一些
49
00:01:35,560 --> 00:01:37,390
data that specifies what
带标签的数据
50
00:01:37,700 --> 00:01:39,570
are some anomalous examples, and
能够指明哪些是异常样本
51
00:01:39,670 --> 00:01:42,030
what are some non-anomalous examples, then
哪些是非异常样本
52
00:01:42,470 --> 00:01:43,350
this is what we actually
那么这就是我们要找的
53
00:01:43,630 --> 00:01:45,670
think of as the standard way of evaluating an anomaly detection algorithm.
能够评价异常检测算法的标准方法
54
00:01:45,820 --> 00:01:49,020
So taking the
还是以
55
00:01:49,300 --> 00:01:50,580
aircraft engine example again.
飞机发动机为例
56
00:01:51,010 --> 00:01:52,680
Let's say that, you know, we have some
现在假如你有了一些
57
00:01:53,070 --> 00:01:55,840
labeled data of just a few anomalous
带标签数据
58
00:01:56,330 --> 00:01:57,890
examples of some aircraft engines
也就是有异常的飞机引擎的样本
59
00:01:58,400 --> 00:02:00,780
that were manufactured in the past that turn out to be anomalous.
这批制造的飞机发动机是有问题的
60
00:02:01,520 --> 00:02:02,360
Turned out to be flawed or strange in some way.
可能有瑕疵 或者别的什么问题
61
00:02:02,400 --> 00:02:04,130
Let's say we
同时我们还有
62
00:02:04,360 --> 00:02:05,750
also have some non-anomalous
一些无异常的样本
63
00:02:06,100 --> 00:02:07,810
examples, so some
也就是一些
64
00:02:08,050 --> 00:02:10,200
perfectly okay examples.
完全没问题的样本
65
00:02:10,940 --> 00:02:12,050
I'm going to use y equals 0
我用y=0来表示那些
66
00:02:12,110 --> 00:02:13,600
to denote the normal or the
完全正常
67
00:02:13,790 --> 00:02:15,470
non-anomalous example and
没有问题的样本
68
00:02:15,530 --> 00:02:21,450
y equals 1 to denote the anomalous examples.
用y=1来代表那些异常样本
69
00:02:22,450 --> 00:02:24,670
The process of developing and evaluating an anomaly
那么异常检测算法的推导和评价方法
70
00:02:25,130 --> 00:02:26,450
detection algorithm is as follows.
如下所示
71
00:02:27,500 --> 00:02:28,300
We're going to think of it as
我们先考虑
72
00:02:28,560 --> 00:02:29,830
a training set and talk
训练样本
73
00:02:30,000 --> 00:02:31,310
about the cross validation and test
交叉验证和测试集等下考虑
74
00:02:31,440 --> 00:02:32,580
sets later, but the training set we usually
对于训练集
75
00:02:33,280 --> 00:02:34,000
think of as still the unlabeled
我们还是看成无标签的
76
00:02:35,040 --> 00:02:36,180
training set.
训练集
77
00:02:36,510 --> 00:02:37,250
And so this is our large
所以这些就是
78
00:02:37,560 --> 00:02:39,580
collection of normal, non-anomalous
所有正常的
79
00:02:40,190 --> 00:02:41,130
or not anomalous examples.
或者说无异常样本的集合
80
00:02:42,400 --> 00:02:43,530
And usually we think
通常来讲
81
00:02:43,690 --> 00:02:44,750
of this as being non-anomalous,
我们把这些都看成无异常的
82
00:02:45,010 --> 00:02:46,490
but it's actually okay even
但可能有一些异常的
83
00:02:46,740 --> 00:02:48,660
if a few anomalies slip into
也被分到你的训练集里
84
00:02:48,660 --> 00:02:51,240
your unlabeled training set.
这也没关系
85
00:02:51,420 --> 00:02:52,100
And next we are going to
接下来我们要
86
00:02:52,310 --> 00:02:53,830
define a cross validation set
定义交叉验证集
87
00:02:54,100 --> 00:02:55,510
and a test set, with which
和测试集
88
00:02:55,750 --> 00:02:58,360
to evaluate a particular anomaly detection algorithm.
通过这两个集合来评价一个异常检测算法
89
00:02:59,230 --> 00:03:00,850
So, specifically, for both the
具体来说
90
00:03:01,000 --> 00:03:01,960
cross validation and test sets, we're
对交叉验证集和测试集
91
00:03:02,080 --> 00:03:03,590
going to assume that, you know, we
我们将假设
92
00:03:03,800 --> 00:03:05,030
can include a few examples
我们的交叉验证集
93
00:03:05,670 --> 00:03:06,720
in the cross validation set and
和测试集中
94
00:03:06,900 --> 00:03:08,150
the test set that contain examples
有一些样本
95
00:03:08,910 --> 00:03:09,660
that are known to be anomalous.
这些样本都是异常的
96
00:03:10,200 --> 00:03:11,410
So the test set, say
所以比如测试集
97
00:03:11,950 --> 00:03:13,270
we have a few examples with
里面的样本就是
98
00:03:13,340 --> 00:03:14,770
y equals 1 that
带标签y=1的
99
00:03:15,040 --> 00:03:17,470
correspond to anomalous aircraft engines.
这表示有异常的飞机引擎
100
00:03:18,640 --> 00:03:19,800
So here's a specific example.
这是一个具体的例子
101
00:03:20,930 --> 00:03:23,120
Let's say that, altogether, this
假如说
102
00:03:23,280 --> 00:03:24,990
is the data that we have.
这是我们总的数据
103
00:03:25,260 --> 00:03:27,410
We have manufactured 10,000 examples
我们制造了10000个引擎
104
00:03:28,130 --> 00:03:29,140
of engines that, as far
作为样本
105
00:03:29,450 --> 00:03:30,740
as we know, were perfectly normal,
就我们所知 这些样本
106
00:03:31,220 --> 00:03:33,110
perfectly good aircraft engines.
都是正常没有问题的飞机引擎
107
00:03:34,060 --> 00:03:35,240
And again, it turns out to be okay even
同样地 如果有一小部分
108
00:03:35,560 --> 00:03:37,310
if a few flawed engines
有问题的引擎
109
00:03:37,740 --> 00:03:39,400
slip into the set of
也被混入了这10000个样本
110
00:03:39,550 --> 00:03:40,860
10,000, that's actually okay, but
别担心 没有关系
111
00:03:40,970 --> 00:03:41,970
we kind of assumed that the vast
我们假设
112
00:03:42,410 --> 00:03:44,300
majority of these
这10000个样本中
113
00:03:44,500 --> 00:03:47,660
10,000 examples are, you know, good and normal non-anomalous engines.
大多数都是好的 没有问题的引擎
114
00:03:48,480 --> 00:03:50,940
And let's say that, you know, historically, however
而且实际上 从过去的经验来看
115
00:03:51,200 --> 00:03:52,120
long we've been running our manufacturing
无论是制造了多少年
116
00:03:52,650 --> 00:03:54,130
plant, let's say that
引擎的工厂
117
00:03:54,480 --> 00:03:55,930
we end up getting features,
我们都会得到这些数据
118
00:03:56,440 --> 00:03:57,970
getting 24 to 28
都会得到大概24到28个
119
00:03:58,240 --> 00:04:00,180
anomalous engines as well.
有问题的引擎
120
00:04:01,120 --> 00:04:03,030
And for a pretty typical application of
对于异常检测的典型应用来说
121
00:04:03,310 --> 00:04:05,490
anomaly detection, you know, the number of anomalous
异常样本的个数
122
00:04:06,740 --> 00:04:08,090
examples, that is with y equals
也就是y=1的样本
123
00:04:08,760 --> 00:04:10,650
1, we may have anywhere from, you know, 20 to 50.
基本上很多都是20到50个
124
00:04:10,820 --> 00:04:12,920
It would be a pretty typical
通常这个范围
125
00:04:13,360 --> 00:04:14,570
range of examples, number of
对y=1的样本数量
126
00:04:14,830 --> 00:04:16,710
examples that we have with y equals 1.
还是很常见的
127
00:04:16,910 --> 00:04:17,730
And usually we will have a
并且通常我们的
128
00:04:17,860 --> 00:04:20,000
much larger number of good examples.
正常样本的数量要大得多
129
00:04:21,810 --> 00:04:23,150
So, given this data set,
有了这组数据
130
00:04:24,180 --> 00:04:25,410
a fairly typical way to split
把数据分为训练集
131
00:04:25,850 --> 00:04:27,150
it into the training set,
交叉验证集和测试集
132
00:04:27,430 --> 00:04:29,210
cross validation set and test set would be as follows.
一种典型的分法如下
133
00:04:30,390 --> 00:04:31,880
Let's take 10,000 good aircraft
我们把这10000个正常的引擎
134
00:04:32,360 --> 00:04:34,060
engines and put 6,000
放6000个到
135
00:04:34,260 --> 00:04:37,100
of that into the unlabeled training set.
无标签的训练集中
136
00:04:37,620 --> 00:04:38,800
So, I'm calling this an unlabeled training
我叫它“无标签训练集”
137
00:04:39,130 --> 00:04:40,050
set but all of these examples
但其实所有这些样本
138
00:04:40,640 --> 00:04:42,510
are really ones that correspond to
实际上都对应
139
00:04:42,810 --> 00:04:44,380
y equals 0, as far as we know.
y=0的情况 至少据我们所知是这样
140
00:04:45,300 --> 00:04:46,350
And so, we will use this to
所以 我们要用它们
141
00:04:46,520 --> 00:04:48,840
fit p of x, right.
来拟合p(x)
142
00:04:49,150 --> 00:04:49,850
So, we will use these 6000 engines
也就是我们用这6000个引擎
143
00:04:50,350 --> 00:04:51,180
to fit p of x, which
来拟合p(x)
144
00:04:51,360 --> 00:04:52,190
is that p of x
也就是p 括号
145
00:04:52,420 --> 00:04:53,930
one parametrized by Mu
x1 参数是μ1
146
00:04:54,330 --> 00:04:56,380
1, sigma squared 1, up
σ1的平方
147
00:04:56,540 --> 00:04:57,700
to p of Xn parametrized
一直到p(xn; μn, σn^2)
148
00:04:58,370 --> 00:04:59,570
by Mu N sigma squared
参数是μn σn的平方
149
00:05:00,790 --> 00:05:02,300
n. And so it would be these
因此我们就是要用这
150
00:05:02,500 --> 00:05:03,930
6,000 examples that we would
6000个样本
151
00:05:04,110 --> 00:05:05,370
use to estimate the parameters
来估计参数
152
00:05:05,590 --> 00:05:06,760
Mu 1, sigma squared 1,
μ1, σ1的平方
153
00:05:07,140 --> 00:05:08,960
up to Mu N, sigma
一直到
154
00:05:09,200 --> 00:05:10,280
squared N. And so that's our training
μn, σn的平方
155
00:05:10,500 --> 00:05:11,960
set of all, you know,
这就是训练集中的好的样本
156
00:05:12,150 --> 00:05:13,980
good, or the vast majority of good examples.
或者说大多数好的样本
157
00:05:15,430 --> 00:05:16,950
Next we will take our good
然后 我们取一些
158
00:05:17,140 --> 00:05:18,380
aircraft engines and put some
好的飞机引擎样本
159
00:05:18,660 --> 00:05:19,470
number of them in a cross
放一些到交叉验证集
160
00:05:19,580 --> 00:05:21,320
validation set plus some number
再放一些到
161
00:05:21,570 --> 00:05:22,970
of them in the test sets.
测试集中
162
00:05:23,280 --> 00:05:24,300
So 6,000 plus 2,000 plus 2,000,
正好6000加2000加2000
163
00:05:24,480 --> 00:05:25,470
that's how we split up our
这10000个样本
164
00:05:25,740 --> 00:05:28,820
10,000 good aircraft engines.
就这样进行分割了
165
00:05:29,260 --> 00:05:31,460
And then we also have 20
同时 我们还有20个
166
00:05:31,930 --> 00:05:33,380
flawed aircraft engines, and we'll
异常的发动机样本
167
00:05:33,490 --> 00:05:34,890
take that and maybe split it
同样也把它们进行一个分割
168
00:05:35,160 --> 00:05:36,100
up, you know, put ten of them in
放10个到验证集中
169
00:05:36,200 --> 00:05:37,230
the cross validation set and put
剩下10个
170
00:05:37,370 --> 00:05:39,560
ten of them in the test sets.
放入测试集中
171
00:05:39,850 --> 00:05:41,320
And in the next slide
在下一张幻灯片中
172
00:05:41,660 --> 00:05:42,460
we will talk about how to
我们将看到如何用
173
00:05:42,750 --> 00:05:43,800
actually use this to evaluate
这些分好的数据
174
00:05:44,520 --> 00:05:46,330
the anomaly detection algorithm.
来评价异常检测算法
175
00:05:48,130 --> 00:05:49,140
So what I have
好的
176
00:05:49,220 --> 00:05:50,610
just described here is, you
刚才我介绍的这些内容
177
00:05:50,790 --> 00:05:52,300
know, probably a recommended
可能是一种
178
00:05:52,440 --> 00:05:55,290
good way of splitting the labeled and unlabeled examples.
比较推荐的方法来划分带标签和无标签的数据
179
00:05:55,820 --> 00:05:57,970
The good and the flawed aircraft engines.
来划分好的和坏的飞机引擎样本
180
00:05:58,480 --> 00:06:00,380
Where we use like
我们使用了
181
00:06:00,730 --> 00:06:01,650
a 60, 20, 20% split for
6:2:2的比例
182
00:06:01,800 --> 00:06:03,350
the good engines and we take
来分配好的引擎样本
183
00:06:03,570 --> 00:06:04,780
the flawed engines, and we
而坏的引擎样本
184
00:06:04,910 --> 00:06:05,750
put them just in the cross
我们只把它们放到
185
00:06:05,870 --> 00:06:06,940
validation set, and just in
交叉验证集和测试集中
186
00:06:07,030 --> 00:06:09,200
the test set, then we'll see in the next slide why that's the case.
在下一页中我们将讲解这样分的理由
187
00:06:10,370 --> 00:06:12,080
Just as an aside, if you
顺便说一下
188
00:06:12,360 --> 00:06:13,360
look at how people apply anomaly
如果你看到别人应用
189
00:06:13,750 --> 00:06:15,400
detection algorithms, sometimes you see
异常检测的算法时
190
00:06:15,510 --> 00:06:16,980
other people split the data differently as well.
有时候也可能会有不同的分配方法
191
00:06:17,460 --> 00:06:19,400
So, another alternative, this is really
另一种分配数据的方法是这样的
192
00:06:19,660 --> 00:06:21,290
not a recommended alternative, but
其实我真的不推荐这么分
193
00:06:21,470 --> 00:06:23,650
some people want to
但就有人喜欢这么分
194
00:06:23,790 --> 00:06:24,770
take your 10,000 good engines and maybe put 6,000
也就是把10000个好的引擎分出6000个
195
00:06:24,820 --> 00:06:26,020
of them in your training set
放到训练集中
196
00:06:26,320 --> 00:06:27,130
and then put the same
然后把剩下的4000个样本
197
00:06:27,650 --> 00:06:28,800
4000 in the cross validation
既用作交叉验证集
198
00:06:30,380 --> 00:06:31,020
set and the test set.
也用作测试集
199
00:06:31,170 --> 00:06:32,030
And so, you know, we like to think of the cross
通常来说我们要交叉验证集
200
00:06:32,360 --> 00:06:33,340
validation set and the
和测试集当作是