1
00:00:00,000 --> 00:00:04,934
We've previously defined the cost function
J. In this video I want to tell you about
2
00:00:04,934 --> 00:00:09,634
an algorithm called gradient descent for
minimizing the cost function J. It turns
3
00:00:09,634 --> 00:00:14,275
out gradient descent is a more general
algorithm and is used not only in linear
4
00:00:14,275 --> 00:00:18,916
regression. It's actually used all over
the place in machine learning. And later
5
00:00:18,916 --> 00:00:23,791
in the class we'll use gradient descent to
minimize other functions as well, not just
6
00:00:23,791 --> 00:00:27,845
the cost function J, for linear regression.
So in this video, I'm going to
7
00:00:27,845 --> 00:00:32,558
talk about gradient descent for minimizing
some arbitrary function J. And then in
8
00:00:32,558 --> 00:00:37,406
later videos, we'll take this algorithm
and apply it specifically to the cost
9
00:00:37,406 --> 00:00:43,332
function J that we have defined for linear
regression. So here's the problem
10
00:00:43,332 --> 00:00:48,112
setup. We're going to see that we have
some function J of (theta0, theta1).
11
00:00:48,112 --> 00:00:52,773
Maybe it's a cost function from linear
regression. Maybe it's some other function
12
00:00:52,773 --> 00:00:56,801
we want to minimize. And we want
to come up with an algorithm for
13
00:00:56,801 --> 00:01:01,174
minimizing J as a function of
(theta0, theta1). Just as an aside,
14
00:01:01,174 --> 00:01:05,793
it turns out that gradient descent
actually applies to more general
15
00:01:05,793 --> 00:01:10,994
functions. So imagine if you have
a function that's a function of
16
00:01:10,994 --> 00:01:16,194
J of (theta0, theta1, theta2, up to
some theta n). And you want to
17
00:01:16,405 --> 00:01:21,795
minimize over (theta0 up to theta n)
of this J of (theta0 up to theta n).
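(In standard notation, the general problem just described is

    \min_{\theta_0, \ldots, \theta_n} J(\theta_0, \theta_1, \ldots, \theta_n)

and the rest of this video specializes to the two-parameter case, \min_{\theta_0, \theta_1} J(\theta_0, \theta_1).)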
18
00:01:21,795 --> 00:01:26,580
It turns out gradient descent
is an algorithm for solving
19
00:01:26,580 --> 00:01:31,368
this more general problem, but for the
sake of brevity, for the sake of
20
00:01:31,368 --> 00:01:35,935
succinctness of notation, I'm just
going to present only two parameters
21
00:01:36,113 --> 00:01:41,097
throughout the rest of this video. Here's
the idea for gradient descent. What we're
22
00:01:41,097 --> 00:01:45,882
going to do is we're going to start off
with some initial guesses for theta0 and
23
00:01:45,882 --> 00:01:50,788
theta1. Doesn't really matter what they
are, but a common choice would be if we
24
00:01:50,788 --> 00:01:55,452
set theta0 to 0, and
set theta1 to 0. Just initialize
25
00:01:55,452 --> 00:02:00,322
them to 0. What we're going to do in
gradient descent is we'll keep changing
26
00:02:00,322 --> 00:02:05,258
theta0 and theta1 a little bit to
try to reduce J of (theta0, theta1)
27
00:02:05,258 --> 00:02:10,571
until hopefully we wind up at a minimum or
maybe a local minimum. So, let's see
28
00:02:10,796 --> 00:02:16,106
pictures of what gradient descent
does. Let's say I try to minimize this
29
00:02:16,106 --> 00:02:20,849
function. So notice the axes. This is,
(theta0, theta1) on the horizontal
30
00:02:20,849 --> 00:02:25,774
axes, and J is a vertical axis. And so the
height of the surface shows J, and we
31
00:02:25,774 --> 00:02:30,582
want to minimize this function. So, we're
going to start off with (theta0, theta1)
32
00:02:30,582 --> 00:02:35,375
at some point. So imagine picking some
value for (theta0, theta1), and that
33
00:02:35,375 --> 00:02:39,934
corresponds to starting at some point on
the surface of this function. Okay? So
34
00:02:39,934 --> 00:02:44,201
whatever value of (theta0, theta1)
gives you some point here. I did
35
00:02:44,201 --> 00:02:48,819
initialize them to 0, but
sometimes you initialize it to other
36
00:02:48,819 --> 00:02:53,942
values as well. Now. I want us to imagine
that this figure shows a hill. Imagine
37
00:02:53,942 --> 00:02:59,178
this is like a landscape of some grassy
park with two hills like so.
38
00:02:59,178 --> 00:03:04,618
And I want you to imagine that you are
physically standing at that point on the
39
00:03:04,618 --> 00:03:09,990
hill on this little red hill in your park.
In gradient descent, what we're
40
00:03:09,990 --> 00:03:15,770
going to do is spin 360 degrees around and
just look all around us and ask, "If I were
41
00:03:15,770 --> 00:03:20,423
to take a little baby step in some
direction, and I want to go downhill as
42
00:03:20,423 --> 00:03:25,320
quickly as possible, what direction do I
take that little baby step in if I want to
43
00:03:25,320 --> 00:03:29,686
go down, if I sort of want to physically
walk down this hill as rapidly as
44
00:03:29,686 --> 00:03:34,465
possible?" Turns out that if we're standing
at that point on the hill, you look all
45
00:03:34,465 --> 00:03:39,185
around, you find that the best direction
to take a little step downhill
46
00:03:39,185 --> 00:03:44,035
is roughly that direction. Okay. And now
you're at this new point on your hill.
47
00:03:44,035 --> 00:03:49,430
You're going to, again, look all around, and then
say, "What direction should I step in order
48
00:03:49,430 --> 00:03:54,695
to take a little baby step downhill?" And
if you do that and take another step, you
49
00:03:54,695 --> 00:03:59,700
take a step in that direction, and then
you keep going. You know, from this new
50
00:03:59,700 --> 00:04:04,835
point, you look around, decide what
direction will take you downhill most
51
00:04:04,835 --> 00:04:09,775
quickly, take another step, another step,
and so on, until you converge to this,
52
00:04:09,970 --> 00:04:15,059
local minimum down here. Gradient descent
has an interesting property. This first
53
00:04:15,059 --> 00:04:19,682
time we ran gradient descent, we were
starting at this point over here, right?
54
00:04:19,682 --> 00:04:24,183
Started at that point over here. Now
imagine, we initialize gradient
55
00:04:24,183 --> 00:04:29,232
descent just a couple steps to the right.
Imagine we initialized gradient descent with
56
00:04:29,232 --> 00:04:34,159
that point on the upper right. If you were
to repeat this process, and start at that
57
00:04:34,159 --> 00:04:39,207
point, and look all around. Take a little
step in the direction of steepest descent.
58
00:04:39,207 --> 00:04:44,772
You would do that. Then look around, take
another step, and so on. And if you start
59
00:04:44,772 --> 00:04:50,570
it just a couple steps to the right, the
gradient descent will have taken you to
60
00:04:50,570 --> 00:04:56,236
this second local optimum over on the
right. So if you had started at this first
61
00:04:56,236 --> 00:05:01,602
point, you would have wound up at this
local optimum. But if you started just a
62
00:05:01,602 --> 00:05:06,762
little bit, a slightly different location,
you would have wound up at a very
63
00:05:06,762 --> 00:05:12,197
different local optimum. And this is a
property of gradient descent that we'll
64
00:05:12,197 --> 00:05:17,425
say a little bit more about later. So,
that's the intuition in pictures. Let's
65
00:05:17,425 --> 00:05:22,929
look at the math. This is the definition of
the gradient descent algorithm. We're
66
00:05:22,929 --> 00:05:28,240
going to just repeatedly do this until
convergence. We're going to update my
67
00:05:28,240 --> 00:05:33,543
parameter theta j by, you know, taking
theta j and subtracting from it alpha
68
00:05:33,543 --> 00:05:39,129
times this term over here. So, let's
see. There are a lot of details in this
69
00:05:39,129 --> 00:05:45,030
equation, so let me unpack some of it.
First, this notation here, colon equals.
70
00:05:45,030 --> 00:05:51,643
We're going to use := to denote
assignment, so it's the assignment
71
00:05:51,643 --> 00:05:57,790
operator. So concretely, if I
write A:=B, what this means in
72
00:05:57,790 --> 00:06:02,878
a computer, this means take the
value in B and use it to overwrite
73
00:06:02,878 --> 00:06:08,517
whatever the value of A. So this means we
will set A to be equal to the value of B.
74
00:06:08,517 --> 00:06:13,674
Okay, it's assignment. I can
also do A:=A+1. This means
75
00:06:13,674 --> 00:06:18,969
take A and increase its value by one.
Whereas in contrast, if I use the equals
76
00:06:18,969 --> 00:06:26,067
sign and I write A=B, then this is a
truth assertion. So if I write A=B,
77
00:06:26,067 --> 00:06:31,006
then I'm asserting that the value of A
equals to the value of B. So the left hand
78
00:06:31,006 --> 00:06:36,331
side, that's a computer operation, where
you set the value of A to be a value. The
79
00:06:36,331 --> 00:06:41,399
right hand side, this is asserting, I'm
just making a claim that the values of A
80
00:06:41,399 --> 00:06:46,274
and B are the same. And so, whereas I can
write A:=A+1, that means increment A by
81
00:06:46,274 --> 00:06:50,764
1. Hopefully, I won't ever write A=A+1.
Because that's just wrong.
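As a rough analogy in Python (not the notation on the slide: Python writes assignment as = and the truth-valued comparison as ==, but the distinction being made is the same), a minimal sketch:

```python
a, b = 0, 5

a = b          # assignment: overwrite a with the value of b (the slide's "a := b")
assert a == 5

a = a + 1      # assignment again: take a and increase its value by one ("a := a + 1")
assert a == 6

print(a == b)  # comparison: a truth assertion that never changes a or b; prints False
```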
82
00:06:50,764 --> 00:06:55,704
A and A+1 can never be equal to
the same values. So that's the first
83
00:06:55,704 --> 00:07:05,733
part of the definition. This alpha
here is a number that is called the
84
00:07:05,733 --> 00:07:12,360
learning rate. And what alpha does is, it
basically controls how big a step we take
85
00:07:12,360 --> 00:07:17,113
downhill with gradient descent. So if
alpha is very large, then that corresponds
86
00:07:17,113 --> 00:07:21,925
to a very aggressive gradient descent
procedure, where we're trying to take huge
87
00:07:21,925 --> 00:07:26,322
steps downhill. And if alpha is very
small, then we're taking little, little
88
00:07:26,322 --> 00:07:31,194
baby steps downhill. And, I'll come
back and say more about this later.
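As a rough illustration of what the size of alpha does, here is a minimal one-dimensional sketch in Python, using the stand-in function J(theta) = theta^2 with derivative 2*theta rather than the lecture's cost function:

```python
def descend(alpha, theta=1.0, steps=10):
    """Gradient descent on J(theta) = theta**2, whose derivative is 2 * theta."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta   # theta := theta - alpha * dJ/dtheta
    return theta

print(descend(alpha=0.01))  # ~0.82: baby steps, still far from the minimum at 0
print(descend(alpha=0.1))   # ~0.11: moderate steps, closing in on 0
print(descend(alpha=1.5))   # 1024.0: overly aggressive steps overshoot and diverge
```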
89
00:07:31,194 --> 00:07:35,660
About how to set alpha and so on.
And finally, this term here. That's the
90
00:07:35,660 --> 00:07:40,582
derivative term, I don't want to talk
about it right now, but I will derive this
91
00:07:40,582 --> 00:07:45,564
derivative term and tell you exactly what
this is based on. And some of you
92
00:07:45,564 --> 00:07:50,547
will be more familiar with calculus than
others, but even if you aren't familiar
93
00:07:50,547 --> 00:07:55,469
with calculus don't worry about it, I'll
tell you what you need to know about this
94
00:07:55,469 --> 00:08:00,580
term here. Now there's one more subtlety
about gradient descent which is, in
95
00:08:00,580 --> 00:08:05,837
gradient descent, we're going to
update theta0 and theta1. So
96
00:08:05,837 --> 00:08:10,699
this update takes place where j=0, and
j=1. So you're going to update theta0,
97
00:08:10,699 --> 00:08:15,955
and update theta1. And the subtlety of
how you implement gradient descent is,
98
00:08:15,955 --> 00:08:21,562
for this expression, for this
update equation, you want to
99
00:08:21,562 --> 00:08:31,384
simultaneously update theta0 and
theta1. What I mean by that is
100
00:08:31,384 --> 00:08:36,432
that in this equation,
we're going to update
101
00:08:36,432 --> 00:08:40,975
theta0:=theta0 - something, and update
theta1:=theta1 - something.
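(Spelled out, the "something" being subtracted is alpha times the partial-derivative term, so the update being described is, in the usual notation, with the derivative term itself derived in the next video:

    repeat until convergence {
        \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)    for j = 0 and j = 1, updated simultaneously
    }
)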
102
00:08:40,975 --> 00:08:45,834
And the way to implement this is,
you should compute the right hand
103
00:08:45,834 --> 00:08:52,677
side. Compute that thing for both theta0
and theta1, and then simultaneously at
104
00:08:52,677 --> 00:08:57,469
the same time update theta0 and
theta1. So let me say what I mean
105
00:08:57,469 --> 00:09:02,024
by that. This is a correct implementation
of gradient descent meaning simultaneous
106
00:09:02,024 --> 00:09:06,461
updates. I'm going to set temp0 equals
that, set temp1 equals that. So basically
107
00:09:06,461 --> 00:09:11,430
compute the right hand sides. And then having
computed the right hand sides and stored
108
00:09:11,430 --> 00:09:15,926
them together in temp0 and temp1,
I'm going to update theta0 and theta1
109
00:09:15,926 --> 00:09:20,245
simultaneously, because that's the
correct implementation. In contrast,
110
00:09:20,245 --> 00:09:25,533
here's an incorrect implementation that
does not do a simultaneous update. So in
111
00:09:25,533 --> 00:09:31,666
this incorrect implementation, we compute
temp0, and then we update theta0
112
00:09:31,666 --> 00:09:36,644
and then we compute temp1. Then we
update theta1. And the difference between
113
00:09:36,644 --> 00:09:41,877
the right hand side and the left hand side
implementations is that if we look down
114
00:09:41,877 --> 00:09:46,791
here, you look at this step, if by this
time you've already updated theta0
115
00:09:46,791 --> 00:09:51,897
then you would be using the new
value of theta0 to compute this
116
00:09:51,897 --> 00:09:57,340
derivative term and so this gives you a
different value of temp1 than the left
117
00:09:57,340 --> 00:10:01,565
hand side, because you've now
plugged in the new value of theta0
118
00:10:01,565 --> 00:10:05,852
into this equation. And so this right
hand side is not a correct implementation
119
00:10:05,852 --> 00:10:09,916
of gradient descent. So I don't
want to say why you need to do the
120
00:10:09,916 --> 00:10:14,617
simultaneous updates, it turns
out that the way gradient descent
121
00:10:14,617 --> 00:10:18,735
is usually implemented, we'll say more
about it later, it actually turns out to
122
00:10:18,735 --> 00:10:22,496
be more natural to implement the
simultaneous update. And when people talk
123
00:10:22,496 --> 00:10:26,665
about gradient descent, they always mean
simultaneous update. If you implement the
124
00:10:26,665 --> 00:10:30,630
non-simultaneous update, it turns out
it will probably work anyway, but this
125
00:10:30,630 --> 00:10:34,747
algorithm on the right is not what people
refer to as gradient descent and
126
00:10:34,747 --> 00:10:38,356
this is some other algorithm with
different properties. And for various
127
00:10:38,356 --> 00:10:42,220
reasons, this can behave in
slightly stranger ways. And
128
00:10:42,220 --> 00:10:46,626
what you should do is to really
implement the simultaneous update of
129
00:10:46,626 --> 00:10:52,313
gradient descent. So, that's the outline of the
gradient descent algorithm. In the next video,
130
00:10:52,313 --> 00:10:56,998
we're going to go into the details of the
derivative term, which I wrote out but
131
00:10:56,998 --> 00:11:01,799
didn't really define. And if you've taken
a calculus class before and if you're
132
00:11:01,799 --> 00:11:06,367
familiar with partial derivatives and
derivatives, it turns out that's exactly
133
00:11:06,367 --> 00:11:11,425
what that derivative term is. But in case
you aren't familiar with calculus, don't
134
00:11:11,425 --> 00:11:15,680
worry about it. The next video will give
you all the intuitions and will tell you
135
00:11:15,680 --> 00:11:19,883
everything you need to know to compute
that derivative term, even if you haven't
136
00:11:19,883 --> 00:11:24,296
seen calculus, or even if you haven't seen
partial derivatives before. And with
137
00:11:24,296 --> 00:11:28,288
that, with the next video, hopefully,
we'll be able to give all the intuitions
138
00:11:28,288 --> 00:11:30,180
you need to apply gradient descent.
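Putting the pieces of this video together, here is a minimal Python sketch of the outer loop, assuming the partial derivatives of J are supplied as functions (for linear regression they are derived in the following videos). The function names, the convergence test, and the example cost function are illustrative assumptions, not from the lecture; theta0 and theta1 default to 0, the common initialization mentioned above, while the usage example starts elsewhere so the loop has some work to do.

```python
def gradient_descent(dJ_dtheta0, dJ_dtheta1, theta0=0.0, theta1=0.0,
                     alpha=0.1, tol=1e-9, max_iters=10_000):
    """Repeat the simultaneous update until theta0 and theta1 stop changing."""
    for _ in range(max_iters):
        # Correct (simultaneous) update: compute BOTH right-hand sides first...
        temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
        temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
        if abs(temp0 - theta0) < tol and abs(temp1 - theta1) < tol:
            break
        # ...and only then overwrite theta0 and theta1 together.
        # (Assigning theta0 before computing temp1 would feed the new theta0 into
        #  the derivative term -- the "incorrect" version from the video, which is
        #  not what people mean by gradient descent.)
        theta0, theta1 = temp0, temp1
    return theta0, theta1


# Illustrative usage on the bowl-shaped stand-in J(theta0, theta1) = theta0^2 + theta1^2,
# whose partial derivatives are 2*theta0 and 2*theta1 (not the linear-regression cost):
t0, t1 = gradient_descent(dJ_dtheta0=lambda a, b: 2 * a,
                          dJ_dtheta1=lambda a, b: 2 * b,
                          theta0=1.0, theta1=-2.0)
print(t0, t1)  # both end up very close to 0, the minimum
```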