4 - 2 - Gradient Descent for Multiple Variables (5 min).srt
1
00:00:00,220 --> 00:00:03,688
In the previous video, we talked about the form of the hypothesis for linear regression with multiple features, or with multiple variables.
(Subtitles compiled by Huang Haiguang, Ocean University of China, haiguang2000@qq.com)
2
00:00:00,220 --> 00:00:03,688
In the previous video, we talked about
the form of the hypothesis for linear
3
00:00:03,688 --> 00:00:07,246
In the previous video, we talked about the form of the hypothesis for linear regression with multiple features, or with multiple variables.
4
00:00:03,688 --> 00:00:07,246
regression with multiple features
or with multiple variables.
5
00:00:07,246 --> 00:00:11,912
In this video, let's talk about how to fit the parameters of that hypothesis.
6
00:00:07,246 --> 00:00:11,912
In this video, let's talk about how to
fit the parameters of that hypothesis.
7
00:00:11,912 --> 00:00:15,175
In particular, let's talk about how to use gradient descent
8
00:00:11,912 --> 00:00:15,175
In particular let's talk about how
to use gradient descent for linear
9
00:00:15,175 --> 00:00:19,875
for linear regression with multiple features.
10
00:00:15,175 --> 00:00:19,875
regression with multiple features.
11
00:00:19,875 --> 00:00:24,802
To quickly summarize our notation, this is the formal hypothesis for multivariate linear regression,
12
00:00:19,875 --> 00:00:24,802
To quickly summarize our notation,
this is our formal hypothesis in
13
00:00:24,802 --> 00:00:31,509
where we have adopted the convention that x0 = 1.
14
00:00:24,802 --> 00:00:31,509
multivariable linear regression where
we've adopted the convention that x0=1.
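Written out from the description above, the hypothesis with the x0 = 1 convention is:

    h_\theta(x) = \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n, \qquad x_0 = 1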
15
00:00:31,509 --> 00:00:37,505
The parameters of this model are theta0 through theta n, but rather than thinking of them
16
00:00:31,509 --> 00:00:37,505
The parameters of this model are theta0
through theta n, but instead of thinking
17
00:00:37,505 --> 00:00:42,385
as n separate parameters, which is valid, we will instead treat
18
00:00:37,505 --> 00:00:42,385
of this as n separate parameters, which
is valid, I'm instead going to think of
19
00:00:42,385 --> 00:00:51,175
the theta parameters as a single (n+1)-dimensional vector.
20
00:00:42,385 --> 00:00:51,175
the parameters as theta where theta
here is a n+1-dimensional vector.
21
00:00:51,175 --> 00:00:55,498
So I'm just going to think of the parameters of this model as themselves being a vector.
22
00:00:51,175 --> 00:00:55,498
So I'm just going to think of the
parameters of this model
23
00:00:55,498 --> 00:00:58,674
So I'm just going to think of the parameters of this model as themselves being a vector.
24
00:00:55,498 --> 00:00:58,674
as itself being a vector.
25
00:00:58,674 --> 00:01:03,507
Our cost function is J of theta0 through theta n, which is given by the usual sum of
26
00:00:58,674 --> 00:01:03,507
Our cost function is J of theta0 through
theta n which is given by this usual
27
00:01:03,507 --> 00:01:08,983
squared error terms. But rather than thinking of J as a function of these n+1 numbers,
28
00:01:03,507 --> 00:01:08,983
sum of square of error term. But again
instead of thinking of J as a function
29
00:01:08,983 --> 00:01:14,016
I'm going to more generally write J as a function of the parameter vector theta.
30
00:01:08,983 --> 00:01:14,016
of these n+1 numbers, I'm going to
more commonly write J as just a
31
00:01:14,016 --> 00:01:22,275
So theta here is again a vector.
32
00:01:14,016 --> 00:01:22,275
function of the parameter vector theta
so that theta here is a vector.
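In this vector notation, the cost function referred to here, for m training examples, is:

    J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2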
33
00:01:22,275 --> 00:01:26,897
Here's what gradient descent looks like: we repeatedly update each parameter theta j,
34
00:01:22,275 --> 00:01:26,897
Here's what gradient descent looks like.
We're going to repeatedly update each
35
00:01:26,897 --> 00:01:32,142
setting theta j to theta j minus alpha times this derivative term.
36
00:01:26,897 --> 00:01:32,142
parameter theta j according to theta j
minus alpha times this derivative term.
37
00:01:32,142 --> 00:01:37,868
We again write this as J of theta, and theta j is updated as
38
00:01:32,142 --> 00:01:37,868
And once again we just write this as
J of theta, so theta j is updated as
39
00:01:37,868 --> 00:01:41,840
theta j minus the learning rate alpha times the partial derivative,
40
00:01:37,868 --> 00:01:41,840
theta j minus the learning rate
alpha times the derivative, a partial
41
00:01:41,840 --> 00:01:47,840
that is, the partial derivative of the cost function with respect to the parameter theta j.
42
00:01:41,840 --> 00:01:47,840
derivative of the cost function with
respect to the parameter theta j.
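The update described in words above is, for every j = 0, 1, ..., n, with all theta j updated simultaneously:

    \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)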
43
00:01:47,840 --> 00:01:51,305
Let's see what this looks like when we implement gradient descent and,
44
00:01:47,840 --> 00:01:51,305
Let's see what this looks like when
we implement gradient descent and,
45
00:01:51,305 --> 00:01:55,985
in particular, let's see what that partial derivative term looks like.
46
00:01:51,305 --> 00:01:55,985
in particular, let's go see what that
partial derivative term looks like.
47
00:01:55,985 --> 00:02:01,383
Here's what we had for gradient descent for the case of n = 1 feature.
48
00:01:55,985 --> 00:02:01,383
Here's what we have for gradient descent
for the case of when we had N=1 feature.
49
00:02:01,383 --> 00:02:06,782
We had two separate update rules, for the parameters theta0 and theta1, and
50
00:02:01,383 --> 00:02:06,782
We had two separate update rules for
the parameters theta0 and theta1, and
51
00:02:06,782 --> 00:02:12,779
hopefully these look familiar to you. This term here is, of course,
52
00:02:06,782 --> 00:02:12,779
hopefully these look familiar to you.
And this term here was of course the
53
00:02:12,779 --> 00:02:17,672
the partial derivative of the cost function with respect to the parameter theta0,
54
00:02:12,779 --> 00:02:17,672
partial derivative of the cost function
with respect to the parameter of theta0,
55
00:02:17,672 --> 00:02:21,891
and similarly we had a different update rule for the parameter theta1.
56
00:02:17,672 --> 00:02:21,891
and similarly we had a different
update rule for the parameter theta1.
57
00:02:21,891 --> 00:02:26,259
There's one small difference: previously, when we had only one feature,
58
00:02:21,891 --> 00:02:26,259
There's one little difference which is
that when we previously had only one
59
00:02:26,259 --> 00:02:31,992
we could call that feature x(i), but now, in our new notation,
60
00:02:26,259 --> 00:02:31,992
feature, we would call that feature x(i)
but now in our new notation
61
00:02:31,992 --> 00:02:38,462
we would naturally call it x(i)_1, to denote our one feature.
62
00:02:31,992 --> 00:02:38,462
we would of course call this
x(i)_1 to denote our one feature.
63
00:02:38,462 --> 00:02:41,019
So that was the case where we had only one feature.
64
00:02:38,462 --> 00:02:41,019
So that was for when
we had only one feature.
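For reference, the two update rules for the n = 1 case mentioned here are (repeated until convergence, with simultaneous updates):

    \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)
    \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}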
65
00:02:41,019 --> 00:02:44,496
Let's look at the new algorithm, for when we have more than one feature,
66
00:02:41,019 --> 00:02:44,496
Let's look at the new algorithm for when
we have more than one feature,
67
00:02:44,496 --> 00:02:47,350
where the number of features n may be much larger than one.
68
00:02:44,496 --> 00:02:47,350
where the number of features n
may be much larger than one.
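In standard notation, the new update rule for n features is, for j = 0, 1, ..., n, where x_j^{(i)} is the j-th feature of the i-th training example:

    \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}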
69
00:02:47,350 --> 00:02:53,158
We get this update rule for gradient descent, and perhaps for those of you
70
00:02:47,350 --> 00:02:53,158
We get this update rule for gradient
descent and, maybe for those of you that
71
00:02:53,158 --> 00:02:57,781
who know calculus, if you take the definition of the cost function and
72
00:02:53,158 --> 00:02:57,781
know calculus, if you take the
definition of the cost function and take
73
00:02:57,781 --> 00:03:03,312
compute the partial derivative of the cost function J with respect to the parameter theta j,
74
00:02:57,781 --> 00:03:03,312
the partial derivative of the cost
function J with respect to the parameter
75
00:03:03,312 --> 00:03:08,119
you'll find that that partial derivative is exactly
76
00:03:03,312 --> 00:03:08,119
theta j, you'll find that that partial
derivative is exactly that term that
77
00:03:08,119 --> 00:03:10,665
the term I've drawn the blue box around.
78
00:03:08,119 --> 00:03:10,665
I've drawn the blue box around.
79
00:03:10,665 --> 00:03:14,837
If you do this, you will get a working implementation of gradient descent
80
00:03:10,665 --> 00:03:14,837
And if you implement this you will
get a working implementation of
81
00:03:14,837 --> 00:03:18,962
for multivariate linear regression.
82
00:03:14,837 --> 00:03:18,962
gradient descent for
multivariate linear regression.
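A minimal implementation sketch of this algorithm, assuming a NumPy-style setup where X is the m-by-(n+1) design matrix whose first column is all ones (the x0 = 1 convention), y is the target vector, and the names alpha and num_iters are illustrative choices rather than anything shown in the video:

    import numpy as np

    def gradient_descent(X, y, theta, alpha=0.01, num_iters=1000):
        # X: (m, n+1) design matrix with a leading column of ones (x0 = 1)
        # y: (m,) target vector; theta: (n+1,) initial parameter vector
        m = len(y)
        for _ in range(num_iters):
            # Gradient of J(theta): (1/m) * X^T (X theta - y)
            gradient = X.T @ (X @ theta - y) / m
            # Simultaneously update every theta_j
            theta = theta - alpha * gradient
        return theta

Vectorizing the update this way keeps all theta j updated simultaneously, matching the simultaneous-update convention described in the lecture.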
83
00:03:18,962 --> 00:03:21,572
The last thing I want to do on this slide is give you a sense of
84
00:03:18,962 --> 00:03:21,572
The last thing I want to do on
this slide is give you a sense of
85
00:03:21,572 --> 00:03:26,882
why these new and old algorithms are essentially the same thing, or why they
86
00:03:21,572 --> 00:03:26,882
why these new and old algorithms are
sort of the same thing or why they're
87
00:03:26,882 --> 00:03:30,904
are similar algorithms, and why they are both gradient descent algorithms.
88
00:03:26,882 --> 00:03:30,904
both similar algorithms or why they're
both gradient descent algorithms.
89
00:03:30,904 --> 00:03:34,363
Let's consider a case where we have two features,
90
00:03:30,904 --> 00:03:34,363
Let's consider a case
where we have two features
91
00:03:34,363 --> 00:03:37,488
or maybe more than two features, so we have three update rules,
92
00:03:34,363 --> 00:03:37,488
or maybe more than two features,
so we have three update rules for
93
00:03:37,488 --> 00:03:42,680
for the parameters theta0 through theta2, and maybe for other values of theta as well.
94
00:03:37,488 --> 00:03:42,680
the parameters theta0, theta1, theta2
and maybe other values of theta as well.
95
00:03:42,680 --> 00:03:49,457
If you look at the update rule for theta0, you will find that
96
00:03:42,680 --> 00:03:49,457
If you look at the update rule for
theta0, what you find is that this
97
00:03:49,457 --> 00:03:55,300
this update rule is the same as the update rule we used before,
98
00:03:49,457 --> 00:03:55,300
update rule here is the same as
the update rule that we had previously
99
00:03:55,300 --> 00:03:57,350
in the earlier case of n = 1.
100
00:03:55,300 --> 00:03:57,350
for the case of n = 1.
101
00:03:57,350 --> 00:04:00,203
The reason they are equivalent is, of course,
102
00:03:57,350 --> 00:04:00,203
And the reason that they are
equivalent is, of course,
103
00:04:00,203 --> 00:04:06,871
that in our notational convention we have the x(i)_0 = 1 convention,
104
00:04:00,203 --> 00:04:06,871
because in our notational convention we
had this x(i)_0 = 1 convention, which is
105
00:04:06,871 --> 00:04:12,003
which is why the two terms in the magenta boxes are equivalent.
106
00:04:06,871 --> 00:04:12,003
why these two terms that I've drawn the
magenta boxes around are equivalent.
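Concretely, because x_0^{(i)} = 1, the theta0 update in the new notation reduces to the old one:

    \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}
              = \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)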
107
00:04:12,003 --> 00:04:16,010
Similarly, if you look at the update rule for theta1, you will find that
108
00:04:12,003 --> 00:04:16,010
Similarly, if you look the update
rule for theta1, you find that
109
00:04:16,010 --> 00:04:21,540
this term is equivalent to the term we used before,
110
00:04:16,010 --> 00:04:21,540
this term here is equivalent to
the term we previously had,
111
00:04:21,540 --> 00:04:25,020
or the equation, or update rule, that we previously had for theta1,
112
00:04:21,540 --> 00:04:25,020
or the equation or the update
rule we previously had for theta1,
113
00:04:25,020 --> 00:04:30,222
except of course we are now using the new notation x(i)_1 to denote
114
00:04:25,020 --> 00:04:30,222
where of course we're just using
this new notation x(i)_1 to denote
115
00:04:30,222 --> 00:04:37,605
our first feature. And now that we have multiple features,
116
00:04:30,222 --> 00:04:37,605
our first feature, and now that we have
more than one feature we can have
117
00:04:37,605 --> 00:04:43,560
we have similar update rules for the other parameters, such as theta2 and so on.
118
00:04:37,605 --> 00:04:43,560
similar update rules for the other
parameters like theta2 and so on.
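Likewise, the theta1 rule is the old one with x^{(i)} renamed to x_1^{(i)}, and the same pattern carries over to theta2 and beyond:

    \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_1^{(i)}
    \theta_2 := \theta_2 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_2^{(i)}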
119
00:04:43,560 --> 00:04:48,219
There's a lot of content on this slide, so I strongly encourage you
120
00:04:43,560 --> 00:04:48,219
There's a lot going on on this slide
so I definitely encourage you
121
00:04:48,219 --> 00:04:52,020
to pause the video and work carefully through the math on this slide
122
00:04:48,219 --> 00:04:52,020
if you need to, to pause the video
and look at all the math on this slide
123
00:04:52,020 --> 00:04:55,446
to make sure you understand everything on it.
124
00:04:52,020 --> 00:04:55,446
slowly to make sure you understand
everything that's going on here.
125
00:04:55,446 --> 00:05:00,440
But if you implement the algorithm written here,
126
00:04:55,446 --> 00:05:00,440
But if you implement the algorithm
written up here then you have
127
00:05:00,440 --> 00:05:51,300
then you already have a working implementation of multivariate linear regression.
128
00:05:00,440 --> 00:05:51,300
a working implementation of linear
regression with multiple features.