4 - 2 - Gradient Descent for Multiple Variables (5 min).srt
1
00:00:00,220 --> 00:00:03,688
In the previous video, we talked about the form of the hypothesis for linear regression with multiple features, or with multiple variables.
(Subtitles compiled by Huang Haiguang, Ocean University of China, haiguang2000@qq.com)
2
00:00:00,220 --> 00:00:03,688
In the previous video, we talked about
the form of the hypothesis for linear
3
00:00:03,688 --> 00:00:07,246
In the previous video, we talked about the form of the hypothesis for linear regression with multiple features, or with multiple variables.
4
00:00:03,688 --> 00:00:07,246
regression with multiple features
or with multiple variables.
5
00:00:07,246 --> 00:00:11,912
In this video, let's talk about how to fit the parameters of that hypothesis.
6
00:00:07,246 --> 00:00:11,912
In this video, let's talk about how to
fit the parameters of that hypothesis.
7
00:00:11,912 --> 00:00:15,175
In particular, let's talk about how to use gradient descent
8
00:00:11,912 --> 00:00:15,175
In particular let's talk about how
to use gradient descent for linear
9
00:00:15,175 --> 00:00:19,875
for linear regression with multiple features.
10
00:00:15,175 --> 00:00:19,875
regression with multiple features.
11
00:00:19,875 --> 00:00:24,802
To quickly summarize our notation, this is the formal hypothesis for multivariate linear regression,
12
00:00:19,875 --> 00:00:24,802
To quickly summarize our notation,
this is our formal hypothesis in
13
00:00:24,802 --> 00:00:31,509
where we have adopted the convention that x0 = 1.
14
00:00:24,802 --> 00:00:31,509
multivariable linear regression where
we've adopted the convention that x0=1.
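Written out from the description above, the hypothesis with the x0 = 1 convention is:

    h_\theta(x) = \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n, \qquad x_0 = 1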
15
00:00:31,509 --> 00:00:37,505
The parameters of this model are theta0 through theta n, but rather than thinking of them
16
00:00:31,509 --> 00:00:37,505
The parameters of this model are theta0
through theta n, but instead of thinking
17
00:00:37,505 --> 00:00:42,385
as n separate parameters, which is valid, we will instead treat
18
00:00:37,505 --> 00:00:42,385
of this as n separate parameters, which
is valid, I'm instead going to think of
19
00:00:42,385 --> 00:00:51,175
the theta parameters as a single (n+1)-dimensional vector.
20
00:00:42,385 --> 00:00:51,175
the parameters as theta where theta
here is a n+1-dimensional vector.
21
00:00:51,175 --> 00:00:55,498
So I'm just going to think of the parameters of this model as themselves being a vector.
22
00:00:51,175 --> 00:00:55,498
So I'm just going to think of the
parameters of this model
23
00:00:55,498 --> 00:00:58,674
So I'm just going to think of the parameters of this model as themselves being a vector.
24
00:00:55,498 --> 00:00:58,674
as itself being a vector.
25
00:00:58,674 --> 00:01:03,507
Our cost function is J of theta0 through theta n, which is given by the usual sum of
26
00:00:58,674 --> 00:01:03,507
Our cost function is J of theta0 through
theta n which is given by this usual
27
00:01:03,507 --> 00:01:08,983
squared error terms. But rather than thinking of J as a function of these n+1 numbers,
28
00:01:03,507 --> 00:01:08,983
sum of square of error term. But again
instead of thinking of J as a function
29
00:01:08,983 --> 00:01:14,016
I'm going to more generally write J as a function of the parameter vector theta.
30
00:01:08,983 --> 00:01:14,016
of these n+1 numbers, I'm going to
more commonly write J as just a
31
00:01:14,016 --> 00:01:22,275
So theta here is again a vector.
32
00:01:14,016 --> 00:01:22,275
function of the parameter vector theta
so that theta here is a vector.
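In this vector notation, the cost function referred to here, for m training examples, is:

    J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2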
33
00:01:22,275 --> 00:01:26,897
Here's what gradient descent looks like: we repeatedly update each parameter theta j,
34
00:01:22,275 --> 00:01:26,897
Here's what gradient descent looks like.
We're going to repeatedly update each
35
00:01:26,897 --> 00:01:32,142
setting theta j to theta j minus alpha times this derivative term.
36
00:01:26,897 --> 00:01:32,142
parameter theta j according to theta j
minus alpha times this derivative term.
37
00:01:32,142 --> 00:01:37,868
We again write this as J of theta, and theta j is updated as
38
00:01:32,142 --> 00:01:37,868
And once again we just write this as
J of theta, so theta j is updated as
39
00:01:37,868 --> 00:01:41,840
theta j minus the learning rate alpha times the partial derivative,
40
00:01:37,868 --> 00:01:41,840
theta j minus the learning rate
alpha times the derivative, a partial
41
00:01:41,840 --> 00:01:47,840
that is, the partial derivative of the cost function with respect to the parameter theta j.
42
00:01:41,840 --> 00:01:47,840
derivative of the cost function with
respect to the parameter theta j.
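The update described in words above is, for every j = 0, 1, ..., n, with all theta j updated simultaneously:

    \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)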
43
00:01:47,840 --> 00:01:51,305
Let's see what this looks like when we implement gradient descent and,
44
00:01:47,840 --> 00:01:51,305
Let's see what this looks like when
we implement gradient descent and,
45
00:01:51,305 --> 00:01:55,985
in particular, let's see what that partial derivative term looks like.
46
00:01:51,305 --> 00:01:55,985
in particular, let's go see what that
partial derivative term looks like.
47
00:01:55,985 --> 00:02:01,383
Here's what we had for gradient descent for the case of n = 1 feature.
48
00:01:55,985 --> 00:02:01,383
Here's what we have for gradient descent
for the case of when we had N=1 feature.
49
00:02:01,383 --> 00:02:06,782
We had two separate update rules, for the parameters theta0 and theta1, and
50
00:02:01,383 --> 00:02:06,782
We had two separate update rules for
the parameters theta0 and theta1, and
51
00:02:06,782 --> 00:02:12,779
hopefully these look familiar to you. This term here is, of course,
52
00:02:06,782 --> 00:02:12,779
hopefully these look familiar to you.
And this term here was of course the
53
00:02:12,779 --> 00:02:17,672
the partial derivative of the cost function with respect to the parameter theta0,
54
00:02:12,779 --> 00:02:17,672
partial derivative of the cost function
with respect to the parameter of theta0,
55
00:02:17,672 --> 00:02:21,891
and similarly we had a different update rule for the parameter theta1.
56
00:02:17,672 --> 00:02:21,891
and similarly we had a different
update rule for the parameter theta1.
57
00:02:21,891 --> 00:02:26,259
There's one small difference: previously, when we had only one feature,
58
00:02:21,891 --> 00:02:26,259
There's one little difference which is
that when we previously had only one
59
00:02:26,259 --> 00:02:31,992
we could call that feature x(i), but now, in our new notation,
60
00:02:26,259 --> 00:02:31,992
feature, we would call that feature x(i)
but now in our new notation
61
00:02:31,992 --> 00:02:38,462
we would naturally call it x(i)_1, to denote our one feature.
62
00:02:31,992 --> 00:02:38,462
we would of course call this
x(i)_1 to denote our one feature.
63
00:02:38,462 --> 00:02:41,019
So that was the case where we had only one feature.
64
00:02:38,462 --> 00:02:41,019
So that was for when
we had only one feature.
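For reference, the two update rules for the n = 1 case mentioned here are (repeated until convergence, with simultaneous updates):

    \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)
    \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}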
65
00:02:41,019 --> 00:02:44,496
Let's look at the new algorithm, for when we have more than one feature,
66
00:02:41,019 --> 00:02:44,496
Let's look at the new algorithm for when
we have more than one feature,
67
00:02:44,496 --> 00:02:47,350
where the number of features n may be much larger than one.
68
00:02:44,496 --> 00:02:47,350
where the number of features n
may be much larger than one.
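In standard notation, the new update rule for n features is, for j = 0, 1, ..., n, where x_j^{(i)} is the j-th feature of the i-th training example:

    \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}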
69
00:02:47,350 --> 00:02:53,158
We get this update rule for gradient descent, and perhaps for those of you
70
00:02:47,350 --> 00:02:53,158
We get this update rule for gradient
descent and, maybe for those of you that
71
00:02:53,158 --> 00:02:57,781
who know calculus, if you take the definition of the cost function and
72
00:02:53,158 --> 00:02:57,781
know calculus, if you take the
definition of the cost function and take
73
00:02:57,781 --> 00:03:03,312
compute the partial derivative of the cost function J with respect to the parameter theta j,
74
00:02:57,781 --> 00:03:03,312
the partial derivative of the cost
function J with respect to the parameter
75
00:03:03,312 --> 00:03:08,119
you'll find that that partial derivative is exactly
76
00:03:03,312 --> 00:03:08,119
theta j, you'll find that that partial
derivative is exactly that term that
77
00:03:08,119 --> 00:03:10,665
the term I've drawn the blue box around.
78
00:03:08,119 --> 00:03:10,665
I've drawn the blue box around.
79
00:03:10,665 --> 00:03:14,837
If you do this, you will get a working implementation of gradient descent
80
00:03:10,665 --> 00:03:14,837
And if you implement this you will
get a working implementation of
81
00:03:14,837 --> 00:03:18,962
for multivariate linear regression.
82
00:03:14,837 --> 00:03:18,962
gradient descent for
multivariate linear regression.
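A minimal implementation sketch of this algorithm, assuming a NumPy-style setup where X is the m-by-(n+1) design matrix whose first column is all ones (the x0 = 1 convention), y is the target vector, and the names alpha and num_iters are illustrative choices rather than anything shown in the video:

    import numpy as np

    def gradient_descent(X, y, theta, alpha=0.01, num_iters=1000):
        # X: (m, n+1) design matrix with a leading column of ones (x0 = 1)
        # y: (m,) target vector; theta: (n+1,) initial parameter vector
        m = len(y)
        for _ in range(num_iters):
            # Gradient of J(theta): (1/m) * X^T (X theta - y)
            gradient = X.T @ (X @ theta - y) / m
            # Simultaneously update every theta_j
            theta = theta - alpha * gradient
        return theta

Vectorizing the update this way keeps all theta j updated simultaneously, matching the simultaneous-update convention described in the lecture.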
83
00:03:18,962 --> 00:03:21,572
The last thing I want to do on this slide is give you a sense of
84
00:03:18,962 --> 00:03:21,572
The last thing I want to do on
this slide is give you a sense of
85
00:03:21,572 --> 00:03:26,882
why these new and old algorithms are essentially the same thing, or why they
86
00:03:21,572 --> 00:03:26,882
why these new and old algorithms are
sort of the same thing or why they're
87
00:03:26,882 --> 00:03:30,904
are similar algorithms, and why they are both gradient descent algorithms.
88
00:03:26,882 --> 00:03:30,904
both similar algorithms or why they're
both gradient descent algorithms.
89
00:03:30,904 --> 00:03:34,363
Let's consider a case where we have two features,
90
00:03:30,904 --> 00:03:34,363
Let's consider a case
where we have two features
91
00:03:34,363 --> 00:03:37,488
or maybe more than two features, so we have three update rules,
92
00:03:34,363 --> 00:03:37,488
or maybe more than two features,
so we have three update rules for
93
00:03:37,488 --> 00:03:42,680
for the parameters theta0 through theta2, and maybe for other values of theta as well.
94
00:03:37,488 --> 00:03:42,680
the parameters theta0, theta1, theta2
and maybe other values of theta as well.
95
00:03:42,680 --> 00:03:49,457
If you look at the update rule for theta0, you will find that
96
00:03:42,680 --> 00:03:49,457
If you look at the update rule for
theta0, what you find is that this
97
00:03:49,457 --> 00:03:55,300
this update rule is the same as the update rule we used before,
98
00:03:49,457 --> 00:03:55,300
update rule here is the same as
the update rule that we had previously
99
00:03:55,300 --> 00:03:57,350
in the earlier case of n = 1.
100
00:03:55,300 --> 00:03:57,350
for the case of n = 1.
101
00:03:57,350 --> 00:04:00,203
The reason they are equivalent is, of course,
102
00:03:57,350 --> 00:04:00,203
And the reason that they are
equivalent is, of course,
103
00:04:00,203 --> 00:04:06,871
that in our notational convention we have the x(i)_0 = 1 convention,
104
00:04:00,203 --> 00:04:06,871
because in our notational convention we
had this x(i)_0 = 1 convention, which is
105
00:04:06,871 --> 00:04:12,003
which is why the two terms in the magenta boxes are equivalent.
106
00:04:06,871 --> 00:04:12,003
why these two terms that I've drawn the
magenta boxes around are equivalent.
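Concretely, because x_0^{(i)} = 1, the theta0 update in the new notation reduces to the old one:

    \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}
              = \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)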
107
00:04:12,003 --> 00:04:16,010
Similarly, if you look at the update rule for theta1, you will find that
108
00:04:12,003 --> 00:04:16,010
Similarly, if you look the update
rule for theta1, you find that
109
00:04:16,010 --> 00:04:21,540
this term is equivalent to the term we used before,
110
00:04:16,010 --> 00:04:21,540
this term here is equivalent to
the term we previously had,
111
00:04:21,540 --> 00:04:25,020
or the equation, or update rule, that we previously had for theta1,
112
00:04:21,540 --> 00:04:25,020
or the equation or the update
rule we previously had for theta1,
113
00:04:25,020 --> 00:04:30,222
except of course we are now using the new notation x(i)_1 to denote
114
00:04:25,020 --> 00:04:30,222
where of course we're just using
this new notation x(i)_1 to denote
115
00:04:30,222 --> 00:04:37,605
our first feature. And now that we have multiple features,
116
00:04:30,222 --> 00:04:37,605
our first feature, and now that we have
more than one feature we can have
117
00:04:37,605 --> 00:04:43,560
we have similar update rules for the other parameters, such as theta2 and so on.
118
00:04:37,605 --> 00:04:43,560
similar update rules for the other
parameters like theta2 and so on.
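Likewise, the theta1 rule is the old one with x^{(i)} renamed to x_1^{(i)}, and the same pattern carries over to theta2 and beyond:

    \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_1^{(i)}
    \theta_2 := \theta_2 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_2^{(i)}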
119
00:04:43,560 --> 00:04:48,219
There's a lot of content on this slide, so I strongly encourage you
120
00:04:43,560 --> 00:04:48,219
There's a lot going on on this slide
so I definitely encourage you
121
00:04:48,219 --> 00:04:52,020
to pause the video and work carefully through the math on this slide
122
00:04:48,219 --> 00:04:52,020
if you need to, to pause the video
and look at all the math on this slide
123
00:04:52,020 --> 00:04:55,446
to make sure you understand everything on it.
124
00:04:52,020 --> 00:04:55,446
slowly to make sure you understand
everything that's going on here.
125
00:04:55,446 --> 00:05:00,440
But if you implement the algorithm written here,
126
00:04:55,446 --> 00:05:00,440
But if you implement the algorithm
written up here then you have
127
00:05:00,440 --> 00:05:51,300
then you already have a working implementation of multivariate linear regression.
128
00:05:00,440 --> 00:05:51,300
a working implementation of linear
regression with multiple features.