7 - 3 - Regularized Linear Regression (11 min).srt
1
00:00:00,260 --> 00:00:01,490
For linear regression, we had
对于线性回归的求解 我们之前
2
00:00:01,680 --> 00:00:03,130
previously worked out two learning
推导了两种学习算法
3
00:00:03,490 --> 00:00:05,010
algorithms, one based on
一种基于梯度下降
4
00:00:05,180 --> 00:00:07,650
gradient descent and one based on the normal equation.
一种基于正规方程
5
00:00:08,750 --> 00:00:09,740
In this video we will take
在这段视频中 我们将继续学习
6
00:00:09,890 --> 00:00:11,640
those two algorithms and generalize
这两个算法 并把它们推广
7
00:00:12,290 --> 00:00:13,380
them to the case of regularized
到正则化线性回归中去
8
00:00:14,330 --> 00:00:17,640
linear regression. Here's the
这是我们上节课推导出的
9
00:00:18,100 --> 00:00:19,540
optimization objective, that we
正则化线性回归的
10
00:00:20,200 --> 00:00:22,380
came up with last time for regularized linear regression.
优化目标
11
00:00:23,360 --> 00:00:24,580
This first part is our
前面的第一部分是
12
00:00:24,980 --> 00:00:27,240
usual objective for linear regression,
一般线性回归的目标函数
13
00:00:28,170 --> 00:00:29,300
and we now have this additional
而现在我们有这个额外的
14
00:00:30,200 --> 00:00:31,750
regularization term, where lambda
正则化项 其中 λ
15
00:00:32,450 --> 00:00:34,960
is our regularization parameter, and
是正则化参数
16
00:00:35,220 --> 00:00:36,690
we'd like to find parameters theta
我们想找到参数 θ
17
00:00:37,160 --> 00:00:38,550
that minimize this cost function,
能最小化代价函数
18
00:00:39,030 --> 00:00:41,280
this regularized cost function, J of theta.
即这个正则化代价函数 J(θ)
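For reference, the optimization objective referred to here, from the previous video, is the regularized linear regression cost function:

```latex
% Regularized linear regression objective: the usual squared-error term
% plus a regularization term; lambda is the regularization parameter.
J(\theta) = \frac{1}{2m}\left[ \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2
          + \lambda \sum_{j=1}^{n} \theta_j^2 \right]
```

Note that the regularization sum runs over j = 1, ..., n, so theta zero is not penalized (as emphasized later in this video).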
19
00:00:41,840 --> 00:00:43,030
Previously, we were using
之前 我们使用
20
00:00:43,440 --> 00:00:45,180
gradient descent for the original
梯度下降求解原来
21
00:00:46,620 --> 00:00:48,060
cost function, without the regularization
没有正则项的代价函数
22
00:00:48,770 --> 00:00:49,820
term, and we had
我们用
23
00:00:50,060 --> 00:00:51,990
the following algorithm for regular
下面的算法求解常规的
24
00:00:52,370 --> 00:00:53,620
linear regression, without regularization.
没有正则项的线性回归
25
00:00:54,660 --> 00:00:56,260
We will repeatedly update the
我们会如此反复更新
26
00:00:56,330 --> 00:00:57,670
parameters theta J as follows
参数 θj
27
00:00:58,270 --> 00:01:00,030
for J equals 0, 1, 2 up
其中 j=0, 1, 2...n
28
00:01:00,400 --> 00:01:02,110
through n. Let me
让我
29
00:01:02,530 --> 00:01:03,960
take this and just write
照这个把
30
00:01:04,240 --> 00:01:06,580
the case for theta zero separately.
j=0 即 θ0 的情况单独写出来
31
00:01:07,210 --> 00:01:08,400
So, you know, I'm just gonna
我只是把
32
00:01:08,720 --> 00:01:09,900
write the update for theta
θ0 的更新
33
00:01:10,160 --> 00:01:12,500
zero separately, then for
分离出来
34
00:01:12,680 --> 00:01:14,380
the update for the parameters
剩下的这些参数θ1, θ2 到θn的更新
35
00:01:14,780 --> 00:01:17,090
1, 2, 3, and so on up
作为另一部分
36
00:01:17,370 --> 00:01:19,760
to n. So, I haven't changed anything yet, right?
所以 这样做其实没有什么变化 对吧?
37
00:01:19,970 --> 00:01:21,070
This is just writing the update
这只是把 θ0 的更新
38
00:01:21,300 --> 00:01:23,300
for theta zero separately from the
这只是把 θ0 的更新
39
00:01:23,550 --> 00:01:25,240
updates from theta 1, theta
和 θ1 θ2 到 θn 的更新分离开来
40
00:01:25,510 --> 00:01:26,980
2, theta 3, up to theta n. And
和 θ1 θ2 到 θn 的更新分离开来
41
00:01:27,040 --> 00:01:27,900
the reason I want to do this
我这样做的原因是
42
00:01:28,230 --> 00:01:29,320
is you may remember
你可能还记得
43
00:01:29,880 --> 00:01:31,260
that for our regularized linear regression,
对于正则化的线性回归
44
00:01:32,620 --> 00:01:33,970
we penalize the parameters theta
我们惩罚参数θ1
45
00:01:34,440 --> 00:01:35,540
1, theta 2, and so
θ2...一直到
46
00:01:35,860 --> 00:01:38,360
on up to theta n, but we don't penalize theta zero.
θn 但是我们不惩罚θ0
47
00:01:38,820 --> 00:01:40,250
So when we modify this
所以 当我们修改这个
48
00:01:40,410 --> 00:01:42,400
algorithm for regularized
正则化线性回归的算法时
49
00:01:42,750 --> 00:01:44,050
linear regression, we're going to
我们将对
50
00:01:44,710 --> 00:01:46,870
end up treating theta zero slightly differently.
θ0 的方式将有所不同
51
00:01:48,560 --> 00:01:50,360
Concretely, if we
具体地说 如果我们
52
00:01:50,500 --> 00:01:52,170
want to take this algorithm and
要对这个算法进行
53
00:01:52,300 --> 00:01:53,780
modify it to use the
修改 并用它
54
00:01:53,870 --> 00:01:55,630
regularized objective, all we
求解正则化的目标函数 我们
55
00:01:55,740 --> 00:01:57,170
need to do is take this
需要做的是
56
00:01:57,350 --> 00:02:00,010
term at the bottom and modify it as follows.
把下边的这一项做如下的修改
57
00:02:00,460 --> 00:02:01,860
We're gonna take this term and add
我们要在这一项上添加一项:
58
00:02:02,670 --> 00:02:05,310
lambda over m,
λ 除以 m
59
00:02:06,330 --> 00:02:08,920
times theta J. And
再乘以 θj
60
00:02:09,100 --> 00:02:10,850
if you implement this, then you
如果这样做的话 那么你就有了
61
00:02:11,000 --> 00:02:13,230
have gradient descent for trying
用于最小化
62
00:02:13,960 --> 00:02:15,920
to minimize the regularized cost
正则化代价函数 J(θ)
63
00:02:16,160 --> 00:02:18,200
function J of theta, and concretely,
的梯度下降算法
64
00:02:19,520 --> 00:02:20,570
I'm not gonna do the
我不打算用
65
00:02:20,680 --> 00:02:22,260
calculus to prove it, but
微积分来证明这一点
66
00:02:22,390 --> 00:02:23,480
concretely if you look
但如果你看这一项
67
00:02:23,690 --> 00:02:26,580
at this term, this term that's written in square brackets.
方括号里的这一项
68
00:02:27,730 --> 00:02:28,930
If you know calculus, it's possible
如果你知道微积分
69
00:02:29,380 --> 00:02:31,150
to prove that that term is
应该不难证明它是
70
00:02:31,370 --> 00:02:33,150
the partial derivative, with respect to theta J, of
J(θ) 对 θj 的偏导数
71
00:02:33,980 --> 00:02:35,400
J of theta, using the new
这里的 J(θ) 是用的新定义的形式
72
00:02:35,660 --> 00:02:37,520
definition of J of theta
它的定义中
73
00:02:38,140 --> 00:02:39,330
with the regularization term.
包含正则化项
74
00:02:39,510 --> 00:02:42,490
And similarly, this
而另一项
75
00:02:42,760 --> 00:02:43,960
term up on top,
上面的这一项
76
00:02:44,750 --> 00:02:45,570
which I guess I am
我用青色的方框
77
00:02:45,680 --> 00:02:47,240
drawing the cyan box around,
圈出来的这一项
78
00:02:48,000 --> 00:02:49,270
that's still the partial derivative
这也一个是偏导数
79
00:02:49,940 --> 00:02:52,700
with respect to theta zero of J of theta.
是 J(θ)对 θ0 的偏导数
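Putting the two cases together, the gradient descent updates being described on this slide (repeated until convergence, with all parameters updated simultaneously) are:

```latex
% theta_0 keeps its original, unregularized update; for j = 1, ..., n the
% extra (lambda/m) * theta_j term from the regularized J(theta) is added.
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr) x_0^{(i)}
\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr) x_j^{(i)}
          + \frac{\lambda}{m}\theta_j \right], \qquad j = 1, \dots, n
```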
80
00:02:53,680 --> 00:02:54,900
If you look at the update for
如果你仔细看 θj 的更新
81
00:02:55,600 --> 00:02:56,710
theta J, it's possible to
你会发现一些
82
00:02:56,910 --> 00:02:59,190
show something pretty interesting. Concretely,
有趣的东西 具体来说
83
00:02:59,860 --> 00:03:01,100
theta J gets updated as
θj 的每次更新
84
00:03:01,280 --> 00:03:03,400
theta J, minus alpha times,
都是 θj 自己减去 α 乘以原来的无正则项
85
00:03:04,090 --> 00:03:05,010
and then you have this other term
然后还有这另外的一项
86
00:03:05,380 --> 00:03:06,730
here that depends on theta J.
这一项的大小也取决于 θj
87
00:03:06,910 --> 00:03:08,310
So if you
所以 如果你
88
00:03:08,420 --> 00:03:09,410
group all the terms together
把所有这些
89
00:03:10,030 --> 00:03:11,690
that depend on theta J, we
取决于 θj 的合在一起的话
90
00:03:11,780 --> 00:03:13,190
can show that this update can
可以证明 这个更新
91
00:03:13,670 --> 00:03:15,100
be written equivalently as
可以等价地写为
92
00:03:15,200 --> 00:03:16,160
follows and all I did
如下的形式
93
00:03:16,470 --> 00:03:17,620
was have, you know, theta J
具体来讲 上面的 θj
94
00:03:18,310 --> 00:03:20,100
here is theta J times
对应下面的 θj 乘以括号里的1
95
00:03:20,450 --> 00:03:21,950
1 and this term is
而这一项是
96
00:03:22,910 --> 00:03:24,830
lambda over m. There's also an alpha
λ 除以 m 还有一个α
97
00:03:25,140 --> 00:03:25,990
here, so you end up
把它们合在一起 所以你最终得到
98
00:03:26,180 --> 00:03:27,650
with alpha lambda over
α 乘以 λ 再除以 m
99
00:03:27,970 --> 00:03:31,450
m, multiply them to
然后合在一起 乘以 θj
100
00:03:31,820 --> 00:03:33,660
theta J and this term here, one minus
而这一项
101
00:03:34,230 --> 00:03:36,300
alpha times lambda over m, is
1 减去 α 乘以 λ 除以 m
102
00:03:36,600 --> 00:03:39,470
a pretty interesting term, it has a pretty interesting effect.
这一项很有意思
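Written out, the equivalent update just described is:

```latex
% Grouping the terms that depend on theta_j: the first factor shrinks theta_j
% slightly toward zero, and the second term is the ordinary unregularized update.
\theta_j := \theta_j \left(1 - \alpha\frac{\lambda}{m}\right)
          - \alpha \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr) x_j^{(i)}
```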
103
00:03:42,310 --> 00:03:43,710
Concretely, this term one
具体来说 这一项
104
00:03:43,890 --> 00:03:45,320
minus alpha times lambda over
1 减去 α 乘以 λ 除以 m
105
00:03:45,730 --> 00:03:46,780
M, is going to be
这一项的值
106
00:03:46,870 --> 00:03:48,740
a number that's, you know, usually a number
通常是一个具体的实数
107
00:03:48,800 --> 00:03:50,390
that's a little bit less than 1,
而且小于1
108
00:03:50,610 --> 00:03:51,670
right? Because of
对吧?由于
109
00:03:51,920 --> 00:03:53,580
alpha times lambda over m is
α 乘以 λ 除以 m
110
00:03:54,070 --> 00:03:55,920
going to be positive and usually, if your learning rate is small and m is large.
通常情况下是正的 如果你的学习速率小 而 m 很大的话
111
00:03:58,650 --> 00:03:58,860
That's usually pretty small.
(1 - αλ/m) 这一项通常是很小的
112
00:03:59,650 --> 00:04:00,680
So this term here, it's going
所以这里的一项
113
00:04:00,740 --> 00:04:03,060
to be a number, it's usually, you know, a little bit less than one.
一般来说将是一个比1小一点点的值
114
00:04:03,340 --> 00:04:04,150
So think of it as
所以我们可以把它想成
115
00:04:04,330 --> 00:04:05,860
a number like 0.99, let's say
一个像0.99一样的数字
116
00:04:07,380 --> 00:04:08,800
and so, the effect of our
所以
117
00:04:09,120 --> 00:04:10,550
updates of theta J is we're
对 θj 更新的结果
118
00:04:10,690 --> 00:04:11,950
going to say that theta J
我们可以看作是
119
00:04:12,410 --> 00:04:15,420
gets replaced by theta J times 0.99.
被替换为 θj 的0.99倍
120
00:04:15,770 --> 00:04:17,500
Alright so theta J
也就是 θj
121
00:04:18,490 --> 00:04:20,940
times 0.99 has the effect of
乘以0.99
122
00:04:21,280 --> 00:04:23,560
shrinking theta J a little bit towards 0.
把 θj 向 0 压缩了一点点
123
00:04:23,670 --> 00:04:25,690
So this makes theta J a bit smaller.
所以这使得 θj 小了一点
124
00:04:26,220 --> 00:04:28,080
More formally, this you know, this
更正式地说
125
00:04:28,420 --> 00:04:29,750
square norm of theta J
θj 的平方范数
126
00:04:29,870 --> 00:04:31,580
is smaller and then
更小了
127
00:04:31,720 --> 00:04:33,430
after that the second
另外 这一项后边的第二项
128
00:04:33,910 --> 00:04:35,400
term here, that's actually
这实际上
129
00:04:35,980 --> 00:04:37,930
exactly the same as the
与我们原来的
130
00:04:38,050 --> 00:04:40,270
original gradient descent update that we had
梯度下降更新完全一样
131
00:04:40,750 --> 00:04:42,840
before we added all this regularization stuff.
跟我们加入了正则项之前一样
132
00:04:44,270 --> 00:04:46,920
So, hopefully this gradient
好的 现在你应该对这个
133
00:04:47,380 --> 00:04:48,630
descent, hopefully this update makes
梯度下降的更新没有疑问了
134
00:04:48,880 --> 00:04:51,350
sense, when we're using regularized linear
当我们使用正则化线性回归时
135
00:04:51,550 --> 00:04:52,920
regression what we're doing is on
我们需要做的就是
136
00:04:53,320 --> 00:04:55,210
every iteration we're multiplying theta
在每一个被正规化的参数 θj 上
137
00:04:55,420 --> 00:04:56,310
J by a number that
乘以了一个
138
00:04:56,400 --> 00:04:57,300
is a little bit less than one, so
比1小一点点的数字
139
00:04:57,400 --> 00:04:58,900
we're shrinking the parameter a
也就是把参数压缩了一点
140
00:04:59,230 --> 00:05:00,340
little bit, and then we're
然后
141
00:05:00,500 --> 00:05:03,000
performing a, you know, similar update as before.
我们执行跟以前一样的更新
142
00:05:04,170 --> 00:05:05,460
Of course that's just the
当然 这仅仅是
143
00:05:05,610 --> 00:05:08,310
intuition behind what this particular update is doing.
从直观上认识 这个更新在做什么
144
00:05:08,910 --> 00:05:10,130
Mathematically, what it's doing
从数学上讲
145
00:05:10,580 --> 00:05:12,950
is exactly gradient descent on
它就是带有正则化项的 J(θ)
146
00:05:13,130 --> 00:05:14,330
the cost function J of theta
的梯度下降算法
147
00:05:15,150 --> 00:05:16,020
that we defined on the previous
我们在之前的幻灯片
148
00:05:16,480 --> 00:05:18,820
slide that uses the regularization term.
给出了定义
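As a rough illustration of this update, here is a minimal NumPy sketch of a single gradient descent step; the function name and the assumption that X is an m-by-(n+1) design matrix with a leading column of ones are illustrative choices, not something given in the lecture.

```python
import numpy as np

def regularized_gradient_step(theta, X, y, alpha, lam):
    """One gradient descent step for regularized linear regression (illustrative sketch).

    Assumes X is an m x (n+1) design matrix whose first column is all ones,
    y is an m-vector of labels, and theta is an (n+1)-vector of parameters.
    """
    m = len(y)
    error = X @ theta - y              # h_theta(x^(i)) - y^(i) for every example
    grad = (X.T @ error) / m           # unregularized gradient for every theta_j
    grad[1:] += (lam / m) * theta[1:]  # add (lambda/m) * theta_j for j >= 1 only
    return theta - alpha * grad        # simultaneous update of all parameters
```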
149
00:05:19,780 --> 00:05:21,210
Gradient descent was just
梯度下降只是
150
00:05:21,470 --> 00:05:23,050
one of our two algorithms for
我们拟合线性回归模型的两种算法
151
00:05:24,470 --> 00:05:25,530
fitting a linear regression model.
的其中一个
152
00:05:26,630 --> 00:05:28,090
The second algorithm was the
第二种算法是
153
00:05:28,160 --> 00:05:29,130
one based on the normal
使用正规方程
154
00:05:29,680 --> 00:05:31,650
equation where, what we
我们的做法
155
00:05:31,740 --> 00:05:32,980
did was we created the
是建立这个
156
00:05:33,060 --> 00:05:34,770
design matrix X, where each
设计矩阵 X 其中每一行
157
00:05:35,080 --> 00:05:37,830
row corresponded to a separate training example.
对应于一个单独的训练样本
158
00:05:38,520 --> 00:05:39,790
And we created a vector
然后创建了一个向量 y
159
00:05:40,170 --> 00:05:41,780
Y, so this is
向量 y 是一个
160
00:05:41,940 --> 00:05:43,320
a vector that is an
m 维的向量
161
00:05:43,590 --> 00:05:45,520
M dimensional vector and that
m 维的向量
162
00:05:46,010 --> 00:05:47,750
contains the labels from the training set.
包含了所有训练集里的标签
163
00:05:48,470 --> 00:05:49,600
So whereas X is an
所以 X 是一个
164
00:05:49,830 --> 00:05:52,660
M by N plus 1 dimensional matrix.
m × (n+1) 维矩阵
165
00:05:53,590 --> 00:05:55,220
Y is an M dimensional
y 是一个 m 维向量
166
00:05:55,780 --> 00:05:57,550
vector and in order
y 是一个 m 维向量
167
00:05:58,030 --> 00:05:59,200
to minimize the cost
为了最小化代价函数 J
168
00:05:59,470 --> 00:06:00,940
function J, we found
我们发现
169
00:06:01,470 --> 00:06:03,000
that one way
一个办法就是
170
00:06:03,230 --> 00:06:04,440
to do so is to set
一个办法就是
171
00:06:04,670 --> 00:06:06,790
theta to be equal to this.
让 θ 等于这个式子
172
00:06:07,540 --> 00:06:09,040
We have X transpose X,
即 X 的转置乘以 X 再对结果取逆
173
00:06:10,860 --> 00:06:12,770
inverse X transpose Y.
再乘以 X 的转置乘以Y
174
00:06:13,020 --> 00:06:13,920
I am leaving room here, to fill
我在这里留点空间
175
00:06:14,120 --> 00:06:17,160
in stuff of course. And what this
等下再填满
176
00:06:17,650 --> 00:06:18,820
value for theta does, is
这个 θ 的值
177
00:06:19,180 --> 00:06:20,980
this minimizes the cost
其实就是最小化
178
00:06:21,250 --> 00:06:22,710
function J of theta when
代价函数 J(θ) 的θ值
179
00:06:22,840 --> 00:06:26,280
we were not using regularization. Now
这时的代价函数J(θ)没有正则项
180
00:06:26,460 --> 00:06:28,580
that we are using regularization, if
现在如果我们用了是正则化
181
00:06:28,780 --> 00:06:30,290
you were to derive what the
我们想要得到最小值
182
00:06:30,520 --> 00:06:31,820
minimum is, and just to
我们想要得到最小值
183
00:06:31,910 --> 00:06:32,760
give you a sense of how to
我们来看看应该怎么得到
184
00:06:32,980 --> 00:06:34,110
derive the minimum, the way
我们来看看应该怎么得到
185
00:06:34,220 --> 00:06:35,220
you derive it is you know,
推导的方法是
186
00:06:35,930 --> 00:06:37,910
take partial derivatives with respect
取 J 关于各个参数的偏导数
187
00:06:38,340 --> 00:06:40,600
to each parameter, set this
并令它们
188
00:06:40,830 --> 00:06:41,910
to zero, and then do
等于0 然后做些
189
00:06:42,060 --> 00:06:42,920
a bunch of math, and you can
数学推导 你可以
190
00:06:43,100 --> 00:06:45,060
then show that there is a formula
得到这样的一个式子
191
00:06:45,550 --> 00:06:47,640
like this that minimizes the cost function.
它使得代价函数最小
192
00:06:48,590 --> 00:06:52,130
And concretely, if you
具体的说 如果你
193
00:06:52,240 --> 00:06:54,080
are using regularization then this
使用正则化
194
00:06:54,250 --> 00:06:56,320
formula changes as follows. Inside this
那么公式要做如下改变
195
00:06:56,480 --> 00:06:59,120
parenthesis, you end up with a matrix like this.
括号里结尾添这样一个矩阵
196
00:06:59,460 --> 00:07:00,940
Zero, one, one, one
0 1 1 1 等等
197
00:07:01,800 --> 00:07:03,520
and so on, one until the bottom.
直到最后一行
198
00:07:04,510 --> 00:07:05,510
So this thing over here is
所以这个东西在这里是
199
00:07:05,630 --> 00:07:07,810
a matrix, whose upper leftmost entry is zero.
一个矩阵 它的左上角的元素是0
200
00:07:08,560 --> 00:07:10,080
There's ones on the diagonals and
其余对角线元素都是1
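The closed-form solution this part of the video is building up to can be summarized as follows; this is a sketch of the regularized normal equation under the same conventions as above (m-by-(n+1) design matrix X, label vector y), with L the (n+1)-by-(n+1) matrix just described, having a zero in its upper-left entry and ones down the rest of the diagonal:

```latex
% Regularized normal equation: the lambda * L term is added inside the inverse,
% and the zero in L's top-left entry means theta_0 is not regularized.
\theta = \left( X^{T}X + \lambda L \right)^{-1} X^{T} y,
\qquad L = \mathrm{diag}(0, 1, 1, \dots, 1)
```

A corresponding NumPy sketch (again with illustrative names) might look like:

```python
import numpy as np

def regularized_normal_equation(X, y, lam):
    """Closed-form theta for regularized linear regression (illustrative sketch)."""
    L = np.eye(X.shape[1])   # (n+1) x (n+1) identity matrix
    L[0, 0] = 0.0            # zero in the upper-left entry: theta_0 is not regularized
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```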