16 - 4 - Collaborative Filtering Algorithm (9 min).srt
1
00:00:00,240 --> 00:00:01,690
In the last couple videos, we
在前面几个视频里
(Subtitles compiled by Huang Haiguang, Ocean University of China, haiguang2000@qq.com)
2
00:00:01,820 --> 00:00:02,990
talked about the ideas of
我们谈到几个概念
3
00:00:03,140 --> 00:00:04,570
how, first, if you're
首先
4
00:00:04,780 --> 00:00:06,210
given features for movies, you
如果给你几个特征表示电影
5
00:00:06,920 --> 00:00:08,610
can use that to learn parameters theta for users.
我们可以使用这些资料去获得用户的参数数据
6
00:00:09,490 --> 00:00:11,400
And second, if you're given parameters for the users,
第二 如果给你用户的参数数据
7
00:00:11,920 --> 00:00:13,570
you can use that to learn features for the movies.
你可以使用这些资料去获得电影的特征
8
00:00:14,480 --> 00:00:15,550
In this video we're going
本节视频中
9
00:00:15,650 --> 00:00:16,670
to take those ideas and put
我们将会使用这些概念
10
00:00:16,850 --> 00:00:18,130
them together to come up
并且将它们合并成
11
00:00:18,280 --> 00:00:20,130
with a collaborative filtering algorithm.
协同过滤算法 (Collaborative Filtering Algorithm)
12
00:00:21,250 --> 00:00:22,450
So one of the things we worked
我们之前做过的事情
13
00:00:22,520 --> 00:00:23,640
out earlier is that if
其中之一是
14
00:00:23,680 --> 00:00:24,510
you have features for the
假如你有了电影的特征
15
00:00:24,600 --> 00:00:25,740
movies then you can solve
你就可以解出
16
00:00:26,070 --> 00:00:27,590
this minimization problem to find
这个最小化问题
17
00:00:27,950 --> 00:00:30,010
the parameters theta for your users.
为你的用户找到参数 θ
18
00:00:30,730 --> 00:00:32,260
And then we also
然后我们也
19
00:00:32,640 --> 00:00:33,960
worked that out, if you
知道了
20
00:00:34,360 --> 00:00:37,440
are given the parameters theta,
如果你拥有参数 θ
21
00:00:38,080 --> 00:00:38,990
you can also use that to
你也可以用该参数
22
00:00:39,170 --> 00:00:40,800
estimate the features x, and
通过解一个最小化问题
23
00:00:40,870 --> 00:00:42,980
you can do that by solving this minimization problem.
去计算出特征 x
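
For reference, the two separate problems being recalled here, reconstructed in the notation used throughout these videos, are:

    Given x^{(1)},...,x^{(n_m)}, learn the user parameters:

    \min_{\theta^{(1)},\dots,\theta^{(n_u)}} \frac{1}{2} \sum_{j=1}^{n_u} \sum_{i:r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2

    Given \theta^{(1)},...,\theta^{(n_u)}, learn the movie features:

    \min_{x^{(1)},\dots,x^{(n_m)}} \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2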
24
00:00:44,310 --> 00:00:45,720
So one thing you
所以你可以做的事
25
00:00:45,880 --> 00:00:47,360
could do is actually go back and forth.
是不停地重复这些计算
26
00:00:47,870 --> 00:00:50,230
Maybe randomly initialize the parameters
或许是随机地初始化这些参数
27
00:00:50,510 --> 00:00:51,350
and then solve for theta,
然后解出 θ
28
00:00:51,780 --> 00:00:52,690
solve for x, solve for theta,
解出 x 解出 θ
29
00:00:52,870 --> 00:00:54,330
solve for x. But, it
解出 x
30
00:00:54,420 --> 00:00:55,220
turns out that there is a
但实际上呢
31
00:00:55,400 --> 00:00:56,760
more efficient algorithm that doesn't
存在一个更有效率的算法
32
00:00:56,980 --> 00:00:57,910
need to go back and forth
让我们不再需要再这样不停地
33
00:00:58,110 --> 00:00:59,700
between the x's and the
计算 x 和 θ
34
00:00:59,730 --> 00:01:00,670
thetas, but that can solve
而是能够将
35
00:01:01,300 --> 00:01:04,250
for theta and x simultaneously.
x 和 θ 同时计算出来
36
00:01:05,160 --> 00:01:06,310
And here it is. What we are going to do, is basically take
下面就是这种算法 我们所要做的
37
00:01:06,600 --> 00:01:08,990
both of these optimization objectives, and
是将这两个优化目标函数
38
00:01:09,130 --> 00:01:10,640
put them into the same objective.
结合为一个
39
00:01:11,550 --> 00:01:12,590
So I'm going to define the
所以我要来定义
40
00:01:12,730 --> 00:01:15,010
new optimization objective j, which
这个新的优化目标函数 J
41
00:01:15,250 --> 00:01:16,540
is a cost function, that
它依然是一个代价函数
42
00:01:16,640 --> 00:01:17,630
is a function of my features
是我特征 x
43
00:01:18,050 --> 00:01:19,150
x and a function
和参数 θ
44
00:01:19,790 --> 00:01:20,750
of my parameters theta.
的函数
45
00:01:21,660 --> 00:01:23,050
And, it's basically the two optimization objectives
它其实就是上面那两个优化目标函数
46
00:01:23,520 --> 00:01:24,920
I had on top, but I put together.
但我将它们结合在一起
47
00:01:26,270 --> 00:01:27,760
So, in order to
为了把这个解释清楚
48
00:01:28,060 --> 00:01:31,140
explain this, first, I want to point out that this
首先 我想指出
49
00:01:31,400 --> 00:01:33,420
term over here, this squared
这里的这个表达式
50
00:01:33,820 --> 00:01:35,490
error term, is the same
这个平方误差项
51
00:01:35,920 --> 00:01:39,250
as this squared error term and the
和下面的这个项是相同的
52
00:01:39,760 --> 00:01:40,880
summations look a little bit
可能两个求和看起来有点不同
53
00:01:41,050 --> 00:01:42,940
different, but let's see what the summations are really doing.
但让我们来看看它们到底在做什么
54
00:01:43,800 --> 00:01:45,090
The first summation is sum
第一个求和运算
55
00:01:45,480 --> 00:01:48,280
over all users J and
是所有用户 J 的总和
56
00:01:48,380 --> 00:01:50,590
then sum over all movies rated by that user.
和所有被用户评分过的电影总和
57
00:01:51,890 --> 00:01:53,240
So, this is really summing over all
所以这其实是正在将
58
00:01:53,470 --> 00:01:55,950
pairs IJ, that correspond
所有关于 (i,j) 对的项全加起来
59
00:01:56,510 --> 00:01:57,830
to a movie that was rated by a user.
表示被用户评分过的电影
60
00:01:58,550 --> 00:01:59,960
Sum over J says, for every
关于 j 的求和
61
00:02:00,150 --> 00:02:01,520
user, the sum of
意思是 对每个用户
62
00:02:01,740 --> 00:02:03,110
all the movies rated by that user.
关于该用户评分的电影的求和
63
00:02:04,250 --> 00:02:07,340
This summation down here, just does things in the opposite order.
而下面的求和运算只是用相反的顺序去进行计算
64
00:02:07,630 --> 00:02:08,710
This says for every movie
这写着关于每部电影 i
65
00:02:09,050 --> 00:02:11,140
I, sum over all
求和 关于的是
66
00:02:11,340 --> 00:02:12,480
the users J that have
所有曾经对它评分过的
67
00:02:12,690 --> 00:02:14,580
rated that movie and so, you
用户 j
68
00:02:14,690 --> 00:02:16,100
know these summations, both of these
所以这些求和运算
69
00:02:16,220 --> 00:02:18,150
are just summations over all pairs
这两种都是对所有 (i,j) 对的求和
70
00:02:18,930 --> 00:02:21,150
ij for which
其中
71
00:02:21,440 --> 00:02:24,620
r of i J is equal to 1.
r(i,j) 是等于1的
72
00:02:24,660 --> 00:02:26,580
It's just something over all the
这只是所有你有评分的用户
73
00:02:27,180 --> 00:02:29,810
user movie pairs for which you have a rating.
和电影对而已
74
00:02:30,840 --> 00:02:32,230
and so those two terms
因此 这两个式子
75
00:02:32,600 --> 00:02:34,740
up there are just
其实就是
76
00:02:34,930 --> 00:02:36,460
exactly this first term, and
这里的第一个式子
77
00:02:36,500 --> 00:02:38,310
I've just written the summation here explicitly,
我已经给出了这个求和式子
78
00:02:39,310 --> 00:02:40,290
where I'm just saying the sum
这里我写着
79
00:02:40,580 --> 00:02:42,290
of all pairs (i,j), such that
其为所有 r(i,j) 值为1的
80
00:02:42,540 --> 00:02:45,060
r(i,j) is equal to 1.
(i,j) 对求和
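
In symbols, the point is that both orders of summation range over exactly the same set of rated (i,j) pairs:

    \sum_{j=1}^{n_u} \sum_{i:r(i,j)=1} (\cdot) \;=\; \sum_{i=1}^{n_m} \sum_{j:r(i,j)=1} (\cdot) \;=\; \sum_{(i,j):r(i,j)=1} (\cdot)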
81
00:02:45,310 --> 00:02:46,800
So what we're going
所以我们要做的
82
00:02:46,940 --> 00:02:48,790
to do is define a
是去定义
83
00:02:49,130 --> 00:02:51,410
combined optimization objective that
一个我们想将其最小化的
84
00:02:51,670 --> 00:02:53,290
we want to minimize in order
合并后的优化目标函数
85
00:02:53,550 --> 00:02:55,700
to solve simultaneously for x and theta.
让我们能同时解出 x 和 θ
86
00:02:56,970 --> 00:02:58,040
And then the other terms in
然后在这些优化目标函数里的
87
00:02:58,070 --> 00:03:00,250
the optimization objective are this,
另一个式子是这个
88
00:03:00,570 --> 00:03:02,870
which is a regularization in terms of theta.
其为 θ 所进行的正则化
89
00:03:03,770 --> 00:03:05,830
So that came down here and
它被放到这里
90
00:03:06,290 --> 00:03:08,190
the final piece is this
最后一部分
91
00:03:08,900 --> 00:03:10,690
term which is
是这项式
92
00:03:10,850 --> 00:03:12,970
my optimization objective for
是我 x 的优化目标函数
93
00:03:13,170 --> 00:03:16,180
the x's and that became this.
然后它变成这个
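
Written out, the combined objective described here, reconstructed in the course's notation, is:

    J\left(x^{(1)},\dots,x^{(n_m)}, \theta^{(1)},\dots,\theta^{(n_u)}\right) = \frac{1}{2} \sum_{(i,j):r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2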
94
00:03:16,500 --> 00:03:18,020
And this optimization objective
这个优化目标函数 J
95
00:03:18,720 --> 00:03:19,730
j actually has an interesting property
它有一个很有趣的特性
96
00:03:20,240 --> 00:03:20,950
that if you were to hold
如果你假设
97
00:03:21,410 --> 00:03:23,070
the x's constant and just
x 为常数
98
00:03:23,260 --> 00:03:25,490
minimize with respect to the thetas then
并关于 θ 优化的话
99
00:03:25,670 --> 00:03:27,040
you'd be solving exactly this problem,
你其实就是在计算这个式子
100
00:03:27,840 --> 00:03:28,450
whereas if you were to do
反过来也一样
101
00:03:28,620 --> 00:03:29,590
the opposite, if you were to
如果你把 θ 作为常量
102
00:03:29,690 --> 00:03:31,310
hold the thetas constant, and minimize
然后关于 x
103
00:03:31,670 --> 00:03:32,650
j only with respect to
求 J 的最小值的话
104
00:03:32,750 --> 00:03:34,920
the x's, then it becomes equivalent to this.
那就与第二个式子相等
105
00:03:35,230 --> 00:03:36,780
Because either this term
因为不管是这个部分
106
00:03:37,060 --> 00:03:38,860
or this term is constant if
还是这个部分 将会变成常数
107
00:03:38,970 --> 00:03:40,510
you're minimizing only with respect to the x's or only with respect to the thetas.
如果你将它化简成只以 x 或 θ 表达的话
108
00:03:40,920 --> 00:03:43,680
So here's an optimization
所以这里是
109
00:03:44,640 --> 00:03:46,840
objective that puts together my
一个将我的 x 和 θ
110
00:03:47,440 --> 00:03:50,230
cost functions in terms of x and in terms of theta.
合并起来的代价函数
111
00:03:51,620 --> 00:03:53,050
And in order to
然后
112
00:03:53,470 --> 00:03:54,750
come up with just one
为了解出
113
00:03:55,090 --> 00:03:56,130
optimization problem, what we're going
这个优化目标问题
114
00:03:56,280 --> 00:03:57,590
to do, is treat this
我们所要做的是
115
00:03:58,430 --> 00:03:59,850
cost function, as a
将这个代价函数视为
116
00:03:59,880 --> 00:04:00,890
function of my features
特征 x
117
00:04:01,410 --> 00:04:02,540
x and of my user
和用户参数 θ 的
118
00:04:03,180 --> 00:04:05,020
parameters theta and
函数
119
00:04:05,140 --> 00:04:06,570
just minimize this whole thing, as
然后全部化简为
120
00:04:06,740 --> 00:04:07,830
a function of both the
一个既关于 x
121
00:04:08,120 --> 00:04:10,210
x's and a function of the thetas.
也关于 θ 的函数
122
00:04:11,300 --> 00:04:12,400
And really the only difference
这和
123
00:04:12,540 --> 00:04:13,800
between this and the older
前面的算法之间
124
00:04:14,160 --> 00:04:15,650
algorithm is that, instead
唯一的不同是
125
00:04:15,980 --> 00:04:17,340
of going back and forth, previously
不需要反复计算
126
00:04:17,840 --> 00:04:20,110
we talked about minimizing with respect
就像我们之前所提到的
127
00:04:20,420 --> 00:04:22,130
to theta then minimizing with respect to x,
先关于 θ 最小化 然后关于 x 最小化
128
00:04:22,260 --> 00:04:23,370
whereas minimizing with respect to theta,
然后再关于 θ 最小化
129
00:04:23,900 --> 00:04:25,270
minimizing with respect to x and so on.
再关于 x 最小化...
130
00:04:26,130 --> 00:04:28,090
In this new version instead of
在新版本里头
131
00:04:28,560 --> 00:04:30,020
sequentially going between the
不需要不断地在 x 和 θ
132
00:04:30,220 --> 00:04:31,880
2 sets of parameters x and theta,
这两个参数之间不停折腾
133
00:04:32,180 --> 00:04:32,940
what we are going to do
我们所要做的是
134
00:04:33,230 --> 00:04:34,600
is just minimize with respect
将这两组参数
135
00:04:34,780 --> 00:04:36,410
to both sets of parameters simultaneously.
同时化简
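
As a concrete illustration of treating both parameter sets as one, here is a minimal NumPy sketch (the sizes are made up for illustration, not taken from the lecture) of unrolling X and Theta into a single vector and recovering them, which is what lets a single optimizer update both simultaneously:

    import numpy as np

    n_m, n_u, n = 5, 4, 3              # movies, users, features (toy sizes)
    X = np.ones((n_m, n))              # movie features, one row per movie
    Theta = np.ones((n_u, n))          # user parameters, one row per user

    # Unroll both matrices into one parameter vector for the optimizer...
    params = np.concatenate([X.ravel(), Theta.ravel()])

    # ...and reshape back whenever the cost or gradient is evaluated.
    X_back = params[:n_m * n].reshape(n_m, n)
    Theta_back = params[n_m * n:].reshape(n_u, n)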
136
00:04:39,750 --> 00:04:41,290
Finally one last detail
最后一件事是
137
00:04:42,030 --> 00:04:44,380
is that when we're learning the features this way.
当我们以这样的方法学习特征量时
138
00:04:45,110 --> 00:04:46,410
Previously we have been using
之前我们所使用的
139
00:04:46,840 --> 00:04:49,290
this convention that
前提是
140
00:04:49,470 --> 00:04:50,540
we have a feature x0 equals
我们所使用的特征 x0
141
00:04:50,740 --> 00:04:52,940
one that corresponds to an intercept term.
等于1 对应于一个截距
142
00:04:54,140 --> 00:04:55,530
When we are using this
当我们以
143
00:04:55,760 --> 00:04:57,790
sort of formalism where we're are actually learning the features,
这种形式真的去学习特征量时
144
00:04:58,300 --> 00:05:00,200
we are actually going to do away with this convention.
我们必须要去掉这个前提
145
00:05:01,400 --> 00:05:04,220
And so the features we are going to learn x, will be in Rn.
所以这些我们将学习的特征量 x 是 n 维实数
146
00:05:05,430 --> 00:05:06,650
Whereas previously we had
而先前我们所有的
147
00:05:06,810 --> 00:05:09,770
features x in Rn+1, including the intercept term.
特征值x 是 n+1 维 包括截距
148
00:05:10,390 --> 00:05:13,390
By getting rid of x0 we now just have x in Rn.
删除掉x0 我们现在只会有 n 维的 x
149
00:05:14,880 --> 00:05:16,520
And so similarly, because the
同样地
150
00:05:16,590 --> 00:05:17,780
parameters theta is in
因为参数 θ 是
151
00:05:17,850 --> 00:05:19,260
the same dimension, we now
在同一个维度上
152
00:05:19,510 --> 00:05:21,010
also have theta in Rn
所以 θ 也是 n 维的
153
00:05:21,540 --> 00:05:23,340
because if there's no
因为如果没有 x0
154
00:05:23,710 --> 00:05:24,580
x0, then there's no need
那么 θ0
155
00:05:25,370 --> 00:05:26,880
for a parameter theta 0 as well.
也不再需要
156
00:05:27,960 --> 00:05:28,880
And the reason we do away
我们将这个前提移除的理由是
157
00:05:29,160 --> 00:05:30,390
with this convention is because
因为我们现在是在
158
00:05:31,010 --> 00:05:32,610
we're now learning all the features, right?
学习所有的特征 对吧?
159
00:05:32,820 --> 00:05:34,280
So there is no need
所以我们没有必要
160
00:05:34,420 --> 00:05:36,650
to hard code the feature that is always equal to one.
去将这个等于一的特征值固定死
161
00:05:37,170 --> 00:05:38,310
Because if the algorithm really wants
因为如果算法真的需要
162
00:05:38,600 --> 00:05:39,450
a feature that is always equal
一个特征永远为1
163
00:05:40,060 --> 00:05:41,830
to 1, it can choose to learn one for itself.
它可以选择靠自己去获得1这个数值
164
00:05:42,290 --> 00:05:43,430
So if the algorithm chooses,
所以如果这算法想要的话
165
00:05:43,720 --> 00:05:45,330
it can set the feature X1 equals 1.
它可以将特征值 x1 设为1
166
00:05:45,670 --> 00:05:47,010
So there's no need
所以没有必要
167
00:05:47,260 --> 00:05:48,300
to hard code the feature
去将1 这个特征定死
168
00:05:48,440 --> 00:05:50,060
x0 equals 1; the algorithm now has
这样算法有了
169
00:05:50,340 --> 00:05:55,890
the flexibility to just learn it by itself. So, putting
灵活性去自行学习
170
00:05:56,420 --> 00:05:58,410
everything together, here is
所以 把所有讲的这些合起来
171
00:05:58,780 --> 00:05:59,910
our collaborative filtering algorithm.
即是我们的协同过滤算法
172
00:06:01,460 --> 00:06:02,330
first we are going to
首先我们将会把
173
00:06:03,010 --> 00:06:05,580
initialize x and
x 和 θ
174
00:06:05,820 --> 00:06:07,290
theta to small random values.
初始为小的随机值
175
00:06:08,450 --> 00:06:09,200
And this is a little bit
这有点像
176
00:06:09,310 --> 00:06:11,700
like neural network training, where there
神经网络训练
177
00:06:11,720 --> 00:06:14,240
we were also initializing all the parameters of a neural network to small random values.
我们也是将所有神经网路的参数用小的随机数值来初始化
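
A minimal NumPy sketch of that initialization step (the 0.01 scale and the array sizes are illustrative choices, not values from the lecture):

    import numpy as np

    n_m, n_u, n = 1682, 943, 10              # e.g. a MovieLens-sized problem
    X = np.random.randn(n_m, n) * 0.01       # small random movie features
    Theta = np.random.randn(n_u, n) * 0.01   # small random user parameters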
178
00:06:16,640 --> 00:06:17,730
Next we're then going
接下来 我们要用
179
00:06:17,950 --> 00:06:20,110
to minimize the cost function using
梯度下降 或者某些其他的高级优化算法
180
00:06:20,500 --> 00:06:23,360
gradient descent or one of the advanced optimization algorithms.
把这个代价函数最小化
181
00:06:24,610 --> 00:06:25,890
So, if you take derivatives you
所以如果你求导的话
182
00:06:26,020 --> 00:06:27,460
find that the gradient descent updates
你会发现梯度下降法
183
00:06:27,590 --> 00:06:29,320
look like these, and so this
写出来的更新式是这样的
184
00:06:29,630 --> 00:06:31,160
term here is the
这个部分就是
185
00:06:31,660 --> 00:06:33,890
partial derivative of the cost function,
代价函数
186
00:06:35,140 --> 00:06:35,940
I'm not going to write that out,
这里我简写了
187
00:06:36,110 --> 00:06:37,860
with respect to the feature
关于特征值 x(i)k 的偏微分
188
00:06:38,070 --> 00:06:40,020
value x(i)k and similarly
然后相同地
189
00:06:41,020 --> 00:06:42,430
this term here is also
这部分
190
00:06:43,030 --> 00:06:44,660
a partial derivative value of
也是代价函数
191
00:06:44,730 --> 00:06:46,480
the cost function with respect to the parameter
关于我们正在最小化的参数 θ
192
00:06:46,930 --> 00:06:48,950
theta that we're minimizing.
所做的偏微分
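
For reference, the gradient descent updates being described (reconstructed in the course's notation; alpha is the learning rate) are:

    x_k^{(i)} := x_k^{(i)} - \alpha \left( \sum_{j:r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right) \theta_k^{(j)} + \lambda x_k^{(i)} \right)

    \theta_k^{(j)} := \theta_k^{(j)} - \alpha \left( \sum_{i:r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right) x_k^{(i)} + \lambda \theta_k^{(j)} \right)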
193
00:06:50,210 --> 00:06:51,410
And just as a reminder, in
提醒一下
194
00:06:51,760 --> 00:06:52,920
this formula, we no
这公式里
195
00:06:53,130 --> 00:06:54,760
longer have this x0 equals
我们不再有这等于1 的 x0 项
196
00:06:54,970 --> 00:06:56,740
1 and so we have
所以
197
00:06:57,010 --> 00:07:00,010
that x is in Rn and theta is in Rn.
x 是 n 维 θ 也是n 维
198
00:07:01,480 --> 00:07:03,100
In this new formalism, we're regularizing
在这个新的表达式里
199
00:07:03,760 --> 00:07:05,220
every one of our parameters theta, you know, every one of our parameters x.
我们将所有的参数 θ 和 xn 做正则化
200
00:07:07,400 --> 00:07:09,060
There's no longer the special
不存在 θ0
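
Putting the steps of this video together, here is a minimal vectorized NumPy/SciPy sketch of the whole procedure. It assumes a ratings matrix Y of shape (n_m, n_u) and an indicator matrix R with R[i, j] = 1 where user j has rated movie i; the function and variable names are illustrative, not from the lecture:

    import numpy as np
    from scipy.optimize import minimize

    def cofi_cost_grad(params, Y, R, n_m, n_u, n, lam):
        # Recover both parameter sets from the single unrolled vector.
        X = params[:n_m * n].reshape(n_m, n)
        Theta = params[n_m * n:].reshape(n_u, n)

        # Squared error only over pairs (i, j) with r(i, j) = 1.
        err = (X @ Theta.T - Y) * R
        J = 0.5 * np.sum(err ** 2) \
            + lam / 2 * np.sum(X ** 2) \
            + lam / 2 * np.sum(Theta ** 2)

        # Gradients with respect to the movie features and the user parameters.
        X_grad = err @ Theta + lam * X
        Theta_grad = err.T @ X + lam * Theta
        return J, np.concatenate([X_grad.ravel(), Theta_grad.ravel()])

    # Toy data: 5 movies, 4 users, 3 latent features.
    n_m, n_u, n, lam = 5, 4, 3, 1.0
    Y = np.random.randint(1, 6, size=(n_m, n_u)).astype(float)
    R = (np.random.rand(n_m, n_u) > 0.3).astype(float)

    # Initialize x and theta to small random values, then minimize J
    # over both parameter sets simultaneously.
    init = np.random.randn(n_m * n + n_u * n) * 0.01
    res = minimize(cofi_cost_grad, init, args=(Y, R, n_m, n_u, n, lam),
                   jac=True, method='L-BFGS-B')
    print(res.fun)  # final value of the cost function J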