17 - 6 - Map Reduce and Data Parallelism (14 min).srt

1
00:00:00,320 --> 00:00:01,510
In the last few videos, we talked
在前面几个视频中,我们讨论了
(字幕整理:中国海洋大学 黄海广,haiguang2000@qq.com )
2
00:00:01,810 --> 00:00:03,430
about stochastic gradient descent, and,
随机梯度下降
3
00:00:03,620 --> 00:00:05,020
you know, other variations of the
以及梯度下降算法的
4
00:00:05,120 --> 00:00:06,530
stochastic gradient descent algorithm,
其他一些变种
5
00:00:06,910 --> 00:00:09,150
including those adaptations to online
包括如何将其
6
00:00:09,490 --> 00:00:10,420
learning, but all of those
运用于在线学习
7
00:00:10,610 --> 00:00:11,810
algorithms could be run on
然而所有这些算法
8
00:00:12,110 --> 00:00:13,740
one machine, or could be run on one computer.
都只能在一台计算机上运行
9
00:00:14,800 --> 00:00:15,870
And some machine learning problems
但是 有些机器学习问题
10
00:00:16,310 --> 00:00:17,270
are just too big to run
太大以至于不可能
11
00:00:17,520 --> 00:00:19,160
on one machine, sometimes maybe
只在一台计算机上运行
12
00:00:19,300 --> 00:00:21,050
you just have so much data you
有时候 它涉及的数据量如此巨大
13
00:00:21,170 --> 00:00:22,350
just don't ever want to run
不论你使用何种算法
14
00:00:22,670 --> 00:00:23,980
all that data through a
你都不希望只使用
15
00:00:24,100 --> 00:00:26,270
single computer, no matter what algorithm you would use on that computer.
一台计算机来处理这些数据
16
00:00:28,470 --> 00:00:29,640
So in this video I'd
因此 在这个视频中
17
00:00:29,740 --> 00:00:31,240
like to talk about a different approach
我希望介绍
18
00:00:31,770 --> 00:00:33,610
to large scale machine learning, called
进行大规模机器学习的另一种方法
19
00:00:34,010 --> 00:00:36,190
the map reduce approach.
称为map reduce (映射 化简) 方法
20
00:00:37,030 --> 00:00:38,080
And even though we have
尽管我们
21
00:00:38,380 --> 00:00:39,400
quite a few videos on stochastic
用了多个视频讲解
22
00:00:39,970 --> 00:00:41,230
gradient descent and we're going
随机梯度下降算法
23
00:00:41,550 --> 00:00:43,100
to spend relatively less time
而我们将只用少量时间
24
00:00:43,460 --> 00:00:45,350
on map reduce--don't judge the
介绍map reduce
25
00:00:45,560 --> 00:00:46,750
relative importance of map reduce
但是请不要根据
26
00:00:47,160 --> 00:00:48,240
versus the gradient descent
我们所花的时间长短
27
00:00:48,690 --> 00:00:49,590
based on the amount of
来判断哪一种技术
28
00:00:49,660 --> 00:00:51,480
time I spend on these ideas in particular.
更加重要
29
00:00:52,230 --> 00:00:53,380
Many people will say that
事实上 许多人认为
30
00:00:53,790 --> 00:00:54,840
map reduce is at least
map reduce方法至少是
31
00:00:55,090 --> 00:00:56,330
an equally important, and some
同等重要的
32
00:00:56,580 --> 00:00:57,850
would say an even more important idea
还有人认为map reduce方法
33
00:00:58,500 --> 00:01:00,620
compared to gradient descent, only
甚至比梯度下降方法更重要
34
00:01:01,460 --> 00:01:03,040
it's relatively simpler to
我们之所以只在
35
00:01:03,160 --> 00:01:04,620
explain, which is why I'm
map reduce上花少量时间
36
00:01:04,720 --> 00:01:05,580
going to spend less time on
只是因为它相对简单 容易解释
37
00:01:05,830 --> 00:01:07,040
it, but using these ideas
然而 实际上
38
00:01:07,670 --> 00:01:08,400
you might be able to scale
相比于随机梯度下降方法
39
00:01:09,070 --> 00:01:10,640
learning algorithms to even
map reduce方法
40
00:01:10,880 --> 00:01:12,520
far larger problems than is
能够处理
41
00:01:12,630 --> 00:01:14,530
possible using stochastic gradient descent.
更大规模的问题
42
00:01:18,720 --> 00:01:19,000
Here's the idea.
它的想法是这样的
43
00:01:19,810 --> 00:01:21,020
Let's say we want to fit
假设我们要
44
00:01:21,490 --> 00:01:22,960
a linear regression model or
拟合一个线性回归模型
45
00:01:23,140 --> 00:01:24,440
a logistic regression model or some
或者Logistic回归模型
46
00:01:24,540 --> 00:01:26,100
such, and let's start again
或者其他的什么模型
47
00:01:26,430 --> 00:01:27,660
with batch gradient descent, so
让我们再次从批量梯度下降算法开始吧
48
00:01:27,840 --> 00:01:30,300
that's our batch gradient descent learning rule.
这就是我们的批量梯度下降学习算法
49
00:01:31,240 --> 00:01:32,430
And to keep the writing
为了让幻灯片上的文字
50
00:01:32,850 --> 00:01:34,170
on this slide tractable, I'm going
更容易理解
51
00:01:34,340 --> 00:01:36,990
to assume throughout that we have m equals 400 examples.
我们将假定m固定为400个样本
52
00:01:37,530 --> 00:01:39,560
Of course, by our
当然 根据
53
00:01:39,750 --> 00:01:40,850
standards, in terms of large scale
大规模机器学习的标准
54
00:01:41,090 --> 00:01:42,050
machine learning, you know m
m等于400
55
00:01:42,170 --> 00:01:43,210
might be pretty small and so,
实在是太小了
56
00:01:43,770 --> 00:01:45,390
this might be more commonly
也许在实际问题中
57
00:01:45,870 --> 00:01:46,920
applied to problems, where you
你更有可能遇到
58
00:01:47,050 --> 00:01:48,190
have maybe closer to 400
样本大小为4亿
59
00:01:48,740 --> 00:01:49,940
million examples, or some
的数据
60
00:01:50,080 --> 00:01:51,310
such, but just to
或者其他差不多的大小
61
00:01:51,390 --> 00:01:52,330
make the writing on the slide
但是 为了使我们的讲解更加简单和清晰
62
00:01:52,770 --> 00:01:55,000
simpler, I'm going to pretend we have 400 examples.
我们假定我们只有400个样本
63
00:01:55,690 --> 00:01:57,460
So in that case, the
这样一来
64
00:01:57,790 --> 00:01:59,080
batch gradient descent learning rule
批量梯度下降学习算法中
65
00:01:59,570 --> 00:02:00,930
has this 400 and the
这里是400
66
00:02:01,500 --> 00:02:02,930
sum from i equals 1 through
以及400个样本的求和
67
00:02:03,330 --> 00:02:05,050
400 through my 400 examples
这里i从1取到400
68
00:02:05,590 --> 00:02:06,890
here, and if m
如果m很大
69
00:02:07,050 --> 00:02:09,780
is large, then this is a computationally expensive step.
那么这一步的计算量将会很大
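
For reference, the batch gradient descent update being described here, with m = 400 training examples, can be written as:

\[
\theta_j := \theta_j - \alpha \frac{1}{400} \sum_{i=1}^{400} \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr)\, x_j^{(i)}
\]
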
70
00:02:10,890 --> 00:02:12,830
So, what the MapReduce idea
因此 下面我们来介绍
71
00:02:13,250 --> 00:02:14,470
does is the following, and
map reduce算法
72
00:02:14,890 --> 00:02:15,740
I should say the map
这里我必须指出
73
00:02:15,950 --> 00:02:16,940
reduce idea is due to
map reduce算法的基本思想
74
00:02:17,680 --> 00:02:20,190
two researchers, Jeff Dean
来自Jeff Dean和Sanjay Ghemawat
75
00:02:20,700 --> 00:02:22,060
and Sanjay Ghemawat.
这两位研究者
76
00:02:22,640 --> 00:02:23,490
Jeff Dean, by the way, is
Jeff Dean是硅谷
77
00:02:24,190 --> 00:02:26,520
one of the most legendary engineers in
最为传奇般的
78
00:02:26,660 --> 00:02:28,300
all of Silicon Valley and he
一位工程师
79
00:02:28,420 --> 00:02:29,530
kind of built a large
今天谷歌 (Google) 所有的服务
80
00:02:29,820 --> 00:02:31,670
fraction of the architectural
所依赖的后台基础架构
81
00:02:32,310 --> 00:02:34,770
infrastructure that all of Google runs on today.
有很大一部分是他创建的
82
00:02:36,000 --> 00:02:37,320
But here's the map reduce idea.
接下来我们回到 map reduce 的基本想法
83
00:02:37,850 --> 00:02:38,570
So, let's say I have
假设我们有一个
84
00:02:38,700 --> 00:02:39,840
some training set, if we
训练样本
85
00:02:39,900 --> 00:02:41,220
want to denote by this box here
我们将它表示为
86
00:02:41,610 --> 00:02:42,760
of X Y pairs,
这个方框中的一系列X~Y数据对
87
00:02:44,250 --> 00:02:47,730
where it's X1, Y1, down
从X1~Y1开始
88
00:02:47,990 --> 00:02:49,640
to my 400 examples,
涵盖我所有的400个样本
89
00:02:50,520 --> 00:02:51,660
Xm, Ym.
直到X400~Y400
90
00:02:52,190 --> 00:02:53,780
So, that's my training set with 400 training examples.
总之 这就是我的400个训练样本
91
00:02:55,060 --> 00:02:56,550
In the MapReduce idea, one way
根据map reduce思想
92
00:02:56,690 --> 00:02:58,190
to do this is to split this training
一种解决方案是
93
00:02:58,570 --> 00:03:00,510
set into different subsets.
将训练集划分成几个不同的子集
94
00:03:01,890 --> 00:03:02,590
I'm going to
在这个例子中
95
00:03:02,950 --> 00:03:04,150
assume for this example that
我假定我有
96
00:03:04,290 --> 00:03:05,530
I have 4 computers,
4台计算机
97
00:03:06,160 --> 00:03:07,160
or 4 machines to run in
它们并行的
98
00:03:07,300 --> 00:03:08,670
parallel on my training set,
处理我的训练数据
99
00:03:08,890 --> 00:03:10,570
which is why I'm splitting this into 4 machines.
因此我要将数据划分成4份 分给这4台计算机
100
00:03:10,920 --> 00:03:12,290
If you have 10 machines or
如果你有10台计算机
101
00:03:12,400 --> 00:03:13,810
100 machines, then you would
或者100台计算机
102
00:03:13,970 --> 00:03:15,890
split your training set into 10 pieces or 100 pieces or what have you.
那么你可能会将训练数据划分成10份或者100份
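
As a minimal sketch of the split being described (the function name split_training_set and the contiguous-chunk scheme are illustrative assumptions, not from the lecture), the training set can be cut into k roughly equal pieces, one per machine:

def split_training_set(X, y, k):
    # Cut the m examples into k roughly equal, contiguous chunks, one per machine.
    # With m = 400 and k = 4, each chunk holds 100 examples.
    m = len(X)
    bounds = [i * m // k for i in range(k + 1)]
    return [(X[bounds[i]:bounds[i + 1]], y[bounds[i]:bounds[i + 1]])
            for i in range(k)]
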
103
00:03:18,040 --> 00:03:19,710
And what the first of my
我的4台计算机中
104
00:03:19,850 --> 00:03:20,840
4 machines is going to do,
第一台
105
00:03:21,100 --> 00:03:23,170
say, is use just the
将处理第一个
106
00:03:23,270 --> 00:03:25,170
first one quarter of my
四分之一训练数据
107
00:03:25,300 --> 00:03:28,680
training set--so use just the first 100 training examples.
也就是前100个训练样本
108
00:03:30,020 --> 00:03:31,440
And in particular, what it's
具体来说
109
00:03:31,480 --> 00:03:32,520
going to do is look at
这台计算机
110
00:03:32,630 --> 00:03:34,800
this summation, and compute
将参与处理这个求和
111
00:03:35,490 --> 00:03:38,560
that summation for just the first 100 training examples.
它将对前100个训练样本进行求和运算
112
00:03:40,030 --> 00:03:40,960
So let me write that up
让我把公式写下来吧
113
00:03:41,110 --> 00:03:42,530
I'm going to compute a variable
我将计算临时变量
114
00:03:43,560 --> 00:03:46,230
temp superscript 1, for
temp 1 这里的上标1
115
00:03:46,320 --> 00:03:49,410
the first machine, subscript J, equals
表示第一台计算机
116
00:03:50,450 --> 00:03:52,150
sum from i equals 1 through
其下标为j 该变量等于从1到100的求和
117
00:03:52,260 --> 00:03:53,160
100, and then I'm going to plug
然后我在这里写的部分
118
00:03:53,500 --> 00:03:56,610
in exactly that term there--so I have
和这里的完全相同
119
00:03:57,260 --> 00:04:00,140
H of theta of Xi, minus Yi
也就是h θ Xi减Yi
120
00:04:01,800 --> 00:04:03,230
times Xij, right?
乘以Xij
121
00:04:03,740 --> 00:04:05,680
So that's just that
这其实就是
122
00:04:05,910 --> 00:04:07,460
gradient descent term up there.
这里的梯度下降公式中的这一项
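
In symbols, the quantity the first machine computes over its 100 training examples is:

\[
\mathrm{temp}^{(1)}_j = \sum_{i=1}^{100} \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr)\, x_j^{(i)}
\]
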
123
00:04:08,300 --> 00:04:09,780
And then similarly, I'm going
然后 类似的
124
00:04:10,010 --> 00:04:11,330
to take the second quarter
我将用第二台计算机
125
00:04:11,600 --> 00:04:13,130
of my data and send it
处理我的
126
00:04:13,320 --> 00:04:14,520
to my second machine, and
第二个四分之一数据
127
00:04:14,690 --> 00:04:15,680
my second machine will use
也就是说 我的第二台计算机
128
00:04:15,900 --> 00:04:18,750
training examples 101 through 200
将使用第101到200号训练样本
129
00:04:19,350 --> 00:04:21,170
and it will compute a similar variable
类似的 我们用它
130
00:04:21,720 --> 00:04:22,880
called temp 2 j, which
计算临时变量 temp 2 j
131
00:04:23,110 --> 00:04:24,450
is the same sum for index
也就是从101到200号
132
00:04:24,890 --> 00:04:26,620
from examples 101 through 200.
数据的求和
133
00:04:26,840 --> 00:04:29,680
And similarly machines 3
类似的 第三台和第四台
134
00:04:29,830 --> 00:04:32,720
and 4 will use the
计算机将会使用
135
00:04:32,830 --> 00:04:34,110
third quarter and the fourth
第三个和第四个
136
00:04:34,570 --> 00:04:36,550
quarter of my training set.
四分之一训练样本
137
00:04:37,530 --> 00:04:38,950
So now each machine has
这样 现在每台计算机
138
00:04:39,190 --> 00:04:40,580
to sum over 100 instead
不用处理400个样本
139
00:04:41,060 --> 00:04:42,570
of over 400 examples and so
而只用处理100个样本
140
00:04:42,760 --> 00:04:43,750
has to do only a quarter
它们只用完成
141
00:04:44,050 --> 00:04:45,220
of the work and thus presumably
四分之一的工作量
142
00:04:45,900 --> 00:04:48,000
it could do it about four times as fast.
这样 也许可以将运算速度提高到原来的四倍
143
00:04:49,380 --> 00:04:50,630
Finally, after all these machines
最后 当这些计算机
144
00:04:50,990 --> 00:04:51,740
have done this work, I am
全都完成了各自的工作
145
00:04:51,850 --> 00:04:53,560
going to take these temp variables
我会将这些临时变量
146
00:04:55,350 --> 00:04:56,480
and put them back together.
收集到一起
147
00:04:56,870 --> 00:04:58,400
So I take these variables and
我会将它们
148
00:04:58,530 --> 00:04:59,950
send them all to a, you
送到一个
149
00:05:00,090 --> 00:05:03,080
know, centralized master server, and
中心计算服务器
150
00:05:03,300 --> 00:05:04,750
what the master will do
这台服务器会
151
00:05:05,140 --> 00:05:06,720
is combine these results together.
将这些临时变量合并起来
152
00:05:07,360 --> 00:05:08,470
and in particular, it will
具体来说
153
00:05:08,780 --> 00:05:10,780
update my parameters theta
它将根据以下公式
154
00:05:11,000 --> 00:05:13,160
j according to theta
来更新参数θj
155
00:05:13,410 --> 00:05:14,720
j gets updated as theta j
新的θj将等于
156
00:05:15,730 --> 00:05:17,560
minus the
旧的θj减去
157
00:05:17,680 --> 00:05:19,510
learning rate alpha times one
学习速率α乘以
158
00:05:20,120 --> 00:05:22,940
over 400 times temp,
400分之一
159
00:05:23,300 --> 00:05:27,410
1, J, plus temp
乘以临时变量 temp 1 j
160
00:05:27,760 --> 00:05:30,290
2j plus temp 3j
加temp 2j 加temp 3j
161
00:05:32,400 --> 00:05:35,470
plus temp 4j and
加temp 4j
162
00:05:35,560 --> 00:05:37,890
of course we have to do this separately for J equals 0.
当然 对于j等于0的情况我们需要单独处理
163
00:05:37,980 --> 00:05:39,570
You know, up to
这里 j从0
164
00:05:39,820 --> 00:05:41,220
n, the total number of features.
取到特征总数n
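
Written out, the combination step performed by the master server is:

\[
\theta_j := \theta_j - \alpha \frac{1}{400} \Bigl( \mathrm{temp}^{(1)}_j + \mathrm{temp}^{(2)}_j + \mathrm{temp}^{(3)}_j + \mathrm{temp}^{(4)}_j \Bigr), \qquad j = 0, 1, \ldots, n
\]
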
165
00:05:42,550 --> 00:05:45,420
So, writing this equation out over several lines, I hope it's clear.
通过将这个公式拆成多行讲解 我希望大家已经理解了
166
00:05:45,670 --> 00:05:47,870
So what this equation
其实 这个公式计算的数值
167
00:05:50,930 --> 00:05:53,220
is doing is exactly the
和原先的梯度下降公式计算的数值
168
00:05:53,290 --> 00:05:54,570
same as when you
是完全一样的
169
00:05:54,660 --> 00:05:56,140
have a centralized master server
只不过 现在我们有一个中心运算服务器
170
00:05:56,680 --> 00:05:57,950
that takes the results, the temp
它收集了一些部分计算结果
171
00:05:58,040 --> 00:05:58,780
one j, the temp two j,
temp 1j temp 2j
172
00:05:59,000 --> 00:05:59,850
temp three j and temp four
temp 3j 和 temp4j
173
00:05:59,970 --> 00:06:01,760
j and adds them up
把它们加了起来
174
00:06:02,030 --> 00:06:03,430
and so of course the sum
很显然 这四个
175
00:06:04,090 --> 00:06:04,960
of these four things.
临时变量的和
176
00:06:06,360 --> 00:06:07,810
Right, that's just the sum of
就是这个求和
177
00:06:08,060 --> 00:06:09,440
this, plus the sum
加上这个求和
178
00:06:09,760 --> 00:06:11,490
of this, plus the sum
加上这个求和
179
00:06:11,630 --> 00:06:13,000
of this, plus the sum
再加上这个求和
180
00:06:13,120 --> 00:06:14,290
of that, and those four
它们加起来的和
181
00:06:14,470 --> 00:06:15,830
things just add up to
其实和原先
182
00:06:15,920 --> 00:06:17,740
be equal to this sum that
我们使用批量梯度下降公式
183
00:06:17,880 --> 00:06:19,580
we were originally computing with batch gradient descent.
计算的结果是一样的
184
00:06:20,590 --> 00:06:21,550
And then we have the alpha times
接下来 我们有
185
00:06:21,860 --> 00:06:22,910
1 over 400, alpha times 1
α乘以400分之一
186
00:06:23,350 --> 00:06:24,690
over 400, and this is
这里也是α乘以400分之一
187
00:06:25,020 --> 00:06:27,020
exactly equivalent to the
因此这个公式
188
00:06:27,140 --> 00:06:29,390
batch gradient descent algorithm, only,
完全等同于批量梯度下降公式
189
00:06:29,910 --> 00:06:30,880
instead of needing to sum
唯一的不同是
190
00:06:31,290 --> 00:06:32,540
over all four hundred training
我们原本需要在一台计算机上
191
00:06:32,810 --> 00:06:33,900
examples on just one
完成400个训练样本的求和
192
00:06:34,040 --> 00:06:35,280
machine, we can instead
而现在
193
00:06:35,760 --> 00:06:37,460
divide up the work load on four machines.
我们将这个工作分给了4台计算机
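
As a rough Python sketch of one such parallelized gradient step for linear regression (the function names and the in-process loop are illustrative assumptions; in a real deployment each partial sum would be computed on its own machine):

import numpy as np

def partial_gradient(theta, X_chunk, y_chunk):
    # What one machine computes on its share of the data:
    # sum_i (h_theta(x_i) - y_i) * x_i, with one entry per parameter theta_j.
    return X_chunk.T @ (X_chunk @ theta - y_chunk)

def mapreduce_gradient_step(theta, chunks, alpha, m):
    # "Map": each chunk would normally be shipped to a separate machine;
    # here the partial sums are simply computed in a loop.
    temps = [partial_gradient(theta, Xc, yc) for Xc, yc in chunks]
    # "Reduce": the master adds the partial sums and applies the same
    # update as batch gradient descent over all m examples.
    return theta - (alpha / m) * sum(temps)

Called with the four chunks from the earlier split and m = 400, this produces exactly the same update as running batch gradient descent on a single machine.
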
194
00:06:39,090 --> 00:06:40,190
So, here's what the general
总结来说
195
00:06:40,630 --> 00:06:43,410
picture of the MapReduce technique looks like.
map reduce技术是这么工作的
196
00:06:45,060 --> 00:06:46,510
We have some training sets, and
我们有一些训练样本
197
00:06:46,670 --> 00:06:48,200
if we want to parallelize across four
如果我们希望使用4台计算机
198
00:06:48,420 --> 00:06:49,100
machines, we are going to
并行的运行机器学习算法
199
00:06:49,170 --> 00:06:51,670
take the training set and split it, you know, equally.
那么我们将训练样本等分
200
00:06:52,120 --> 00:06:54,640
Split it as evenly as we can into four subsets.
尽量均匀的分成4份