srt/1 - 4 - Unsupervised Learning (14 min).srt

﻿1
00:00:00,380 --> 00:00:01,550
In this video, we'll talk about
在这段视频中 我们将讨论


(字幕整理：中国海洋大学 黄海广,haiguang2000@qq.com )

2
00:00:01,670 --> 00:00:02,690
the second major type of machine
第二种主要的机器学习问题

3
00:00:03,010 --> 00:00:05,030
learning problem, called Unsupervised Learning.
叫做无监督学习

4
00:00:06,300 --> 00:00:08,500
In the last video, we talked about Supervised Learning.
在上一节视频中 我们已经讲过了监督学习

5
00:00:09,250 --> 00:00:10,700
Back then, recall data sets
回想起上次的数据集

6
00:00:11,020 --> 00:00:12,670
that look like this, where each
每个样本

7
00:00:12,890 --> 00:00:15,150
example was labeled either
都已经被标明为

8
00:00:15,610 --> 00:00:16,900
as a positive or negative example,
正样本或者负样本

9
00:00:17,530 --> 00:00:19,800
whether it was a benign or a malignant tumor.
即良性或恶性肿瘤

10
00:00:20,850 --> 00:00:21,920
So for each example in Supervised
因此 对于监督学习中的每一个样本

11
00:00:22,410 --> 00:00:24,270
Learning, we were told explicitly what
我们已经被清楚地告知了

12
00:00:24,440 --> 00:00:25,760
is the so-called right answer,
什么是所谓的正确答案

13
00:00:26,490 --> 00:00:27,580
whether it's benign or malignant.
即它们是良性还是恶性

14
00:00:28,550 --> 00:00:30,210
In Unsupervised Learning, we're given
在无监督学习中

15
00:00:30,540 --> 00:00:31,720
data that looks different
我们用的数据会和监督学习里的看起来有些不一样

16
00:00:31,950 --> 00:00:32,910
than data that looks like
在无监督学习中

17
00:00:33,190 --> 00:00:34,600
this that doesn't have
没有属性或标签这一概念

18
00:00:34,720 --> 00:00:35,920
any labels or that all
也就是说所有的数据

19
00:00:36,130 --> 00:00:37,460
has the same label or really no labels.
都是一样的 没有区别

20
00:00:39,680 --> 00:00:40,740
So we're given the data set and
所以在无监督学习中 我们只有一个数据集

21
00:00:40,980 --> 00:00:42,460
we're not told what to
没人告诉我们该怎么做

22
00:00:42,560 --> 00:00:43,290
do with it and we're not
我们也不知道

23
00:00:43,640 --> 00:00:44,800
told what each data point is.
每个数据点究竟是什么意思

24
00:00:45,290 --> 00:00:47,190
Instead we're just told, here is a data set.
相反 它只告诉我们 现在有一个数据集

25
00:00:47,870 --> 00:00:49,650
Can you find some structure in the data?
你能在其中找到某种结构吗？

26
00:00:50,480 --> 00:00:51,670
Given this data set, an
对于给定的数据集

27
00:00:52,350 --> 00:00:53,940
Unsupervised Learning algorithm might decide that
无监督学习算法可能判定

28
00:00:54,060 --> 00:00:56,090
the data lives in two different clusters.
该数据集包含两个不同的聚类

29
00:00:56,800 --> 00:00:57,960
And so there's one cluster
你看 这是第一个聚类

30
00:00:59,120 --> 00:00:59,910
and there's a different cluster.
然后这是另一个聚类

31
00:01:01,110 --> 00:01:02,710
And yes, Supervised Learning algorithm may
你猜对了 无监督学习算法

32
00:01:03,040 --> 00:01:05,070
break these data into these two separate clusters.
会把这些数据分成两个不同的聚类

33
00:01:06,410 --> 00:01:08,000
So this is called a clustering algorithm.
所以这就是所谓的聚类算法

34
00:01:08,860 --> 00:01:10,310
And this turns out to be used in many places.
实际上它被用在许多地方

35
00:01:11,930 --> 00:01:13,310
One example where clustering
我们来举一个聚类算法的栗子

36
00:01:13,530 --> 00:01:14,860
is used is in Google
Google 新闻的例子

37
00:01:15,060 --> 00:01:16,160
News and if you have not
如果你还没见过这个页面的话

38
00:01:16,360 --> 00:01:17,320
seen this before, you can actually
你可以到这个URL

39
00:01:18,210 --> 00:01:19,040
go to this URL news.google.com
news.google.com

40
00:01:19,830 --> 00:01:20,460
to take a look.
去看看

41
00:01:21,280 --> 00:01:22,970
What Google News does is everyday
谷歌新闻每天都在干什么呢？

42
00:01:23,480 --> 00:01:24,220
it goes and looks at tens
他们每天会去收集

43
00:01:24,470 --> 00:01:25,430
of thousands or hundreds of
成千上万的

44
00:01:25,720 --> 00:01:26,740
thousands of new stories on the
网络上的新闻

45
00:01:26,800 --> 00:01:29,410
web and it groups them into cohesive news stories.
然后将他们分组 组成一个个新闻专题

46
00:01:30,730 --> 00:01:31,690
For example, let's look here.
比如 让我们来看看这里

47
00:01:33,380 --> 00:01:35,370
The URLs here link
这里的URL链接

48
00:01:35,910 --> 00:01:37,260
to different news stories
连接着不同的

49
00:01:38,010 --> 00:01:40,110
about the BP Oil Well story.
有关BP油井事故的报道

50
00:01:41,300 --> 00:01:42,160
So, let's click on
所以 让我们点击

51
00:01:42,260 --> 00:01:43,090
one of these URL's and we'll
这些URL中的一个

52
00:01:43,550 --> 00:01:44,780
click on one of these URL's.
恩 让我们点一个

53
00:01:45,100 --> 00:01:46,970
What I'll get to is a web page like this.
然后我们会来到这样一个网页

54
00:01:47,210 --> 00:01:48,390
Here's a Wall Street
这是一篇来自华尔街日报的

55
00:01:48,590 --> 00:01:50,180
Journal article about, you know, the BP
有关……你懂的

56
00:01:51,110 --> 00:01:52,530
Oil Well Spill stories of
有关BP油井泄漏事故的报道

57
00:01:52,920 --> 00:01:54,350
"BP Kills Macondo",
标题为《BP杀死了Macondo》

58
00:01:54,590 --> 00:01:55,700
which is a name of the
Macondo 是个地名

59
00:01:55,980 --> 00:01:57,960
spill and if you
就是那个漏油事故的地方

60
00:01:58,020 --> 00:01:59,360
click on a different URL
如果你从这个组里点击一个不同的URL

61
00:02:00,690 --> 00:02:02,500
from that group then you might get the different story.
那么你可能会得到不同的新闻

62
00:02:02,950 --> 00:02:04,760
Here's the CNN story about a
这里是一则CNN的新闻

63
00:02:04,820 --> 00:02:06,090
game, the BP Oil Spill,
是一个有关BP石油泄漏的视频

64
00:02:07,090 --> 00:02:08,180
and if you click on yet
如果你再点击第三个链接

65
00:02:08,740 --> 00:02:10,990
a third link, then you might get a different story.
又会出现不同的新闻

66
00:02:11,440 --> 00:02:13,380
Here's the UK Guardian story
这边是英国卫报的报道

67
00:02:13,940 --> 00:02:15,510
about the BP Oil Spill.
也是关于BP石油泄漏

68
00:02:16,530 --> 00:02:17,790
So what Google News has done
所以 谷歌新闻所做的就是

69
00:02:17,990 --> 00:02:19,440
is look for tens of thousands of
去搜索成千上万条新闻

70
00:02:19,490 --> 00:02:22,170
news stories and automatically cluster them together.
然后自动的将他们聚合在一起

71
00:02:23,030 --> 00:02:24,660
So, the news stories that are all
因此 有关同一主题的

72
00:02:25,080 --> 00:02:27,010
about the same topic get displayed together.
新闻被显示在一起

73
00:02:27,210 --> 00:02:29,170
It turns out that
其实

74
00:02:29,380 --> 00:02:31,020
clustering algorithms and Unsupervised Learning
聚类算法和无监督学习算法

75
00:02:31,530 --> 00:02:33,550
algorithms are used in many other problems as well.
也可以被用于许多其他的问题

76
00:02:35,320 --> 00:02:36,690
Here's one on understanding genomics.
这里我们举个它在基因组学中的应用

77
00:02:38,270 --> 00:02:40,510
Here's an example of DNA microarray data.
下面是一个关于基因芯片的例子

78
00:02:40,990 --> 00:02:42,230
The idea is put
基本的思想是

79
00:02:42,430 --> 00:02:44,360
a group of different individuals and
给定一组不同的个体

80
00:02:44,510 --> 00:02:45,590
for each of them, you measure
对于每个个体

81
00:02:46,100 --> 00:02:48,580
how much they do or do not have a certain gene.
检测它们是否拥有某个特定的基因

82
00:02:49,050 --> 00:02:51,640
Technically you measure how much certain genes are expressed.
也就是说，你要去分析有多少基因显现出来了

83
00:02:52,000 --> 00:02:54,190
So these colors, red, green,
因此 这些颜色 红 绿

84
00:02:54,930 --> 00:02:56,210
gray and so on, they
灰 等等 它们

85
00:02:56,340 --> 00:02:57,500
show the degree to which
展示了这些不同的个体

86
00:02:57,780 --> 00:02:59,440
different individuals do or
是否拥有一个特定基因

87
00:02:59,510 --> 00:03:01,270
do not have a specific gene.
的不同程度

88
00:03:02,500 --> 00:03:03,400
And what you can do is then
然后你所能做的就是

89
00:03:03,610 --> 00:03:05,070
run a clustering algorithm to group
运行一个聚类算法

90
00:03:05,380 --> 00:03:07,140
individuals into different categories
把不同的个体归入不同的类

91
00:03:07,780 --> 00:03:08,810
or into different types of people.
或归为不同类型的人

92
00:03:10,230 --> 00:03:11,660
So this is Unsupervised Learning because
这就是无监督学习

93
00:03:11,930 --> 00:03:14,010
we're not telling the algorithm in advance
我们没有提前告知这个算法

94
00:03:14,590 --> 00:03:15,690
that these are type 1 people,
这些是第一类的人

95
00:03:16,130 --> 00:03:17,420
those are type 2 persons, those
这些是第二类的人

96
00:03:17,560 --> 00:03:18,650
are type 3 persons and so
这些是第三类的人等等

97
00:03:19,610 --> 00:03:22,390
on and instead what were saying is yeah here's a bunch of data.
相反我们只是告诉算法 你看 这儿有一堆数据

98
00:03:23,110 --> 00:03:24,030
I don't know what's in this data.
我不知道这个数据是什么东东

99
00:03:24,750 --> 00:03:25,870
I don't know who's and what type.
我不知道里面都有些什么类型 叫什么名字

100
00:03:26,150 --> 00:03:26,940
I don't even know what the different
我甚至不知道都有哪些类型

101
00:03:27,260 --> 00:03:28,480
types of people are, but can
但是

102
00:03:28,610 --> 00:03:30,210
you automatically find structure in
请问你可以自动的找到这些数据中的类型吗？

103
00:03:30,360 --> 00:03:31,260
the data from the you automatically
然后自动的

104
00:03:32,180 --> 00:03:33,620
cluster the individuals into these types
按得到的类型把这些个体分类

105
00:03:33,870 --> 00:03:35,490
that I don't know in advance?
虽然事先我并不知道哪些类型

106
00:03:35,890 --> 00:03:37,610
Because we're not giving the algorithm
因为对于这些数据样本来说

107
00:03:38,160 --> 00:03:40,140
the right answer for the
我们没有给算法一个

108
00:03:40,370 --> 00:03:41,270
examples in my data
正确答案

109
00:03:41,590 --> 00:03:43,090
set, this is Unsupervised Learning.
所以 这就是无监督学习

110
00:03:44,290 --> 00:03:47,040
Unsupervised Learning or clustering is used for a bunch of other applications.
无监督学习或聚类算法在其他领域也有着大量的应用

111
00:03:48,340 --> 00:03:50,340
It's used to organize large computer clusters.
它被用来组织大型的计算机集群

112
00:03:51,390 --> 00:03:52,530
I had some friends looking at
我有一些朋友在管理

113
00:03:52,680 --> 00:03:53,970
large data centers, that is
大型数据中心 也就是

114
00:03:54,180 --> 00:03:55,970
large computer clusters and trying
大型计算机集群 并试图

115
00:03:56,230 --> 00:03:57,470
to figure out which machines tend to
找出哪些机器趋向于

116
00:03:57,590 --> 00:03:59,130
work together and if
协同工作

117
00:03:59,200 --> 00:04:00,270
you can put those machines together,
如果你把这些机器放在一起

118
00:04:01,100 --> 00:04:03,220
you can make your data center work more efficiently.
你就可以让你的数据中心更高效地工作

119
00:04:04,810 --> 00:04:06,820
This second application is on social network analysis.
第二种应用是用于社交网络的分析

120
00:04:07,890 --> 00:04:09,230
So given knowledge about which friends
所以 如果可以得知

121
00:04:09,630 --> 00:04:10,840
you email the most or
哪些朋友你用email联系的最多

122
00:04:10,880 --> 00:04:12,150
given your Facebook friends or
或者知道你的Facebook好友

123
00:04:12,180 --> 00:04:14,150
your Google+ circles, can
或者你Google+里的朋友

124
00:04:14,290 --> 00:04:16,380
we automatically identify which are
知道了这些之后

125
00:04:16,450 --> 00:04:17,950
cohesive groups of friends,
我们是否可以自动识别

126
00:04:18,460 --> 00:04:19,420
also which are groups of people
哪些是很要好的朋友组

127
00:04:20,230 --> 00:04:21,010
that all know each other?
哪些仅仅是互相认识的朋友组

128
00:04:22,540 --> 00:04:22,880
Market segmentation.
还有在市场分割中的应用

129
00:04:24,680 --> 00:04:26,780
Many companies have huge databases of customer information.
许多公司拥有庞大的客户信息数据库

130
00:04:27,700 --> 00:04:28,410
So, can you look at this
那么 给你一个

131
00:04:28,510 --> 00:04:30,000
customer data set and automatically
客户数据集 你能否

132
00:04:30,740 --> 00:04:32,340
discover market segments and automatically
自动找出不同的市场分割

133
00:04:33,340 --> 00:04:35,290
group your customers into different
并自动将你的客户分到不同的

134
00:04:35,820 --> 00:04:37,400
market segments so that
细分市场中

135
00:04:37,710 --> 00:04:39,490
you can automatically and more
从而有助于我在

136
00:04:39,650 --> 00:04:41,580
efficiently sell or market
不同的细分市场中

137
00:04:41,890 --> 00:04:43,250
your different market segments together?
进行更有效的销售

138
00:04:44,260 --> 00:04:45,580
Again, this is Unsupervised Learning
这也是无监督学习

139
00:04:45,820 --> 00:04:46,720
because we have all this
我们现在有

140
00:04:46,900 --> 00:04:48,340
customer data, but we don't
这些客户数据

141
00:04:48,590 --> 00:04:49,710
know in advance what are the
但我们预先并不知道

142
00:04:49,790 --> 00:04:51,270
market segments and for
有哪些细分市场

143
00:04:51,440 --> 00:04:52,570
the customers in our data
而且

144
00:04:52,660 --> 00:04:53,590
set, you know, we don't know in
对于我们数据集的某个客户

145
00:04:53,690 --> 00:04:54,700
advance who is in
我们也不能预先知道

146
00:04:54,800 --> 00:04:55,840
market segment one, who is
谁属于细分市场一

147
00:04:55,940 --> 00:04:57,800
in market segment two, and so on.
谁又属于细分市场二等等

148
00:04:57,930 --> 00:05:00,630
But we have to let the algorithm discover all this just from the data.
但我们必须让这个算法自己去从数据中发现这一切

149
00:05:01,970 --> 00:05:03,140
Finally, it turns out that Unsupervised
最后

150
00:05:03,690 --> 00:05:05,620
Learning is also used for
事实上无监督学习也被用于

151
00:05:06,090 --> 00:05:08,060
surprisingly astronomical data analysis
天文数据分析

152
00:05:08,890 --> 00:05:10,390
and these clustering algorithms gives
通过这些聚类算法 我们发现了许多

153
00:05:10,580 --> 00:05:12,440
surprisingly interesting useful theories
惊人的、有趣的 以及实用的

154
00:05:12,900 --> 00:05:15,610
of how galaxies are born.
关于星系是如何诞生的理论

155
00:05:15,880 --> 00:05:17,620
All of these are examples of clustering,
所有这些都是聚类算法的例子

156
00:05:18,400 --> 00:05:20,550
which is just one type of Unsupervised Learning.
而聚类只是无监督学习的一种

157
00:05:21,530 --> 00:05:22,470
Let me tell you about another one.
现在让我来告诉你另一种

158
00:05:23,200 --> 00:05:25,020
I'm gonna tell you about the cocktail party problem.
我先来介绍一下鸡尾酒宴问题

159
00:05:26,310 --> 00:05:28,270
So, you've been to cocktail parties before, right?
恩 我想你参加过鸡尾酒会的 是吧？

160
00:05:28,440 --> 00:05:30,080
Well, you can imagine there's a
嗯 想象一下

161
00:05:30,300 --> 00:05:31,690
party, room full of people, all
有一个宴会 有一屋子的人

162
00:05:31,870 --> 00:05:32,930
sitting around, all talking at the
大家都坐在一起

163
00:05:32,970 --> 00:05:34,390
same time and there are
而且在同时说话

164
00:05:34,480 --> 00:05:36,230
all these overlapping voices because everyone
有许多声音混杂在一起

165
00:05:36,590 --> 00:05:37,920
is talking at the same time, and
因为每个人都是在同一时间说话的

166
00:05:38,070 --> 00:05:39,730
it is almost hard to hear the person in front of you.
在这种情况下你很难听清楚你面前的人说的话

167
00:05:40,690 --> 00:05:41,970
So maybe at a
因此 比如有这样一个场景

168
00:05:42,020 --> 00:05:43,990
cocktail party with two people,
宴会上只有两个人

169
00:05:45,690 --> 00:05:46,670
two people talking at the same
两个人

170
00:05:46,770 --> 00:05:48,090
time, and it's a somewhat
同时说话

171
00:05:48,740 --> 00:05:49,710
small cocktail party.
恩 这是个很小的鸡尾酒宴会

172
00:05:50,690 --> 00:05:51,630
And we're going to put two
我们准备好了两个麦克风

173
00:05:51,890 --> 00:05:53,080
microphones in the room so
把它们放在房间里

174
00:05:54,060 --> 00:05:55,640
there are microphones, and because
然后

175
00:05:56,050 --> 00:05:57,430
these microphones are at two
因为这两个麦克风距离这两个人

176
00:05:57,560 --> 00:05:58,900
different distances from the
的距离是不同的

177
00:05:58,990 --> 00:06:01,250
speakers, each microphone records
每个麦克风都记录下了

178
00:06:01,830 --> 00:06:04,720
a different combination of these two speaker voices.
来自两个人的声音的不同组合

179
00:06:05,810 --> 00:06:06,970
Maybe speaker one is a
也许A的声音

180
00:06:07,120 --> 00:06:08,320
little louder in microphone one
在第一个麦克风里的声音会响一点

181
00:06:09,120 --> 00:06:10,680
and maybe speaker two is a
也许B的声音

182
00:06:10,800 --> 00:06:12,350
little bit louder on microphone 2
在第二个麦克风里会比较响一些

183
00:06:12,560 --> 00:06:14,040
because the 2 microphones are
因为2个麦克风

184
00:06:14,230 --> 00:06:15,950
at different positions relative to
的位置相对于

185
00:06:16,400 --> 00:06:19,020
the 2 speakers, but each
2个说话者的位置是不同的

186
00:06:19,250 --> 00:06:20,390
microphone would cause an overlapping
但每个麦克风都会录到

187
00:06:20,970 --> 00:06:22,590
combination of both speakers' voices.
来自两个说话者的重叠部分的声音

188
00:06:23,960 --> 00:06:25,500
So here's an actual recording
这里有一个

189
00:06:26,520 --> 00:06:29,280
of two speakers recorded by a researcher.
来自一个研究员录下的两个说话者的声音

190
00:06:29,740 --> 00:06:30,950
Let me play for you the
让我先放给你听第一个

191
00:06:31,060 --> 00:06:32,760
first, what the first microphone sounds like.
这是第一个麦克风录到的录音：

192
00:06:33,560 --> 00:06:34,800
One (uno), two (dos),
一 (UNO)  二 (DOS)

193
00:06:35,070 --> 00:06:36,590
three (tres), four (cuatro), five
三 (TRES)  四 (CUATRO)  五 (CINCO)

194
00:06:37,060 --> 00:06:38,550
(cinco), six (seis), seven (siete),
六 (SEIS)  七 (SIETE)

195
00:06:38,990 --> 00:06:40,610
eight (ocho), nine (nueve), ten (y diez).
八 (ocho)  九 (NUEVE)  十 (Y DIEZ)

196
00:06:41,610 --> 00:06:42,650
All right, maybe not the most interesting cocktail
好吧 这大概不是什么有趣的酒会……

197
00:06:43,000 --> 00:06:44,270
party, there's two people
……在这个酒会上 有两个人

198
00:06:44,620 --> 00:06:45,670
counting from one to ten
各自从1数到10

199
00:06:46,010 --> 00:06:47,880
in two languages but you know.
但用的是两种不同语言

200
00:06:48,870 --> 00:06:49,760
What you just heard was the
你刚才听到的是

201
00:06:49,820 --> 00:06:52,500
first microphone recording, here's the second recording.
第一个麦克风的录音 这里是第二个的：

202
00:06:57,440 --> 00:06:58,040
Uno (one), dos (two), tres (three), cuatro
一 (UNO)  二 (DOS)  三 (TRES)

203
00:06:58,060 --> 00:06:58,730
(four), cinco (five), seis (six), siete (seven),
四 (CUATRO) 五 (CINCO)  六 (SEIS)  七 (SIETE)

204
00:06:59,160 --> 00:07:00,900
ocho (eight), nueve (nine) y diez (ten).
八 (ocho) 九 (NUEVE)  十 (Y DIEZ)

205
00:07:01,860 --> 00:07:02,850
So we can do, is take
所以 我们能做的就是把

206
00:07:03,380 --> 00:07:04,660
these two microphone recorders and give
这两个录音输入

207
00:07:04,980 --> 00:07:06,480
them to an Unsupervised Learning algorithm
一种无监督学习算法中

208
00:07:07,010 --> 00:07:08,560
called the cocktail party algorithm,
称为“鸡尾酒会算法”

209
00:07:08,780 --> 00:07:09,910
and tell the algorithm
让这个算法

210
00:07:10,450 --> 00:07:12,140
- find structure in this data for you.
帮你找出其中蕴含的分类

211
00:07:12,250 --> 00:07:14,010
And what the algorithm will do
然后这个算法

212
00:07:14,410 --> 00:07:15,730
is listen to these
就会去听这些

213
00:07:15,980 --> 00:07:17,990
audio recordings and say, you
录音 并且你知道

214
00:07:18,140 --> 00:07:19,020
know it sounds like the
这听起​​来像

215
00:07:19,360 --> 00:07:20,950
two audio recordings are being
两个音频录音

216
00:07:21,240 --> 00:07:22,450
added together or that have being
被叠加在一起

217
00:07:22,670 --> 00:07:25,220
summed together to produce these recordings that we had.
所以我们才能听到这样的效果

218
00:07:25,990 --> 00:07:27,330
Moreover, what the cocktail party
此外 这个算法

219
00:07:27,710 --> 00:07:29,210
algorithm will do is separate
还会分离出

220
00:07:29,570 --> 00:07:30,810
out these two audio sources
这两个被

221
00:07:31,480 --> 00:07:32,700
that were being added or being
叠加到一起的

222
00:07:33,000 --> 00:07:34,240
summed together to form other
音频源

223
00:07:34,410 --> 00:07:35,600
recordings and, in fact,
事实上

224
00:07:36,200 --> 00:07:38,630
here's the first output of the cocktail party algorithm.
这是我们的鸡尾酒会算法的第一个输出

225
00:07:39,790 --> 00:07:41,910
One, two, three, four,
一 二 三 四

226
00:07:42,590 --> 00:07:46,270
five, six, seven, eight, nine, ten.
五 六 七 八 九 十

227
00:07:47,630 --> 00:07:48,780
So, I separated out the English
所以我在一个录音中

228
00:07:49,240 --> 00:07:51,220
voice in one of the recordings.
分离出了英文声音

229
00:07:52,460 --> 00:07:53,300
And here's the second of it.
这是第二个输出

230
00:07:53,380 --> 00:07:55,280
Uno, dos, tres, quatro, cinco,
Uno dos tres quatro cinco

231
00:07:55,980 --> 00:07:59,830
seis, siete, ocho, nueve y diez.
seis siete ocho nueve y diez

232
00:08:00,270 --> 00:08:01,180
Not too bad, to give you
听起来不错嘛

233
00:08:03,810 --> 00:08:05,270
one more example, here's another
再举一个例子 这是另一个录音

234
00:08:05,600 --> 00:08:07,370
recording of another similar situation,
也是在一个类似的场景下

235
00:08:08,060 --> 00:08:09,790
here's the first microphone :  One,
这是第一个麦克风的录音：

236
00:08:10,470 --> 00:08:12,430
two, three, four, five, six,
一 二 三 四 五 六

237
00:08:13,370 --> 00:08:15,710
seven, eight, nine, ten.
七 八 九 十

238
00:08:16,980 --> 00:08:17,920
OK so the poor guy's gone
OK 这个可怜的家伙从

239
00:08:18,180 --> 00:08:19,350
home from the cocktail party and
鸡尾酒会回家了

240
00:08:19,420 --> 00:08:21,880
he 's now sitting in a room by himself talking to his radio.
他现在独自一人坐在屋里 对着录音机自言自语

241
00:08:23,090 --> 00:08:24,130
Here's the second microphone recording.
这是第二个麦克风的录音

242
00:08:28,810 --> 00:08:31,800
One, two, three, four, five, six, seven, eight, nine, ten.
一 二 三 四 五 六 七 八 九 十

243
00:08:33,310 --> 00:08:34,160
When you give these two microphone
当你把这两个麦克风录音

244
00:08:34,610 --> 00:08:35,530
recordings to the same algorithm,
送给与刚刚相同的算法处理

245
00:08:36,360 --> 00:08:37,790
what it does, is again say,
它所做的还是

246
00:08:38,380 --> 00:08:39,470
you know, it sounds like there
告诉你 这听起来有

247
00:08:39,690 --> 00:08:41,370
are two audio sources, and moreover,
两种音频源 并且

248
00:08:42,410 --> 00:08:43,820
the album says, here is
算法说

249
00:08:44,070 --> 00:08:46,010
the first of the audio sources I found.
这里是我找到的第一个音频源

250
00:08:47,480 --> 00:08:49,300
One, two, three, four,
一 二 三 四

251
00:08:49,730 --> 00:08:53,430
five, six, seven, eight, nine, ten.
五 六 七 八 九 十

252
00:08:54,650 --> 00:08:56,110
So that wasn't perfect, it
恩 不是太完美

253
00:08:56,340 --> 00:08:57,360
got the voice, but it
提取到了人声

254
00:08:57,570 --> 00:08:59,070
also got a little bit of the music in there.
但还有一点音乐没有剔除掉

255
00:08:59,890 --> 00:09:01,360
Then here's the second output to the algorithm.
这是算法的第二个输出

256
00:09:10,020 --> 00:09:11,310
Not too bad, in that second
还好 在第二个输出中

257
00:09:11,540 --> 00:09:13,300
output it managed to get rid of the voice entirely.
它设法剔除掉了整个人声

258
00:09:13,760 --> 00:09:14,850
And just, you know,
只是清理了下音乐

259
00:09:15,020 --> 00:09:17,380
cleaned up the music, got rid of the counting from one to ten.
剔除了从一到十的计数

260
00:09:18,840 --> 00:09:20,090
So you might look at
所以 你可以看到

261
00:09:20,180 --> 00:09:21,750
an Unsupervised Learning algorithm like
像这样的无监督学习算法

262
00:09:21,950 --> 00:09:23,050
this and ask how
也许你想问 要实现这样的算法

263
00:09:23,250 --> 00:09:25,110
complicated this is to implement this, right?
很复杂吧？

264
00:09:25,330 --> 00:09:26,560
It seems like in order to,
看起来 为了

265
00:09:26,970 --> 00:09:28,870
you know, build this application, it seems
构建这个应用程序

266
00:09:28,930 --> 00:09:30,550
like to do this audio processing you
做这个音频处理

267
00:09:30,670 --> 00:09:31,430
need to write a ton of code
似乎需要写好多代码啊

268
00:09:32,240 --> 00:09:33,580
or maybe link into like a
或者需要链接到

269
00:09:33,690 --> 00:09:35,380
bunch of synthesizer Java libraries that
一堆处理音频的Java库

270
00:09:35,470 --> 00:09:37,150
process audio, seems like
貌似需要一个

271
00:09:37,240 --> 00:09:38,880
a really complicated program, to do
非常复杂的程序

272
00:09:39,060 --> 00:09:41,040
this audio, separating out audio and so on.
分离出音频等

273
00:09:42,460 --> 00:09:43,860
It turns out the algorithm, to
实际上

274
00:09:44,070 --> 00:09:45,640
do what you just heard, that
要实现你刚刚听到的效果

275
00:09:45,900 --> 00:09:47,280
can be done with one line
只需要一行代码就可以了

276
00:09:47,530 --> 00:09:49,270
of code - shown right here.
写在这里呢

277
00:09:50,640 --> 00:09:52,350
It take researchers a long
当然 研究人员

278
00:09:52,610 --> 00:09:54,060
time to come up with this line of code.
花了很长时间才想出这行代码的 ^-^

279
00:09:54,490 --> 00:09:56,090
I'm not saying this is an easy problem,
我不是说这是一个简单的问题

280
00:09:57,080 --> 00:09:57,980
But it turns out that when you
但事实上 如果你

281
00:09:58,180 --> 00:10:00,330
use the right programming environment, many learning
使用正确的编程环境 许多学习

282
00:10:00,670 --> 00:10:02,060
algorithms can be really short programs.
算法是用很短的代码写出来的

283
00:10:03,510 --> 00:10:04,700
So this is also why in
所以这也是为什么在

284
00:10:04,840 --> 00:10:05,890
this class we're going to
这门课中我们要

285
00:10:06,010 --> 00:10:07,430
use the Octave programming environment.
使用Octave的编程环境

286
00:10:08,550 --> 00:10:09,910
Octave, is free open source
Octave是一个免费的

287
00:10:10,120 --> 00:10:11,620
software, and using a
开放源码的软件

288
00:10:11,670 --> 00:10:13,130
tool like Octave or Matlab,
使用Octave或Matlab这类的工具

289
00:10:14,000 --> 00:10:15,400
many learning algorithms become just
许多学习算法

290
00:10:15,690 --> 00:10:17,910
a few lines of code to implement.
都可以用几行代码就可以实现

291
00:10:18,380 --> 00:10:19,400
Later in this class, I'll just teach
在后续课程中

292
00:10:19,620 --> 00:10:20,570
you a little bit about how to
我会教你如何使用Octave

293
00:10:20,720 --> 00:10:21,920
use Octave and you'll be
你会学到

294
00:10:22,050 --> 00:10:24,590
implementing some of these algorithms in Octave.
如何在Octave中实现这些算法

295
00:10:24,980 --> 00:10:26,050
Or if you have Matlab you can use that too.
或者 如果你有Matlab 你可以用它

296
00:10:27,120 --> 00:10:28,500
It turns out the Silicon Valley, for
事实上 在硅谷

297
00:10:28,620 --> 00:10:29,470
a lot of machine learning algorithms,
很多的机器学习算法

298
00:10:30,290 --> 00:10:31,310
what we do is first prototype
我们都是先用Octave

299
00:10:32,040 --> 00:10:33,900
our software in Octave because software
写一个程序原型

300
00:10:34,330 --> 00:10:35,250
in Octave makes it incredibly fast
因为在Octave中实现这些

301
00:10:35,540 --> 00:10:36,920
to implement these learning algorithms.
学习算法的速度快得让你无法想象

302
00:10:38,230 --> 00:10:39,110
Here each of these functions
在这里 每一个函数

303
00:10:39,720 --> 00:10:41,460
like for example the SVD
例如 SVD

304
00:10:41,680 --> 00:10:42,920
function that stands for singular
意思是奇异值分解

305
00:10:43,240 --> 00:10:44,520
value decomposition; but that turns
但这其实是解线性方程

306
00:10:44,640 --> 00:10:45,690
out to be a
的一个惯例

307
00:10:45,820 --> 00:10:48,420
linear algebra routine, that is just built into Octave.
它被内置在Octave软件中了

308
00:10:49,500 --> 00:10:50,390
If you were trying to do this
如果你试图

309
00:10:50,460 --> 00:10:51,490
in C++ or Java,
在C + +或Java中做这个

310
00:10:51,780 --> 00:10:53,040
this would be many many lines of
将需要写N多代码

311
00:10:53,180 --> 00:10:55,680
code linking complex C++ or Java libraries.
并且还要连接复杂的C + +或Java库

312
00:10:56,440 --> 00:10:57,490
So, you can implement this stuff as
所以 你可以在C++或

313
00:10:57,680 --> 00:10:58,690
C++ or Java
Java或Python中

314
00:10:59,050 --> 00:11:00,090
or Python, it's just much
实现这个算法 只是会

315
00:11:00,290 --> 00:11:02,090
more complicated to do so in those languages.
更加复杂而已

316
00:11:03,750 --> 00:11:05,060
What I've seen after having taught
在教授机器学习

317
00:11:05,300 --> 00:11:06,980
machine learning for almost a
将近10年后

318
00:11:07,210 --> 00:11:08,680
decade now, is that, you
我得出的一个经验就是

319
00:11:08,890 --> 00:11:10,340
learn much faster if you
如果你使用Octave的话

320
00:11:10,480 --> 00:11:11,700
use Octave as your
会学的更快

321
00:11:11,790 --> 00:11:14,070
programming environment, and if
并且如果你用

322
00:11:14,250 --> 00:11:15,570
you use Octave as your
Octave作为你的学习工具

323
00:11:16,260 --> 00:11:17,110
learning tool and as your
和开发原型的工具

324
00:11:17,240 --> 00:11:18,640
prototyping tool, it'll let
你的学习和开发过程

325
00:11:19,000 --> 00:11:21,280
you learn and prototype learning algorithms much more quickly.
会变得更快

326
00:11:22,640 --> 00:11:23,850
And in fact what many people will
而事实上在硅谷

327
00:11:23,990 --> 00:11:25,390
do to in the large Silicon
很多人会这样做

328
00:11:25,730 --> 00:11:27,360
Valley companies is in fact, use
他们会先用Octave

329
00:11:27,560 --> 00:11:29,020
an algorithm like Octave to first
来实现这样一个学习算法原型

330
00:11:29,370 --> 00:11:31,110
prototype the learning algorithm, and
只有在确定

331
00:11:31,510 --> 00:11:32,780
only after you've gotten it
这个算法可以工作后

332
00:11:32,860 --> 00:11:33,820
to work, then you migrate
才开始迁移到

333
00:11:34,390 --> 00:11:35,910
it to C++ or Java or whatever.
C++ Java或其它编译环境

334
00:11:36,890 --> 00:11:37,960
It turns out that by doing
事实证明 这样做

335
00:11:38,220 --> 00:11:39,070
things this way, you can often
实现的算法

336
00:11:39,400 --> 00:11:40,440
get your algorithm to work much
比你一开始就用C++

337
00:11:41,300 --> 00:11:43,050
faster than if you were starting out in C++.
实现的算法要快多了

338
00:11:44,440 --> 00:11:46,010
So, I know that as an
所以 我知道

339
00:11:46,100 --> 00:11:47,490
instructor, I get to
作为一个老师

340
00:11:47,570 --> 00:11:48,580
say "trust me on
我不能老是念叨：

341
00:11:48,730 --> 00:11:49,790
this one" only a finite
“在这个问题上相信我“

342
00:11:50,030 --> 00:11:51,420
number of times, but for
但对于

343
00:11:51,560 --> 00:11:52,720
those of you who've never used these
那些从来没有用过这种

344
00:11:53,330 --> 00:11:54,880
Octave type programming environments before,
类似Octave的编程环境的童鞋

345
00:11:55,240 --> 00:11:56,070
I am going to ask you
我还是要请你

346
00:11:56,130 --> 00:11:56,970
to trust me on this one,
相信我这一次

347
00:11:57,570 --> 00:11:58,950
and say that you, you will,
我认为

348
00:11:59,700 --> 00:12:01,180
I think your time, your development
你的时间 研发时间

349
00:12:01,700 --> 00:12:03,100
time is one of the most valuable resources.
是你最宝贵的资源之一

350
00:12:04,210 --> 00:12:05,570
And having seen lots
当见过很多的人这样做以后

351
00:12:05,800 --> 00:12:06,850
of people do this, I think
我觉得如果你也这样做

352
00:12:07,190 --> 00:12:08,460
you as a machine learning
作为一个机器学习的

353
00:12:08,850 --> 00:12:09,990
researcher, or machine learning developer
研究者和开发者

354
00:12:10,830 --> 00:12:12,080
will be much more productive if
你会更有效率

355
00:12:12,220 --> 00:12:13,010
you learn to start in prototype,
如果你学会先用Octave开发原型

356
00:12:13,580 --> 00:12:15,250
to start in Octave, in some other language.
而不是先用其他的编程语言来开发

357
00:12:17,570 --> 00:12:19,790
Finally, to wrap
最后 总结一下

358
00:12:20,090 --> 00:12:22,890
up this video, I have one quick review question for you.
这里有一个问题需要你来解答

359
00:12:24,400 --> 00:12:26,400
We talked about Unsupervised Learning, which
我们谈到了无监督学习

360
00:12:26,700 --> 00:12:27,670
is a learning setting where you
它是一种学习机制

361
00:12:27,760 --> 00:12:28,730
give the algorithm a ton
你给算法大量的数据

362
00:12:28,840 --> 00:12:30,120
of data and just ask it
要求它找出数据中

363
00:12:30,240 --> 00:12:32,900
to find structure in the data for us.
蕴含的类型结构

364
00:12:33,160 --> 00:12:35,170
Of the following four examples, which
以下的四个例子中

365
00:12:35,490 --> 00:12:36,410
ones, which of these four
哪一个

366
00:12:36,870 --> 00:12:37,630
do you think would will be
您认为是

367
00:12:37,720 --> 00:12:39,520
an Unsupervised Learning algorithm as
无监督学习算法

368
00:12:40,220 --> 00:12:41,950
opposed to Supervised Learning problem.
而不是监督学习问题

369
00:12:42,730 --> 00:12:43,590
For each of the four
对于每一个选项

370
00:12:43,860 --> 00:12:44,850
check boxes on the left,
在左边的复选框

371
00:12:45,640 --> 00:12:46,900
check the ones for which
选中你认为

372
00:12:47,210 --> 00:12:49,400
you think Unsupervised Learning
属于无监督学习的

373
00:12:49,700 --> 00:12:51,300
algorithm would be appropriate and
选项

374
00:12:51,440 --> 00:12:53,930
then click the button on the lower right to check your answer.
然后按一下右下角的按钮 提交你的答案

375
00:12:54,690 --> 00:12:57,030
So when the video pauses, please
所以 当视频暂停时

376
00:12:57,370 --> 00:12:58,750
answer the question on the slide.
请回答幻灯片上的这个问题

377
00:13:01,860 --> 00:13:03,950
So, hopefully, you've remembered the spam folder problem.
恩 没忘记垃圾邮件文件夹问题吧？

378
00:13:04,710 --> 00:13:06,310
If you have labeled data, you
如果你已经标记过数据

379
00:13:06,450 --> 00:13:07,680
know, with spam and
那么就有垃圾邮件和

380
00:13:07,800 --> 00:13:10,470
non-spam e-mail, we'd treat this as a Supervised Learning problem.
非垃圾邮件的区别 我们会将此视为一个监督学习问题

381
00:13:11,620 --> 00:13:13,870
The news story example, that's
新闻故事的例子

382
00:13:14,100 --> 00:13:15,370
exactly the Google News example
正是我们在本课中讲到的

383
00:13:15,910 --> 00:13:16,600
that we saw in this video,
谷歌新闻的例子

384
00:13:17,090 --> 00:13:17,950
we saw how you can use
我们介绍了你可以如何使用

385
00:13:18,080 --> 00:13:19,460
a clustering algorithm to cluster
聚类算法这些文章聚合在一起

386
00:13:19,880 --> 00:13:21,980
these articles together so that's Unsupervised Learning.
所以这是无监督学习问题

387
00:13:23,250 --> 00:13:25,440
The market segmentation example I
市场细分的例子

388
00:13:25,510 --> 00:13:27,120
talked a little bit earlier, you
我之前有说过

389
00:13:27,220 --> 00:13:29,110
can do that as an Unsupervised Learning problem
这也是一个无监督学习问题

390
00:13:29,970 --> 00:13:30,860
because I am just gonna
因为我是要

391
00:13:30,930 --> 00:13:32,340
get my algorithm data and ask
拿到数据 然后要求

392
00:13:32,500 --> 00:13:34,340
it to discover market segments automatically.
它自动发现细分市场

393
00:13:35,610 --> 00:13:37,930
And the final example, diabetes, well,
最后一个例子 糖尿病

394
00:13:38,070 --> 00:13:39,080
that's actually just like our
这实际上就像我们

395
00:13:39,350 --> 00:13:41,480
breast cancer example from the last video.
上节课讲到的乳腺癌的例子

396
00:13:42,190 --> 00:13:43,320
Only instead of, you know,
只不过这里不是

397
00:13:43,600 --> 00:13:45,280
good and bad cancer tumors or
好的或坏的癌细胞

398
00:13:45,550 --> 00:13:47,390
benign or malignant tumors we
良性或恶性肿瘤我们

399
00:13:47,550 --> 00:13:49,270
instead have diabetes or
现在是有糖尿病或

400
00:13:49,330 --> 00:13:50,440
not and so we will
没有糖尿病 所以这是

401
00:13:50,700 --> 00:13:51,830
use that as a supervised,
有监督的学习问题

402
00:13:52,370 --> 00:13:53,740
we will solve that as
像处理那个乳腺癌的问题一样

403
00:13:53,870 --> 00:13:54,670
a Supervised Learning problem just like
我们会把它作为一个

404
00:13:54,730 --> 00:13:56,450
we did for the breast tumor data.
有监督的学习问题来处理

405
00:13:58,270 --> 00:13:59,400
So, that's it for Unsupervised
好了 关于无监督学习问题

406
00:14:00,100 --> 00:14:01,580
Learning and in the
就讲这么多了

407
00:14:01,650 --> 00:14:02,940
next video, we'll delve more
下一节课中我们

408
00:14:03,270 --> 00:14:04,600
into specific learning algorithms
会涉及到更具体的学习算法

409
00:14:05,550 --> 00:14:06,590
and start to talk about
并开始讨论

410
00:14:07,220 --> 00:14:08,750
just how these algorithms work and
这些算法是如何工作的

411
00:14:08,920 --> 00:14:11,270
how we can, how you can go about implementing them.
以及我们如何来实现它们