<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta name="generator" content="pandoc" />
<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
margin: 0; padding: 0; vertical-align: baseline; border: none; }
table.sourceCode { width: 100%; line-height: 100%; }
td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
td.sourceCode { padding-left: 5px; }
code > span.kw { color: #007020; font-weight: bold; }
code > span.dt { color: #902000; }
code > span.dv { color: #40a070; }
code > span.bn { color: #40a070; }
code > span.fl { color: #40a070; }
code > span.ch { color: #4070a0; }
code > span.st { color: #4070a0; }
code > span.co { color: #60a0b0; font-style: italic; }
code > span.ot { color: #007020; }
code > span.al { color: #ff0000; font-weight: bold; }
code > span.fu { color: #06287e; }
code > span.er { color: #ff0000; font-weight: bold; }
</style>
<link rel="stylesheet" href="../stylesheets/Github.css" type="text/css" />
<title>Stanford Machine Learning Course Notes 4: Kmeans and the Gaussian Mixture Model</title>
</head>
<body>
<div id="header"><center>
<p class="header_titleline">
<a href="../index.html" target="_self" title="主页">主页 </a><a href="../Search.html" target="_self" title="站内搜索">站内搜索 </a><a href="../Projects.html" target="_self" title="项目研究">项目研究 </a><a href="../Archives.html" target="_self" title="文章存档">文章存档 </a><a href="../README.html" target="_self" title="分类目录">分类目录 </a><a href="../AboutMe.html" target="_self" title="关于我">关于我 </a>
</p>
</center></div>
<h1>Stanford Machine Learning Course Notes 4: Kmeans and the Gaussian Mixture Model</h1>
<h4>2014-05-15 / xiahouzuoxin</h4>
<h4>Tags: Machine Learning</h4>
Please credit the source when reposting: <a href="http://xiahouzuoxin.github.io/notes/">http://xiahouzuoxin.github.io/notes/</a>
<div id="TOC">
<ul>
<li><a href="#kmeans聚类">Kmeans聚类</a></li>
<li><a href="#混合高斯模型mixture-of-gaussian">混合高斯模型(Mixture of Gaussian)</a></li>
<li><a href="#参考">参考</a></li>
</ul>
</div>
<!---title:Stanford机器学习课程笔记4-Kmeans与高斯混合模型-->
<!---keywords:机器学习-->
<!---date:2014-05-15-->
<p>This part belongs to unsupervised learning, which mainly covers the Kmeans clustering algorithm, Gaussian mixture models and the EM algorithm, Factor Analysis, PCA, ICA, and so on. This post contains notes on Kmeans clustering and Gaussian mixture models. The EM algorithm is a general algorithm for models with latent/hidden variables, and the Gaussian mixture model is merely one special case of it; for the derivation of the EM algorithm, see Andrew Ng's lecture notes. Since the formulas are many and I have been busy with an internship lately, I will keep this brief; when revisiting the topic I will still need to consult Ng's notes and my own printed notes, but the code here may offer some additional help for understanding.</p>
<h2 id="kmeans聚类">Kmeans聚类</h2>
<p>Besides Kmeans there are other clustering approaches; the ones I have encountered include hierarchical clustering and <a href="http://blog.csdn.net/xiahouzuoxin/article/details/7748823">fuzzy clustering</a> based on fuzzy equivalence relations. Kmeans is simply the most common of these clustering algorithms, and also the one I use most. Kmeans itself is also computationally expensive (O(N^3)), so it has spawned many variants:</p>
<ol style="list-style-type: decimal">
<li>For floating-point data, round the floats to integers to speed up the arithmetic (in my own practice I first multiply by a coefficient so the data spreads across the whole integer range, reducing the rounding loss)</li>
<li>Hierarchical Kmeans clustering (HIKM)</li>
</ol>
<p>C implementations of both improvements can be found in <a href="http://www.vlfeat.org/">VLFeat</a>. To consolidate the basics, though, I would rather first review the most elementary Kmeans clustering, and add some HIKM material from my own experience later. This is the first unsupervised learning algorithm in the Stanford machine learning course. To be rigorous, here is the general statement of the Kmeans clustering problem:</p>
<ol style="list-style-type: decimal">
<li>Given a data set: <img src="http://latex.codecogs.com/gif.latex? {x^{(1)},x^{(2)},...,x^{(m)}}"></li>
<li>Group the data into K clusters</li>
</ol>
<p>Note that, in contrast to supervised learning, Kmeans has no label information, yet the number of clusters K is known; in other words, to use Kmeans we must decide in advance how many clusters to form. There are also improvements on Kmeans that determine the number of clusters directly from a threshold, but they are not discussed here. The Kmeans algorithm runs as follows:</p>
<div class="figure">
<img src="../images/Stanford机器学习课程笔记4-Kmeans与高斯混合模型/KmeansAlgorithm.png" />
</div>
<p>First randomly initialize K cluster centers, compute the distance from each sample to the centers, use the resulting assignment to compute new centers, and iterate (a minimal sketch follows the list below). Two questions are worth raising:</p>
<ol style="list-style-type: decimal">
<li>How exactly should the cluster centers be randomly initialized: are these K points drawn from the data or not? Besides plain random initialization, there is also a scheme called K-means++ (see reference 2)</li>
<li><p>Every iterative algorithm raises a question worth discussing: the convergence criterion. In plain words, the Kmeans criterion is that the cluster centers u stabilize and stop changing appreciably. More rigorously, define the distortion function:</p>
<p><img src="http://latex.codecogs.com/gif.latex? J(c,\mu)=\sum_{i=1}^m||x^{(i)}-\mu_{c^{(i)}}||^2"></p>
<p>J is the sum of squared distances from each training sample to its assigned cluster center, and Kmeans has converged when J reaches a minimum. However, since J is non-convex, coordinate descent cannot be guaranteed to reach the global optimum (coordinate descent is discussed in Andrew Ng's SVM notes; it is the multi-variable counterpart of gradient descent: fix variable A and update B by gradient descent, then fix B and update A, and so on). In practice, though, (1) oscillation among multiple local optima is rare, and (2) even when Kmeans converges to a local optimum, that local optimum is usually acceptable for real problems. So Kmeans remains the most widely used unsupervised clustering algorithm.</p></li>
</ol>
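<p>To make the iteration above concrete, here is a minimal Kmeans sketch (my own illustration, not from the course materials); it draws the initial centers from K random data points and stops once the assignments no longer change. Empty clusters are not handled:</p>
<pre class="sourceCode matlab"><code class="sourceCode matlab">function [label, mu] = kmeans_demo(X, K)
% X: d x n data matrix, K: number of clusters
[~, n] = size(X);
mu = X(:, randsample(n, K));    % initialize centers from K random data points
label = zeros(1, n);
while true
    % assignment step: argmin_j ||x - mu_j||^2 = argmin_j (||mu_j||^2/2 - mu_j'*x), per column
    D = bsxfun(@plus, dot(mu,mu,1)', -2*mu'*X);    % K x n distance surrogate
    [~, newlabel] = min(D, [], 1);
    if isequal(newlabel, label), break; end        % assignments stable: converged
    label = newlabel;
    % update step: each center becomes the mean of its assigned samples
    for j = 1:K
        mu(:, j) = mean(X(:, label == j), 2);
    end
end</code></pre>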
<p>Matlab's Statistics and Machine Learning Toolbox ships with a kmeans implementation; see <code>doc kmeans</code> in Matlab or <a href="http://cn.mathworks.com/help/stats/kmeans.html" class="uri">http://cn.mathworks.com/help/stats/kmeans.html</a> for details.</p>
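<p>For instance, a hypothetical call (note that the built-in kmeans takes an n x d matrix, one observation per row, i.e. the transpose of the convention in the sketch above):</p>
<pre class="sourceCode matlab"><code class="sourceCode matlab">X = [randn(100,2)+2; randn(100,2)-2];    % two synthetic 2-D clusters
[idx, C] = kmeans(X, 2);                 % idx: cluster index per row, C: 2 x 2 centers</code></pre>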
<h2 id="混合高斯模型mixture-of-gaussian">混合高斯模型(Mixture of Gaussian)</h2>
<p>The scores of an advanced class and a regular class have been mixed together, and we want to separate them. All we know is that each class's scores follow a Gaussian distribution, but not the mean score or the variance of either class. How can unsupervised clustering separate the two classes' data? This is an instance of the Gaussian mixture model.</p>
<p>Recall the generative model of GDA (Gaussian Discriminant Analysis): there, p(y) follows a Bernoulli distribution and p(x|y) is Gaussian; that is, Gaussian models hidden behind the data generated the training data, and maximizing the joint probability p(x,y) by maximum likelihood yields the Bernoulli parameter phi and the Gaussian parameters mu and sigma. The Gaussian mixture model works on unlabeled data, so there is no y; instead the model assumes a latent/hidden variable z (playing the role of y in GDA). The very same maximum-likelihood procedure as in GDA then gives phi(z), mu(z), sigma(z), except that phi, mu, and sigma now depend on the latent variable z and are undetermined. This calls for an iterative algorithm: first estimate z, or more precisely the probability of z; then use that estimate to compute phi, mu, and sigma; then update the estimate of z; and iterate.</p>
<p>The Gaussian mixture model is a special case of the EM algorithm, whose iterations split into an E-Step (estimating the probability of z) and an M-Step (maximizing the likelihood):</p>
<div class="figure">
<img src="../images/Stanford机器学习课程笔记4-Kmeans与高斯混合模型/MixtureofGaussian.png" />
</div>
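<p>Concretely, the E-Step obtains the posterior responsibility of each mixture component via Bayes Rule (the formula below is reconstructed from Ng's notes):</p>
<p><img src="http://latex.codecogs.com/gif.latex? w_j^{(i)}=p(z^{(i)}=j|x^{(i)};\phi,\mu,\Sigma)=\frac{p(x^{(i)}|z^{(i)}=j)p(z^{(i)}=j)}{\sum_{l=1}^{k}p(x^{(i)}|z^{(i)}=l)p(z^{(i)}=l)}"></p>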
<p>In other words, the E-Step first computes the likelihood and the prior from the initialized phi, mu, and sigma, then applies Bayes Rule to obtain the posterior estimate of z. Like Kmeans, the EM algorithm may converge to a local optimum. One small difference: the implementation below drops empty components as it iterates, so the effective number of components can shrink automatically, whereas in Kmeans the number of clusters is fixed in advance. Below is an EM implementation of the Gaussian mixture model found at <a href="http://www.mathworks.com/matlabcentral/fileexchange/26184-em-algorithm-for-gaussian-mixture-model" class="uri">http://www.mathworks.com/matlabcentral/fileexchange/26184-em-algorithm-for-gaussian-mixture-model</a>; you can also download the complete <a href="https://xiahouzuoxin.github.io/notes/codes/StandfordMachineLearning/EM.zip">EM Example</a> and run it in Matlab.</p>
<pre class="sourceCode matlab"><code class="sourceCode matlab">function [label, model, llh] = emgm(X, init)
<span class="co">% Perform EM algorithm for fitting the Gaussian mixture model.</span>
<span class="co">% X: d x n data matrix</span>
<span class="co">% init: k (1 x 1) or label (1 x n, 1<=label(i)<=k) or center (d x k)</span>
<span class="co">% Written by Michael Chen (sth4nth@gmail.com).</span>
<span class="co">%% initialization</span>
fprintf(<span class="st">'EM for Gaussian mixture: running ... \n'</span>);
R = initialization(X,init);
[~,label(<span class="fl">1</span>,:)] = max(R,[],<span class="fl">2</span>);
R = R(:,unique(label));
tol = <span class="fl">1e-10</span>;
maxiter = <span class="fl">500</span>;
llh = -inf(<span class="fl">1</span>,maxiter);
converged = false;
t = <span class="fl">1</span>;
while ~converged && t < maxiter
t = t+<span class="fl">1</span>;
model = maximization(X,R);
[R, llh(t)] = expectation(X,model);
[~,label(:)] = max(R,[],<span class="fl">2</span>);
u = unique(label); <span class="co">% non-empty components</span>
if size(R,<span class="fl">2</span>) ~= size(u,<span class="fl">2</span>)
R = R(:,u); <span class="co">% remove empty components</span>
else
converged = llh(t)-llh(t-<span class="fl">1</span>) < tol*abs(llh(t));
end
end
llh = llh(<span class="fl">2</span>:t);
if converged
fprintf(<span class="st">'Converged in %d steps.\n'</span>,t-<span class="fl">1</span>);
else
fprintf(<span class="st">'Not converged in %d steps.\n'</span>,maxiter);
end
function R = initialization(X, init)
[d,n] = size(X);
if isstruct(init) <span class="co">% initialize with a model</span>
R = expectation(X,init);
elseif length(init) == <span class="fl">1</span> <span class="co">% random initialization</span>
k = init;
idx = randsample(n,k);
m = X(:,idx);
[~,label] = max(bsxfun(@minus,m'*X,dot(m,m,<span class="fl">1</span>)'/<span class="fl">2</span>),[],<span class="fl">1</span>);
[u,~,label] = unique(label);
while k ~= length(u)
idx = randsample(n,k);
m = X(:,idx);
[~,label] = max(bsxfun(@minus,m'*X,dot(m,m,<span class="fl">1</span>)'/<span class="fl">2</span>),[],<span class="fl">1</span>);
[u,~,label] = unique(label);
end
R = full(sparse(<span class="fl">1</span>:n,label,<span class="fl">1</span>,n,k,n));
elseif size(init,<span class="fl">1</span>) == <span class="fl">1</span> && size(init,<span class="fl">2</span>) == n <span class="co">% initialize with labels</span>
label = init;
k = max(label);
R = full(sparse(<span class="fl">1</span>:n,label,<span class="fl">1</span>,n,k,n));
elseif size(init,<span class="fl">1</span>) == d <span class="co">%initialize with only centers</span>
k = size(init,<span class="fl">2</span>);
m = init;
[~,label] = max(bsxfun(@minus,m'*X,dot(m,m,<span class="fl">1</span>)'/<span class="fl">2</span>),[],<span class="fl">1</span>);
R = full(sparse(<span class="fl">1</span>:n,label,<span class="fl">1</span>,n,k,n));
else
error(<span class="st">'ERROR: init is not valid.'</span>);
end
function [R, llh] = expectation(X, model)
mu = model.mu;
Sigma = model.Sigma;
w = model.weight;
n = size(X,<span class="fl">2</span>);
k = size(mu,<span class="fl">2</span>);
logRho = zeros(n,k);
for i = <span class="fl">1</span>:k
logRho(:,i) = loggausspdf(X,mu(:,i),Sigma(:,:,i));
end
logRho = bsxfun(@plus,logRho,log(w));
T = logsumexp(logRho,<span class="fl">2</span>);
llh = sum(T)/n; <span class="co">% loglikelihood</span>
logR = bsxfun(@minus,logRho,T);
R = exp(logR);
function model = maximization(X, R)
[d,n] = size(X);
k = size(R,<span class="fl">2</span>);
nk = sum(R,<span class="fl">1</span>);
w = nk/n;
mu = bsxfun(@times, X*R, <span class="fl">1</span>./nk);
Sigma = zeros(d,d,k);
sqrtR = sqrt(R);
for i = <span class="fl">1</span>:k
Xo = bsxfun(@minus,X,mu(:,i));
Xo = bsxfun(@times,Xo,sqrtR(:,i)');
Sigma(:,:,i) = Xo*Xo'/nk(i);
Sigma(:,:,i) = Sigma(:,:,i)+eye(d)*(<span class="fl">1e-6</span>); <span class="co">% add a prior for numerical stability</span>
end
model.mu = mu;
model.Sigma = Sigma;
model.weight = w;
function y = loggausspdf(X, mu, Sigma)
d = size(X,<span class="fl">1</span>);
X = bsxfun(@minus,X,mu);
[U,p]= chol(Sigma);
if p ~= <span class="fl">0</span>
error(<span class="st">'ERROR: Sigma is not PD.'</span>);
end
Q = U'\X;
q = dot(Q,Q,<span class="fl">1</span>); <span class="co">% quadratic term (M distance)</span>
c = d*log(<span class="fl">2</span>*pi)+<span class="fl">2</span>*sum(log(diag(U))); <span class="co">% normalization constant</span>
y = -(c+q)/<span class="fl">2</span>;</code></pre>
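<p>Tying this back to the two-class score example, here is a hypothetical usage sketch (note that emgm expects a d x n matrix, and that expectation relies on a logsumexp helper which ships with the downloadable package rather than being listed above):</p>
<pre class="sourceCode matlab"><code class="sourceCode matlab">% scores of two classes: mean 70 (std 5) and mean 90 (std 3), mixed together
X = [70 + 5*randn(1,200), 90 + 3*randn(1,200)];    % 1 x 400, i.e. d=1, n=400
[label, model, llh] = emgm(X, 2);    % fit a 2-component Gaussian mixture
model.mu        % estimated per-class mean scores
model.weight    % estimated mixing proportions</code></pre>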
<h2 id="参考">参考</h2>
<ol style="list-style-type: decimal">
<li>Andrew Ng Lecture Notes.</li>
<li>D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proc. ACM-SIAM Symp. on Discrete Algorithms, 2007.</li>
</ol>
<div class="ds-thread" data-thread-key="Stanford机器学习课程笔记4-Kmeans与高斯混合模型" data-title="Stanford机器学习课程笔记4-Kmeans与高斯混合模型" data-url="xiahouzuoxin.github.io/notes/html/Stanford机器学习课程笔记4-Kmeans与高斯混合模型.html"></div>
<script>window._bd_share_config={"common":{"bdSnsKey":{},"bdText":"","bdMini":"2","bdMiniList":false,"bdPic":"","bdStyle":"0","bdSize":"16"},"slide":{"type":"slide","bdImg":"5","bdPos":"right","bdTop":"300"},"image":{"viewList":["qzone","tsina","tqq","renren","weixin"],"viewText":"分享到:","viewSize":"16"},"selectShare":{"bdContainerClass":null,"bdSelectMiniList":["qzone","tsina","tqq","renren","weixin"]}};with(document)0[(getElementsByTagName('head')[0]||body).appendChild(createElement('script')).src='http://bdimg.share.baidu.com/static/api/js/share.js?v=89860593.js?cdnversion='+~(-new Date()/36e5)];</script>
<!-- 多说公共JS代码 start (一个网页只需插入一次) -->
<script type="text/javascript">
var duoshuoQuery = {short_name:"xiahouzuoxin"};
(function() {
var ds = document.createElement('script');
ds.type = 'text/javascript';ds.async = true;
ds.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') + '//static.duoshuo.com/embed.js';
ds.charset = 'UTF-8';
(document.getElementsByTagName('head')[0]
|| document.getElementsByTagName('body')[0]).appendChild(ds);
})();
</script>
<!-- 多说公共JS代码 end -->
<div id="footer">
<p class="footer_subline">联系邮箱: xiahouzuoxin@163.com</p>
<p class="footer_subline">声明: 本站所有文章如非特别说明均为原创,转载请注明出处!
<script type="text/javascript">var cnzz_protocol = (("https:" == document.location.protocol) ? " https://" : " http://");document.write(unescape("%3Cspan id='cnzz_stat_icon_1253219218'%3E%3C/span%3E%3Cscript src='" + cnzz_protocol + "s4.cnzz.com/z_stat.php%3Fid%3D1253219218%26show%3Dpic' type='text/javascript'%3E%3C/script%3E"));</script>
</p>
</div>
</body>
</html>