Commit cc56c87

[SPARK-5806] re-organize sections in mllib-clustering.md
Put example code close to the algorithm description.

Author: Xiangrui Meng <meng@databricks.com>

Closes apache#4598 from mengxr/SPARK-5806 and squashes the following commits:

a137872 [Xiangrui Meng] re-organize sections in mllib-clustering.md
1 parent 2e0c084 commit cc56c87

2 files changed

+77
-87
lines changed

docs/mllib-clustering.md

Lines changed: 72 additions & 84 deletions
@@ -4,12 +4,6 @@ title: Clustering - MLlib
44
displayTitle: <a href="mllib-guide.html">MLlib</a> - Clustering
55
---
66

7-
* Table of contents
8-
{:toc}
9-
10-
11-
## Clustering
12-
137
Clustering is an unsupervised learning problem whereby we aim to group subsets
148
of entities with one another based on some notion of similarity. Clustering is
159
often used for exploratory analysis and/or as a component of a hierarchical
@@ -18,7 +12,10 @@ models are trained for each cluster).
1812

1913
MLlib supports the following models:
2014

21-
### k-means
15+
* Table of contents
16+
{:toc}
17+
18+
## K-means
2219

2320
[k-means](http://en.wikipedia.org/wiki/K-means_clustering) is one of the
2421
most commonly used clustering algorithms that clusters the data points into a
@@ -37,72 +34,7 @@ a given dataset, the algorithm returns the best clustering result).
3734
* *initializationSteps* determines the number of steps in the k-means\|\| algorithm.
3835
* *epsilon* determines the distance threshold within which we consider k-means to have converged.
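
As a rough illustration of the *epsilon* parameter above, the check below treats a run as converged once no center moves farther than `epsilon` between two iterations. This is a minimal Python sketch under that reading, not MLlib's code; the helper name `has_converged` is hypothetical.

```python
import math


def has_converged(old_centers, new_centers, epsilon):
    """Return True if every center moved by at most epsilon (Euclidean distance).

    Hypothetical helper: centers are equal-length sequences of floats,
    paired by position between the previous and the current iteration.
    """
    return all(
        math.dist(old, new) <= epsilon
        for old, new in zip(old_centers, new_centers)
    )
```

A k-means loop would call this after each center update and stop early when it returns `True`, or when the iteration cap is reached.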
3936

40-
### Gaussian mixture
41-
42-
A [Gaussian Mixture Model](http://en.wikipedia.org/wiki/Mixture_model#Multivariate_Gaussian_mixture_model)
43-
represents a composite distribution whereby points are drawn from one of *k* Gaussian sub-distributions,
44-
each with its own probability. The MLlib implementation uses the
45-
[expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
46-
algorithm to induce the maximum-likelihood model given a set of samples. The implementation
47-
has the following parameters:
48-
49-
* *k* is the number of desired clusters.
50-
* *convergenceTol* is the maximum change in log-likelihood at which we consider convergence achieved.
51-
* *maxIterations* is the maximum number of iterations to perform without reaching convergence.
52-
* *initialModel* is an optional starting point from which to start the EM algorithm. If this parameter is omitted, a random starting point will be constructed from the data.
53-
54-
### Power Iteration Clustering
55-
56-
Power iteration clustering is a scalable and efficient algorithm for clustering points given pointwise mutual affinity values. Internally the algorithm:
57-
58-
* accepts a [Graph](api/graphx/index.html#org.apache.spark.graphx.Graph) that represents a normalized pairwise affinity between all input points.
59-
* calculates the principal eigenvalue and eigenvector
60-
* Clusters each of the input points according to their principal eigenvector component value
61-
62-
Details of this algorithm are found in [Power Iteration Clustering, Lin and Cohen](http://www.icml2010.org/papers/387.pdf).
63-
64-
For a dataset inspired by the paper, but with five clusters instead of three, our implementation produces the following output:
65-
66-
<p style="text-align: center;">
67-
<img src="img/PIClusteringFiveCirclesInputsAndOutputs.png"
68-
title="The Property Graph"
69-
alt="The Property Graph"
70-
width="50%" />
71-
<!-- Images are downsized intentionally to improve quality on retina displays -->
72-
</p>
73-
74-
### Latent Dirichlet Allocation (LDA)
75-
76-
[Latent Dirichlet Allocation (LDA)](http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
77-
is a topic model which infers topics from a collection of text documents.
78-
LDA can be thought of as a clustering algorithm as follows:
79-
80-
* Topics correspond to cluster centers, and documents correspond to examples (rows) in a dataset.
81-
* Topics and documents both exist in a feature space, where feature vectors are vectors of word counts.
82-
* Rather than estimating a clustering using a traditional distance, LDA uses a function based
83-
on a statistical model of how text documents are generated.
84-
85-
LDA takes in a collection of documents as vectors of word counts.
86-
It learns clustering using [expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
87-
on the likelihood function. After fitting on the documents, LDA provides:
88-
89-
* Topics: Inferred topics, each of which is a probability distribution over terms (words).
90-
* Topic distributions for documents: For each document in the training set, LDA gives a probability distribution over topics.
91-
92-
LDA takes the following parameters:
93-
94-
* `k`: Number of topics (i.e., cluster centers)
95-
* `maxIterations`: Limit on the number of iterations of EM used for learning
96-
* `docConcentration`: Hyperparameter for prior over documents' distributions over topics. Currently must be > 1, where larger values encourage smoother inferred distributions.
97-
* `topicConcentration`: Hyperparameter for prior over topics' distributions over terms (words). Currently must be > 1, where larger values encourage smoother inferred distributions.
98-
* `checkpointInterval`: If using checkpointing (set in the Spark configuration), this parameter specifies the frequency with which checkpoints will be created. If `maxIterations` is large, using checkpointing can help reduce shuffle file sizes on disk and help with failure recovery.
99-
100-
*Note*: LDA is a new feature with some missing functionality. In particular, it does not yet
101-
support prediction on new documents, and it does not have a Python API. These will be added in the future.
102-
103-
### Examples
104-
105-
#### k-means
37+
**Examples**
10638

10739
<div class="codetabs">
10840
<div data-lang="scala" markdown="1">
@@ -216,7 +148,21 @@ print("Within Set Sum of Squared Error = " + str(WSSSE))
216148

217149
</div>
218150

219-
#### GaussianMixture
151+
## Gaussian mixture
152+
153+
A [Gaussian Mixture Model](http://en.wikipedia.org/wiki/Mixture_model#Multivariate_Gaussian_mixture_model)
154+
represents a composite distribution whereby points are drawn from one of *k* Gaussian sub-distributions,
155+
each with its own probability. The MLlib implementation uses the
156+
[expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
157+
algorithm to induce the maximum-likelihood model given a set of samples. The implementation
158+
has the following parameters:
159+
160+
* *k* is the number of desired clusters.
161+
* *convergenceTol* is the maximum change in log-likelihood at which we consider convergence achieved.
162+
* *maxIterations* is the maximum number of iterations to perform without reaching convergence.
163+
* *initialModel* is an optional starting point from which to start the EM algorithm. If this parameter is omitted, a random starting point will be constructed from the data.
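
To make the roles of *convergenceTol* and *maxIterations* concrete, here is a minimal Python sketch of the quantity EM monitors and of the stopping rule. It assumes one-dimensional Gaussians for brevity (the MLlib model is multivariate), and all function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a 1-D Gaussian at x (1-D is a simplifying assumption)."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(
        2.0 * math.pi * variance
    )


def log_likelihood(samples, weights, means, variances):
    """Log-likelihood of the samples under a mixture of k Gaussians.

    Each component i contributes weights[i] * N(x | means[i], variances[i]).
    """
    total = 0.0
    for x in samples:
        density = sum(
            w * gaussian_pdf(x, m, v)
            for w, m, v in zip(weights, means, variances)
        )
        total += math.log(density)
    return total


def should_stop(prev_ll, curr_ll, iteration, convergence_tol, max_iterations):
    """Stop when the log-likelihood change is below the tolerance,
    or when the iteration cap has been reached."""
    return abs(curr_ll - prev_ll) < convergence_tol or iteration >= max_iterations
```

An EM loop would recompute `log_likelihood` after each update of the weights, means, and variances, and pass the previous and current values to `should_stop`.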
164+
165+
**Examples**
220166

221167
<div class="codetabs">
222168
<div data-lang="scala" markdown="1">
@@ -322,7 +268,56 @@ for i in range(2):
322268

323269
</div>
324270

325-
#### Latent Dirichlet Allocation (LDA) Example
271+
## Power iteration clustering (PIC)
272+
273+
Power iteration clustering (PIC) is a scalable and efficient algorithm for clustering points given pointwise mutual affinity values. Internally the algorithm:
274+
275+
* accepts a [Graph](api/graphx/index.html#org.apache.spark.graphx.Graph) that represents a normalized pairwise affinity between all input points.
276+
* calculates the principal eigenvalue and eigenvector
277+
* Clusters each of the input points according to their principal eigenvector component value
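
The eigenvector step in the list above can be sketched with the plain power method. This is a minimal Python illustration, assuming a small dense affinity matrix with non-negative entries given as nested lists; it is not the GraphX-based implementation, the function name is hypothetical, and the final clustering step is omitted.

```python
def power_iteration(matrix, num_iterations=100):
    """Estimate the principal eigenvalue and eigenvector of a square matrix.

    Assumes non-negative entries and at least one non-zero row, so the
    L1 norm of the product is non-zero and approximates the eigenvalue.
    """
    n = len(matrix)
    vector = [1.0 / n] * n  # uniform start, L1 norm equal to 1
    eigenvalue = 0.0
    for _ in range(num_iterations):
        # Multiply the matrix by the current vector.
        product = [sum(row[j] * vector[j] for j in range(n)) for row in matrix]
        # Rescale so the vector keeps an L1 norm of 1.
        norm = sum(abs(p) for p in product)
        vector = [p / norm for p in product]
        eigenvalue = norm
    return eigenvalue, vector
```

Points would then be grouped by their component in the returned vector, for example by running a one-dimensional k-means over those values.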
278+
279+
Details of this algorithm are found in [Power Iteration Clustering, Lin and Cohen](http://www.icml2010.org/papers/387.pdf).
280+
281+
For a dataset inspired by the paper, but with five clusters instead of three, our implementation produces the following output:
282+
283+
<p style="text-align: center;">
284+
<img src="img/PIClusteringFiveCirclesInputsAndOutputs.png"
285+
title="The Property Graph"
286+
alt="The Property Graph"
287+
width="50%" />
288+
<!-- Images are downsized intentionally to improve quality on retina displays -->
289+
</p>
290+
291+
## Latent Dirichlet allocation (LDA)
292+
293+
[Latent Dirichlet allocation (LDA)](http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
294+
is a topic model which infers topics from a collection of text documents.
295+
LDA can be thought of as a clustering algorithm as follows:
296+
297+
* Topics correspond to cluster centers, and documents correspond to examples (rows) in a dataset.
298+
* Topics and documents both exist in a feature space, where feature vectors are vectors of word counts.
299+
* Rather than estimating a clustering using a traditional distance, LDA uses a function based
300+
on a statistical model of how text documents are generated.
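
The word-count feature vectors mentioned in the list above can be sketched as follows. This is a minimal Python illustration that assumes whitespace tokenization and a vocabulary built from the corpus itself; the helper names are hypothetical and are not part of the MLlib API.

```python
def build_vocabulary(docs):
    """Map each distinct word in the corpus to a fixed vector index."""
    words = sorted({word for doc in docs for word in doc.split()})
    return {word: index for index, word in enumerate(words)}


def to_count_vector(doc, vocabulary):
    """Turn one document into a vector of word counts over the vocabulary."""
    vector = [0] * len(vocabulary)
    for word in doc.split():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] += 1
    return vector
```

Each document becomes one row of counts, and every row has the same length as the vocabulary.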
301+
302+
LDA takes in a collection of documents as vectors of word counts.
303+
It learns clustering using [expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
304+
on the likelihood function. After fitting on the documents, LDA provides:
305+
306+
* Topics: Inferred topics, each of which is a probability distribution over terms (words).
307+
* Topic distributions for documents: For each document in the training set, LDA gives a probability distribution over topics.
308+
309+
LDA takes the following parameters:
310+
311+
* `k`: Number of topics (i.e., cluster centers)
312+
* `maxIterations`: Limit on the number of iterations of EM used for learning
313+
* `docConcentration`: Hyperparameter for prior over documents' distributions over topics. Currently must be > 1, where larger values encourage smoother inferred distributions.
314+
* `topicConcentration`: Hyperparameter for prior over topics' distributions over terms (words). Currently must be > 1, where larger values encourage smoother inferred distributions.
315+
* `checkpointInterval`: If using checkpointing (set in the Spark configuration), this parameter specifies the frequency with which checkpoints will be created. If `maxIterations` is large, using checkpointing can help reduce shuffle file sizes on disk and help with failure recovery.
316+
317+
*Note*: LDA is a new feature with some missing functionality. In particular, it does not yet
318+
support prediction on new documents, and it does not have a Python API. These will be added in the future.
319+
320+
**Examples**
326321

327322
In the following example, we load word count vectors representing a corpus of documents.
328323
We then use [LDA](api/scala/index.html#org.apache.spark.mllib.clustering.LDA)
@@ -419,14 +414,7 @@ public class JavaLDAExample {
419414

420415
</div>
421416

422-
423-
In order to run the above application, follow the instructions
424-
provided in the [Self-Contained Applications](quick-start.html#self-contained-applications)
425-
section of the Spark
426-
Quick Start guide. Be sure to also include *spark-mllib* to your build file as
427-
a dependency.
428-
429-
## Streaming clustering
417+
## Streaming k-means
430418

431419
When data arrive in a stream, we may want to estimate clusters dynamically,
432420
updating them as new data arrive. MLlib provides support for streaming k-means clustering,
@@ -454,7 +442,7 @@ at time `t`, its contribution by time `t + halfLife` will have dropped to 0.5.
454442
The unit of time can be specified either as `batches` or `points` and the update rule
455443
will be adjusted accordingly.
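
The half-life rule described above can be checked numerically. The Python sketch below assumes a constant decay factor applied once per time unit (batch or point); the function names are hypothetical.

```python
def decay_factor(half_life):
    """Per-unit decay factor a such that a ** half_life == 0.5."""
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    return 0.5 ** (1.0 / half_life)


def remaining_contribution(elapsed, half_life):
    """Fraction of a data point's original weight left after `elapsed` units."""
    return decay_factor(half_life) ** elapsed
```

With `half_life = 5`, the contribution of data seen at time `t` falls to one half at `t + 5` and to one quarter at `t + 10`.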
456444

457-
### Examples
445+
**Examples**
458446

459447
This example shows how to estimate clusters on streaming data.
460448

docs/mllib-guide.md

Lines changed: 5 additions & 3 deletions
@@ -24,9 +24,11 @@ filtering, dimensionality reduction, as well as underlying optimization primitiv
2424
* [Collaborative filtering](mllib-collaborative-filtering.html)
2525
* alternating least squares (ALS)
2626
* [Clustering](mllib-clustering.html)
27-
* k-means
28-
* Gaussian mixture
29-
* power iteration
27+
* [k-means](mllib-clustering.html#k-means)
28+
* [Gaussian mixture](mllib-clustering.html#gaussian-mixture)
29+
* [power iteration clustering (PIC)](mllib-clustering.html#power-iteration-clustering-pic)
30+
* [latent Dirichlet allocation (LDA)](mllib-clustering.html#latent-dirichlet-allocation-lda)
31+
* [streaming k-means](mllib-clustering.html#streaming-k-means)
3032
* [Dimensionality reduction](mllib-dimensionality-reduction.html)
3133
* singular value decomposition (SVD)
3234
* principal component analysis (PCA)
