[SPARK-5806] re-organize sections in mllib-clustering.md
Put example code close to the algorithm description.
Author: Xiangrui Meng <meng@databricks.com>
Closes apache#4598 from mengxr/SPARK-5806 and squashes the following commits:
a137872 [Xiangrui Meng] re-organize sections in mllib-clustering.md
-algorithm to induce the maximum-likelihood model given a set of samples. The implementation
-has the following parameters:
-
-* *k* is the number of desired clusters.
-* *convergenceTol* is the maximum change in log-likelihood at which we consider convergence achieved.
-* *maxIterations* is the maximum number of iterations to perform without reaching convergence.
-* *initialModel* is an optional starting point from which to start the EM algorithm. If this parameter is omitted, a random starting point will be constructed from the data.
-
-### Power Iteration Clustering
-
-Power iteration clustering is a scalable and efficient algorithm for clustering points given pointwise mutual affinity values. Internally the algorithm:
-
-* accepts a [Graph](api/graphx/index.html#org.apache.spark.graphx.Graph) that represents a normalized pairwise affinity between all input points.
-* calculates the principal eigenvalue and eigenvector
-* clusters each of the input points according to their principal eigenvector component value
-
-Details of this algorithm are found in [Power Iteration Clustering, Lin and Cohen](http://www.icml2010.org/papers/387.pdf).
-
-A dataset inspired by the paper - but with five clusters instead of three - yields the following output from our implementation:
-
-Latent Dirichlet allocation (LDA) is a topic model which infers topics from a collection of text documents.
-LDA can be thought of as a clustering algorithm as follows:
-
-* Topics correspond to cluster centers, and documents correspond to examples (rows) in a dataset.
-* Topics and documents both exist in a feature space, where feature vectors are vectors of word counts.
-* Rather than estimating a clustering using a traditional distance, LDA uses a function based on a statistical model of how text documents are generated.
-
-LDA takes in a collection of documents as vectors of word counts.
-It learns clustering using [expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
-on the likelihood function. After fitting on the documents, LDA provides:
-
-* Topics: Inferred topics, each of which is a probability distribution over terms (words).
-* Topic distributions for documents: For each document in the training set, LDA gives a probability distribution over topics.
-
-LDA takes the following parameters:
-
-* `k`: Number of topics (i.e., cluster centers)
-* `maxIterations`: Limit on the number of iterations of EM used for learning
-* `docConcentration`: Hyperparameter for the prior over documents' distributions over topics. Currently must be > 1, where larger values encourage smoother inferred distributions.
-* `topicConcentration`: Hyperparameter for the prior over topics' distributions over terms (words). Currently must be > 1, where larger values encourage smoother inferred distributions.
-* `checkpointInterval`: If using checkpointing (set in the Spark configuration), this parameter specifies the frequency with which checkpoints will be created. If `maxIterations` is large, using checkpointing can help reduce shuffle file sizes on disk and help with failure recovery.
-
-*Note*: LDA is a new feature with some missing functionality. In particular, it does not yet
-support prediction on new documents, and it does not have a Python API. These will be added in the future.
-
-### Examples
-
-#### k-means
+**Examples**

 <div class="codetabs">
 <div data-lang="scala" markdown="1">
@@ -216,7 +148,21 @@ print("Within Set Sum of Squared Error = " + str(WSSSE))

 </div>

-#### GaussianMixture
+## Gaussian mixture
+
+A [Gaussian Mixture Model](http://en.wikipedia.org/wiki/Mixture_model#Multivariate_Gaussian_mixture_model)
+represents a composite distribution whereby points are drawn from one of *k* Gaussian sub-distributions,
+each with its own probability. The MLlib implementation uses the
+[expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
+algorithm to induce the maximum-likelihood model given a set of samples. The implementation
+has the following parameters:
+
+* *k* is the number of desired clusters.
+* *convergenceTol* is the maximum change in log-likelihood at which we consider convergence achieved.
+* *maxIterations* is the maximum number of iterations to perform without reaching convergence.
+* *initialModel* is an optional starting point from which to start the EM algorithm. If this parameter is omitted, a random starting point will be constructed from the data.
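To make these parameters concrete, here is a minimal Scala sketch of fitting a two-component mixture. It is a sketch only: it assumes an existing `SparkContext` named `sc`, and the toy points are made up; the tabbed examples below show the full workflow.

```scala
import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vectors

// Toy data: two well-separated groups of 2-dimensional points.
val data = sc.parallelize(Seq(
  Vectors.dense(0.1, 0.2), Vectors.dense(0.3, 0.1),
  Vectors.dense(9.0, 8.5), Vectors.dense(8.7, 9.2)))

// Wire up the parameters described above and run EM.
val gmm = new GaussianMixture()
  .setK(2)                  // number of desired clusters
  .setConvergenceTol(0.01)  // stop once the log-likelihood gain falls below this
  .setMaxIterations(100)    // cap on EM iterations
  .run(data)

// Each component is a weighted multivariate Gaussian (mean vector, covariance matrix).
gmm.weights.zip(gmm.gaussians).foreach { case (w, g) =>
  println(s"weight=$w\nmu=${g.mu}\nsigma=\n${g.sigma}")
}
```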
+
+**Examples**

 <div class="codetabs">
 <div data-lang="scala" markdown="1">
@@ -322,7 +268,56 @@ for i in range(2):

 </div>

-#### Latent Dirichlet Allocation (LDA) Example
+## Power iteration clustering (PIC)
+
+Power iteration clustering (PIC) is a scalable and efficient algorithm for clustering points given pointwise mutual affinity values. Internally the algorithm:
+
+* accepts a [Graph](api/graphx/index.html#org.apache.spark.graphx.Graph) that represents a normalized pairwise affinity between all input points.
+* calculates the principal eigenvalue and eigenvector
+* clusters each of the input points according to their principal eigenvector component value
+
+Details of this algorithm are found in [Power Iteration Clustering, Lin and Cohen](http://www.icml2010.org/papers/387.pdf).
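As a usage illustration, here is a minimal Scala sketch of invoking PIC. It is a sketch under assumptions: it uses the `PowerIterationClustering` class in `org.apache.spark.mllib.clustering`, feeds the affinities as an RDD of `(srcId, dstId, similarity)` tuples rather than a `Graph`, and the similarity values are made up.

```scala
import org.apache.spark.mllib.clustering.PowerIterationClustering

// Pairwise affinities as (srcId, dstId, similarity) tuples; the values are illustrative.
val similarities = sc.parallelize(Seq(
  (0L, 1L, 0.9), (1L, 2L, 0.9), (2L, 3L, 0.1),
  (3L, 4L, 0.9), (4L, 5L, 0.8)))

// Cluster the points using the affinity matrix defined above.
val model = new PowerIterationClustering()
  .setK(2)               // number of clusters
  .setMaxIterations(20)  // number of power iterations
  .run(similarities)

// Each assignment pairs a point id with its cluster label.
model.assignments.collect().foreach(println)
```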
+
+A dataset inspired by the paper - but with five clusters instead of three - yields the following output from our implementation:
+## Latent Dirichlet allocation (LDA)
+
+Latent Dirichlet allocation (LDA) is a topic model which infers topics from a collection of text documents.
+LDA can be thought of as a clustering algorithm as follows:
+
+* Topics correspond to cluster centers, and documents correspond to examples (rows) in a dataset.
+* Topics and documents both exist in a feature space, where feature vectors are vectors of word counts.
+* Rather than estimating a clustering using a traditional distance, LDA uses a function based on a statistical model of how text documents are generated.
+
+LDA takes in a collection of documents as vectors of word counts.
+It learns clustering using [expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
+on the likelihood function. After fitting on the documents, LDA provides:
+
+* Topics: Inferred topics, each of which is a probability distribution over terms (words).
+* Topic distributions for documents: For each document in the training set, LDA gives a probability distribution over topics.
+
+LDA takes the following parameters:
+
+* `k`: Number of topics (i.e., cluster centers)
+* `maxIterations`: Limit on the number of iterations of EM used for learning
+* `docConcentration`: Hyperparameter for the prior over documents' distributions over topics. Currently must be > 1, where larger values encourage smoother inferred distributions.
+* `topicConcentration`: Hyperparameter for the prior over topics' distributions over terms (words). Currently must be > 1, where larger values encourage smoother inferred distributions.
+* `checkpointInterval`: If using checkpointing (set in the Spark configuration), this parameter specifies the frequency with which checkpoints will be created. If `maxIterations` is large, using checkpointing can help reduce shuffle file sizes on disk and help with failure recovery.
+
+*Note*: LDA is a new feature with some missing functionality. In particular, it does not yet
+support prediction on new documents, and it does not have a Python API. These will be added in the future.
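To make the parameter list concrete, here is a minimal Scala sketch. It is a sketch only: it assumes an existing `SparkContext` named `sc` and a tiny made-up corpus; the full runnable example follows under **Examples**.

```scala
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Toy corpus: each document is (docId, term-count vector) over a 3-word vocabulary.
val corpus = sc.parallelize(Seq(
  (0L, Vectors.dense(1.0, 2.0, 0.0)),
  (1L, Vectors.dense(0.0, 3.0, 1.0)),
  (2L, Vectors.dense(4.0, 0.0, 2.0))))

// Wire up the parameters described above and fit via EM.
val ldaModel = new LDA()
  .setK(2)                     // number of topics
  .setMaxIterations(20)        // cap on EM iterations
  .setDocConcentration(1.1)    // currently must be > 1
  .setTopicConcentration(1.1)  // currently must be > 1
  .run(corpus)

// topicsMatrix is vocabSize x k: column j is topic j's distribution over terms.
println(ldaModel.topicsMatrix)
```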
+
+**Examples**

 In the following example, we load word count vectors representing a corpus of documents.
 We then use [LDA](api/scala/index.html#org.apache.spark.mllib.clustering.LDA)
@@ -419,14 +414,7 @@ public class JavaLDAExample {

 </div>

-
-In order to run the above application, follow the instructions
-provided in the [Self-Contained Applications](quick-start.html#self-contained-applications)
-section of the Spark
-Quick Start guide. Be sure to also include *spark-mllib* to your build file as
-a dependency.
-
-## Streaming clustering
+## Streaming k-means

 When data arrive in a stream, we may want to estimate clusters dynamically,
 updating them as new data arrive. MLlib provides support for streaming k-means clustering,
@@ -454,7 +442,7 @@ at time `t`, its contribution by time `t + halfLife` will have dropped to 0.5.
 The unit of time can be specified either as `batches` or `points` and the update rule
 will be adjusted accordingly.

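Here is a minimal Scala sketch of wiring these options together. It is a sketch under assumptions: the training directory is hypothetical, and `setHalfLife` is taken to convert a half-life in the given time unit into the equivalent decay factor.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingKMeansSketch")
val ssc = new StreamingContext(conf, Seconds(1))

// Hypothetical directory: each new text file contains points like "[0.1,0.2]".
val trainingStream = ssc.textFileStream("/tmp/training").map(Vectors.parse)

val model = new StreamingKMeans()
  .setK(3)                      // number of clusters
  .setHalfLife(5.0, "batches")  // past data's weight halves every 5 batches
  .setRandomCenters(2, 0.0)     // 2-dimensional random centers, zero initial weight

model.trainOn(trainingStream)
ssc.start()
ssc.awaitTermination()
```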
-### Examples
+**Examples**

 This example shows how to estimate clusters on streaming data.