docs/mllib-clustering.md
## Streaming clustering

When data arrive in a stream, we may want to estimate clusters dynamically,
updating them as new data arrive. MLlib provides support for streaming k-means clustering,
with parameters to control the decay (or "forgetfulness") of the estimates. The algorithm
uses a generalization of the mini-batch k-means update rule. For each batch of data, we assign
all points to their nearest cluster, compute new cluster centers, then update each cluster using:
`\begin{equation}
    c_{t+1} = \frac{c_t n_t \alpha + x_t m_t}{n_t \alpha + m_t}
\end{equation}`
`\begin{equation}
    n_{t+1} = n_t + m_t
\end{equation}`

Where `$c_t$` is the previous center for the cluster, `$n_t$` is the number of points assigned
to the cluster thus far, `$x_t$` is the new cluster center from the current batch, and `$m_t$`
is the number of points added to the cluster in the current batch. The decay factor `$\alpha$`
can be used to ignore the past: with `$\alpha$=1` all data will be used from the beginning;
with `$\alpha$=0` only the most recent data will be used. This is analogous to an
exponentially-weighted moving average.
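To make the update rule concrete, here is a minimal sketch of one cluster's update in plain
Scala. It is an illustration of the equations above under invented names (`ClusterState`,
`update`), not MLlib's internal implementation:

{% highlight scala %}
// One cluster's running state: its center (c_t) and the number of
// points assigned to it so far (n_t).
case class ClusterState(center: Array[Double], count: Long)

// Apply the update rule, given the mean of the points assigned to this
// cluster in the current batch (x_t), their count (m_t), and the decay
// factor alpha.
def update(state: ClusterState, batchMean: Array[Double], batchCount: Long,
    alpha: Double): ClusterState = {
  if (batchCount == 0) state // no new points; the center is unchanged
  else {
    val discounted = state.count * alpha // n_t * alpha
    val total = discounted + batchCount  // n_t * alpha + m_t
    val newCenter = state.center.zip(batchMean).map { case (c, x) =>
      (c * discounted + x * batchCount) / total // c_{t+1}
    }
    ClusterState(newCenter, state.count + batchCount) // n_{t+1} = n_t + m_t
  }
}
{% endhighlight %}

With `alpha = 1.0` this reduces to an ordinary running mean over all batches seen so far;
with `alpha = 0.0` each new center is simply the current batch's mean.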
Then we make an input stream of vectors for training, as well as a stream of labeled data
points for testing. We assume a StreamingContext `ssc` has been created; see the
[Spark Streaming Programming Guide](streaming-programming-guide.html#initializing) for more info.
{% highlight scala %}
val trainingData = ssc.textFileStream("/training/data/dir").map(Vectors.parse)
val testData = ssc.textFileStream("/testing/data/dir").map(LabeledPoint.parse)
{% endhighlight %}
We create a model with random clusters and specify the number of clusters to find.

{% highlight scala %}
val numDimensions = 3
val numClusters = 2
val model = new StreamingKMeans()
  .setK(numClusters)
  .setDecayFactor(1.0)
  .setRandomWeights(numDimensions)
{% endhighlight %}
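Note that `setDecayFactor(1.0)` corresponds to `$\alpha$=1` in the update rule above, so all
data are weighted equally from the beginning; a value closer to 0 would favor the most recent
batches and forget older ones.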
Now register the streams for training and testing and start the job, printing
the predicted cluster assignments on new data points as they arrive.
{% highlight scala %}
model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

ssc.start()
ssc.awaitTermination()
{% endhighlight %}
As you add new text files with data, the cluster centers will update. Each training
point should be formatted as `[x1, x2, x3]`, and each test data point
should be formatted as `(y, [x1, x2, x3])`, where `y` is some useful label or identifier
(e.g. a true category assignment). Anytime a text file is placed in `/training/data/dir`
the model will update. Anytime a text file is placed in `/testing/data/dir`
you will see predictions. With new data, the cluster centers will change!
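For example (with made-up values), a training file containing the single line `[1.0, -2.5, 0.3]`
contributes one three-dimensional point to the next model update, while a test file containing
`(4, [1.0, -2.5, 0.3])` requests a cluster prediction for the same point, keyed by the label `4`.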