Shiny application illustrating the k-means clustering method
author: Renaud DUFOUR Date: May 2015
- K-means is a distance-based method for cluster analysis in data mining
- It partitions a set of data points into groups whose members are as similar to one another as possible
- Each group, called a cluster, is represented by its center (the centroid)
Given K, the number of clusters, k-means clustering works as follows:
- Select K points as initial centroids
- Repeat
  - Form K clusters by assigning each point to its closest centroid
  - Recompute the centroid of each cluster
- Until the convergence criterion is satisfied
- Different distance or similarity measures can be used (L1 norm, L2 norm, cosine similarity, ...)
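
The loop described above can be sketched in a few lines. This is a minimal, dependency-free Python sketch, not the app's own R code. The function names `kmeans`, `l1` and `l2`, the `init` and `seed` parameters, the toy points, and the stopping rule (assignments stop changing) are all assumptions made for illustration. Centroids are recomputed as plain means, which is only the optimal center under the L2 norm.

```python
import math
import random


def l2(p, q):
    """Euclidean (L2) distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def l1(p, q):
    """Manhattan (L1) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def kmeans(points, k, dist=l2, init=None, max_iter=100, seed=0):
    """Cluster `points` into `k` groups and return (labels, centroids)."""
    rng = random.Random(seed)
    # Select K points as initial centroids (random unless `init` is given).
    centroids = [tuple(p) for p in (init if init is not None else rng.sample(points, k))]
    labels = None
    for _ in range(max_iter):
        # Form K clusters by assigning each point to its closest centroid.
        new_labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        # Assumed convergence criterion: the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each centroid as the mean of its cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Hypothetical toy data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Passing `dist=l1` switches the assignment step to the L1 norm. Because the result depends on the initial centroids, running the sketch with several seeds and keeping the best partition is a common precaution.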
- The app illustrates k-means clustering on two datasets:
  - the built-in R iris dataset
  - a dataset dat1 involving embedded clusters
- It lets the user change the following parameters:
  - dataset to be used
  - variables on which the clustering is performed (note: 2D clustering only)
  - number of clusters
  - type of kernel: linear or radial (RBF)
- When a non-linear kernel is used, the data points are first projected into the kernel space before clustering is performed.
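
One common way to cluster in kernel space without computing the projection explicitly is the kernel trick: the squared distance between a projected point and a cluster mean can be written purely in terms of kernel values. The Python sketch below illustrates that idea and is not necessarily how the app does it. The names `rbf_kernel` and `kernel_kmeans`, the value `gamma=1.0`, and the toy data (a tight group inside a ring, starting from a hand-made labeling with one mislabeled point) are assumptions.

```python
import math


def rbf_kernel(p, q, gamma=1.0):
    """RBF kernel exp(-gamma * ||p - q||^2); gamma is an assumed, hand-picked value."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(p, q)))


def kernel_kmeans(points, labels, k, kernel, max_iter=50):
    """Refine initial `labels` using distances measured in the kernel space."""
    n = len(points)
    # Kernel (Gram) matrix: K[i][j] = <phi(x_i), phi(x_j)>.
    K = [[kernel(points[i], points[j]) for j in range(n)] for i in range(n)]
    labels = list(labels)
    for _ in range(max_iter):
        clusters = [[i for i in range(n) if labels[i] == c] for c in range(k)]
        # Mean pairwise kernel value inside each cluster (squared norm of its mean).
        within = [
            sum(K[a][b] for a in m for b in m) / len(m) ** 2 if m else 0.0
            for m in clusters
        ]
        new_labels = []
        for i in range(n):
            dists = []
            for c, m in enumerate(clusters):
                if not m:  # an empty cluster can never be the closest one
                    dists.append(math.inf)
                    continue
                cross = sum(K[i][j] for j in m) / len(m)
                # ||phi(x_i) - mean_c||^2 expressed with kernel values only.
                dists.append(K[i][i] - 2.0 * cross + within[c])
            new_labels.append(dists.index(min(dists)))
        if new_labels == labels:  # stop when assignments are stable
            break
        labels = new_labels
    return labels


# Hypothetical "embedded clusters": a tight group surrounded by a ring of radius 2.
inner = [(0.0, 0.0), (0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]
outer = [(2 * math.cos(2 * math.pi * t / 8), 2 * math.sin(2 * math.pi * t / 8)) for t in range(8)]
points = inner + outer

# Rough starting labels: the first ring point is wrongly placed in cluster 0.
start = [0] * 6 + [1] * 7
result = kernel_kmeans(points, start, k=2, kernel=rbf_kernel)
print(result)
```

Like ordinary k-means, this procedure only reaches a local optimum, so the outcome depends on the starting labels and on the kernel parameter.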
- More information on the k-means algorithm is available on Wikipedia. I also recommend the Cluster Analysis in Data Mining class on Coursera, which inspired this app.
- Potential improvements include:
  - using interactive graphics (rCharts, googleVis)
  - computing clustering validation measures such as purity or normalized mutual information. Note that such external measures require knowing the true classes of the data points, which is the case for the two implemented datasets but not in general. Alternatively, one could use internal measures such as BetaCV.
  - implementing other kernels and letting the user tune kernel parameters (currently, the RBF kernel parameter is determined internally using a heuristic)
  - implementing alternative clustering techniques such as k-medians or k-medoids
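
As an illustration of the purity measure mentioned above, here is a minimal Python sketch. Purity credits each cluster with the size of its most frequent true class and divides the total by the number of points. The function name `purity` and the toy labels are assumptions for illustration and are not taken from the app.

```python
from collections import Counter


def purity(true_classes, cluster_labels):
    """Fraction of points that belong to the majority true class of their cluster."""
    per_cluster = {}
    for cls, lab in zip(true_classes, cluster_labels):
        per_cluster.setdefault(lab, Counter())[cls] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(true_classes)


# Hypothetical example: six points, two true classes, one point misclustered.
true_classes = ["a", "a", "a", "b", "b", "b"]
cluster_labels = [0, 0, 1, 1, 1, 1]
print(purity(true_classes, cluster_labels))
```

A purity of 1 means every cluster contains a single class. Purity can be inflated by increasing the number of clusters, which is one reason to report it alongside another measure such as normalized mutual information.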
- Feel free to contact me with any questions or suggestions!