Clustering is an unsupervised learning method. In unsupervised learning, we draw inferences from datasets consisting of input data without labeled responses. Generally, clustering is used to find meaningful structure, underlying processes, generative features, and groupings inherent in a set of examples.
Clustering is the task of dividing a population or set of data points into groups such that points in the same group are more similar to each other than to points in other groups. In essence, it collects objects into groups on the basis of their similarity and dissimilarity.
For example, the data points in the graph below that lie close together can be classified into a single group; we can distinguish three clusters in the picture.
Clustering is important because it reveals the intrinsic grouping in unlabeled data. There are no universal criteria for a good clustering; what counts as good depends on the user's needs. For instance, we could be interested in:
- Finding representatives for homogeneous groups (data reduction)
- Finding “natural clusters” and describing their unknown properties (“natural” data types)
- Finding useful and suitable groupings (“useful” data classes)
- Finding unusual data objects (outlier detection)
Density-based methods consider clusters as dense regions of mutually similar points, separated from regions of lower density in the space. They can discover clusters of arbitrary shape and can merge two clusters whose dense regions touch.
- Examples: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points to Identify Clustering Structure)
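To make the density-based idea concrete, here is a minimal pure-Python sketch of DBSCAN; the function name `dbscan` and the helper `neighbours` are illustrative choices, while the parameters `eps` (neighbourhood radius) and `min_pts` (density threshold) follow the algorithm's standard terminology.

```python
from collections import deque

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise.
    A core point has at least `min_pts` neighbours within `eps`
    (counting itself); clusters grow by expanding from core points."""
    def neighbours(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1          # provisionally noise; may become a border point
            continue
        labels[i] = cluster
        queue = deque(nbrs)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:     # noise reclaimed as a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbours(j)
            if len(jn) >= min_pts:  # j is itself a core point: keep expanding
                queue.extend(jn)
        cluster += 1
    return labels
```

Note how the isolated point below gets the noise label -1, something partitioning methods like K-means cannot express:

```python
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (50, 50)]
print(dbscan(pts, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```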
Hierarchical methods form clusters in a tree-type structure (a dendrogram): new clusters are built by merging or splitting previously formed ones.
- Categories: Agglomerative (bottom-up approach), Divisive (top-down approach)
- Examples: CURE (Clustering Using Representatives), BIRCH (Balanced Iterative Reducing Clustering and using Hierarchies)
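The agglomerative (bottom-up) approach can be sketched in a few lines of pure Python; this is plain single-linkage clustering, not CURE or BIRCH, and the function name `single_linkage` is an illustrative choice.

```python
def single_linkage(points, k):
    """Agglomerative clustering: start with one cluster per point,
    repeatedly merge the two clusters whose closest members are
    nearest (single linkage), and stop when k clusters remain."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist2(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # merge the closest pair of clusters
    return clusters
```

Running the merges all the way down to k = 1 instead would trace out the full dendrogram; a divisive method would build the same tree top-down by recursively splitting.

```python
print(single_linkage([(0, 0), (0, 1), (5, 5), (5, 6)], 2))
# [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]
```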
Partitioning methods divide the objects into k partitions, each forming one cluster, and optimize an objective criterion, such as a distance-based similarity function.
- Examples: K-means, CLARANS (Clustering Large Applications based upon Randomized Search)
Grid-based methods quantize the data space into a finite number of cells that form a grid-like structure. Clustering operations performed on this grid are fast and largely independent of the number of data objects.
- Examples: STING (Statistical Information Grid), WaveCluster, CLIQUE (CLustering In Quest)
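A toy sketch shows why grid-based clustering scales well: points are hashed into cells once, and everything after that works on cells rather than points. This is a simplified CLIQUE-style density grid, not any of the named algorithms; `grid_cluster`, `cell`, and `threshold` are illustrative names.

```python
from collections import defaultdict, deque

def grid_cluster(points, cell, threshold):
    """Hash 2-D points into square cells of side `cell`, keep cells
    holding at least `threshold` points, and join side- or
    corner-adjacent dense cells into clusters. After the initial pass,
    cost depends on the number of dense cells, not on the point count."""
    cells = defaultdict(list)
    for p in points:
        cells[(int(p[0] // cell), int(p[1] // cell))].append(p)
    dense = {c for c, pts in cells.items() if len(pts) >= threshold}

    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        group, queue = [], deque([start])
        seen.add(start)
        while queue:                    # flood-fill over adjacent dense cells
            cx, cy = queue.popleft()
            group.extend(cells[(cx, cy)])
            for dx in (-1, 0, 1):       # visit the 8 surrounding cells
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        clusters.append(group)
    return clusters
```

Points falling in sparse cells (like a lone outlier) simply never appear in any cluster, which is the grid analogue of noise handling.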
K-means is one of the simplest unsupervised learning algorithms for solving clustering problems. It partitions n observations into k clusters, assigning each observation to the cluster with the nearest mean (centroid), which serves as the prototype of the cluster.
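The assign-then-update loop described above (Lloyd's algorithm) can be written in pure Python; the function name `kmeans` and the `iters`/`seed` parameters are illustrative choices.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points,
    repeating until the centroids stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # initialise centroids at k random points
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: recompute each centroid as its cluster's mean.
        new_centroids = [
            tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters
```

On two well-separated blobs, the loop recovers them regardless of which points are drawn as the initial centroids:

```python
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, 2)
```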
- Marketing: Characterize and discover customer segments for marketing purposes.
- Biology: Grouping different species of plants and animals.
- Libraries: Clustering different books based on topics and information.
- Insurance: Segment customers and their policies, and identify fraud.
- City Planning: Group houses and study their values based on geographical locations and other factors.
- Earthquake Studies: Determine dangerous zones by learning about earthquake-affected areas.
Feel free to contribute to this repository by adding more clustering methods, algorithms, and applications!