There are many clustering models out there. In this notebook, we will present one of the simplest of them. Despite its simplicity, k-means is widely used for clustering in many data science applications, and it is especially useful when you need to quickly discover insights from unlabeled data. In this notebook, you will learn how to use k-means for customer segmentation.
Some real-world applications of k-means:
- Customer segmentation
- Understanding what the visitors of a website are trying to accomplish
- Pattern recognition
- Machine learning
- Data compression
In this notebook we practice k-means clustering with two examples:
- k-means on a randomly generated dataset (a minimal sketch follows this list)
- Using k-means for customer segmentation
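Here is a minimal sketch of the first example, assuming scikit-learn is available; the make_blobs parameters and the choice of n_clusters=4 are illustrative assumptions, not values from the lab:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate a random dataset with 4 blob-shaped clusters (parameters assumed)
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.9, random_state=0)

# Fit k-means with k-means++ initialization and 12 restarts
k_means = KMeans(init="k-means++", n_clusters=4, n_init=12, random_state=0)
k_means.fit(X)

print(k_means.labels_[:10])      # cluster label assigned to each sample
print(k_means.cluster_centers_)  # coordinates of the 4 centroids
```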
We will be looking at a clustering technique called Agglomerative Hierarchical Clustering. Remember that agglomerative is the bottom-up approach.
In this lab, we will use agglomerative clustering, which is more popular than divisive clustering.
We will also be using Complete Linkage as the linkage criterion.
NOTE: You can also try using Average Linkage wherever Complete Linkage would be used to see the difference!
Use the function distance_matrix, which requires two inputs. Use the feature matrix X2 as both inputs and save the result to a variable called dist_matrix.
Remember that the distance values are symmetric, with a diagonal of 0s. This is one way of making sure your matrix is correct.
(Print out dist_matrix to make sure it's correct.)
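A sketch of this step, assuming X2 is the feature matrix prepared earlier in the lab:

```python
from scipy.spatial import distance_matrix

# Pairwise Euclidean distances between all rows of X2
dist_matrix = distance_matrix(X2, X2)
print(dist_matrix)  # should be symmetric, with zeros on the diagonal
```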
Hierarchical clustering is typically visualized as a dendrogram, as shown in the following cell. Each merge is represented by a horizontal line, and the y-coordinate of the horizontal line is the distance between the two clusters that were merged, with individual data points treated as singleton clusters. By moving up from the bottom layer to the top node, a dendrogram allows us to reconstruct the history of merges that resulted in the depicted clustering.
Next, we will save the dendrogram to a variable called dendro. In doing this, the dendrogram will also be displayed. Using the dendrogram function from hierarchy, pass in the parameter (see the sketch after this step):
- Z
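A sketch of this step, assuming Z is the linkage matrix built from dist_matrix with complete linkage as discussed above (note that scipy may warn when a square distance matrix, rather than a condensed one, is passed to linkage):

```python
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy

# Build the linkage matrix with complete linkage, then plot the dendrogram
Z = hierarchy.linkage(dist_matrix, method="complete")
dendro = hierarchy.dendrogram(Z)  # drawing happens as a side effect
plt.show()
```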
Our objective here is to use clustering methods to find the most distinctive clusters of vehicles. This summarizes the existing vehicles and helps manufacturers make decisions about the supply of new models.
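As a hedged sketch of how that clustering step might look with scikit-learn, assuming a preprocessed vehicle feature matrix named feature_mtx (the name and the choice of 6 clusters are assumptions for illustration):

```python
from sklearn.cluster import AgglomerativeClustering

# Bottom-up clustering with complete linkage, as discussed above
agglom = AgglomerativeClustering(n_clusters=6, linkage="complete")
labels = agglom.fit_predict(feature_mtx)  # one cluster id per vehicle
print(labels)
```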
Most of the traditional clustering techniques, such as k-means, hierarchical, and fuzzy clustering, can be used to group data without supervision.
However, when applied to tasks with arbitrarily shaped clusters, or clusters within clusters, the traditional techniques might not achieve good results. That is, elements in the same cluster might not share enough similarity, or the performance may be poor. In contrast, density-based clustering locates regions of high density that are separated from one another by regions of low density. Density, in this context, is defined as the number of points within a specified radius.
DBSCAN is especially good for tasks like class identification in a spatial context. The wonderful attribute of the DBSCAN algorithm is that it can find clusters of arbitrary shape without being affected by noise. For example, the following example clusters the locations of weather stations in Canada. DBSCAN can be used here, for instance, to find the group of stations that show the same weather conditions. As you can see, it not only finds arbitrarily shaped clusters, it also finds the denser parts of the samples while ignoring less-dense areas and noise.
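A toy illustration of that behavior, using a synthetic two-moons dataset rather than the weather stations (make_moons and the eps/min_samples values are illustrative assumptions):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles: a shape k-means cannot separate well
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(set(db.labels_))  # cluster ids; -1 marks points treated as noise
```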
Let's start working with the data. We will follow this workflow:
- Loading the data
- Overviewing the data
- Data cleaning
- Data selection
- Clustering
Environment Canada
Monthly Values for July 2015
Name in the table | Meaning |
---|---|
Stn_Name | Station Name |
Lat | Latitude (North+, degrees) |
Long | Longitude (West - , degrees) |
Prov | Province |
Tm | Mean Temperature (°C) |
DwTm | Days without Valid Mean Temperature |
D | Mean Temperature difference from Normal (1981-2010) (°C) |
Tx | Highest Monthly Maximum Temperature (°C) |
DwTx | Days without Valid Maximum Temperature |
Tn | Lowest Monthly Minimum Temperature (°C) |
DwTn | Days without Valid Minimum Temperature |
S | Snowfall (cm) |
DwS | Days without Valid Snowfall |
S%N | Percent of Normal (1981-2010) Snowfall |
P | Total Precipitation (mm) |
DwP | Days without Valid Precipitation |
P%N | Percent of Normal (1981-2010) Precipitation |
S_G | Snow on the ground at the end of the month (cm) |
Pd | Number of days with Precipitation 1.0 mm or more |
BS | Bright Sunshine (hours) |
DwBS | Days without Valid Bright Sunshine |
BS% | Percent of Normal (1981-2010) Bright Sunshine |
HDD | Degree Days below 18 °C |
CDD | Degree Days above 18 °C |
Stn_No | Climate station identifier (first 3 digits indicate drainage basin, last 4 characters are for sorting alphabetically). |
NA | Not Available |
Visualization of stations on a map using the basemap package. The matplotlib basemap toolkit is a library for plotting 2D data on maps in Python. Basemap does not do any plotting on its own, but provides the facilities to transform coordinates to map projections.
Please note that the size of each data point represents the average of the maximum temperatures for each station in a year:
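A sketch of the visualization step, assuming the cleaned data sits in a DataFrame called pdf with Lat, Long, and Tx columns (the DataFrame name, the bounding box, and the size scaling are assumptions) and that the basemap toolkit is installed:

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

# Mercator view roughly covering Canada (bounding box is an assumption)
my_map = Basemap(projection="merc", resolution="l",
                 llcrnrlon=-140, llcrnrlat=40, urcrnrlon=-50, urcrnrlat=65)
my_map.drawcoastlines()
my_map.drawcountries()

# Project lon/lat to map coordinates; marker size scales with Tx
xs, ys = my_map(np.asarray(pdf.Long), np.asarray(pdf.Lat))
my_map.scatter(xs, ys, s=np.clip(pdf.Tx, 1, None) * 3, alpha=0.6)
plt.show()
```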
DBSCAN from the sklearn library can run DBSCAN clustering from a vector array or distance matrix. In our case, we pass it the NumPy array Clus_dataSet to find core samples of high density and expand clusters from them.
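A sketch of the clustering call itself; the eps and min_samples values here are assumptions for illustration:

```python
from sklearn.cluster import DBSCAN

# Fit DBSCAN on the standardized feature array prepared earlier
db = DBSCAN(eps=0.15, min_samples=10).fit(Clus_dataSet)
labels = db.labels_  # cluster id per station; -1 marks noise/outliers
print(set(labels))
```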
Cluster 0, Avg Temp: -5.538747553816051
Cluster 1, Avg Temp: 1.9526315789473685
Cluster 2, Avg Temp: -9.195652173913045
Cluster 3, Avg Temp: -15.300833333333333
Cluster 4, Avg Temp: -7.769047619047619
Cluster 0, Avg Temp: 6.2211920529801334
Cluster 1, Avg Temp: 6.790000000000001
Cluster 2, Avg Temp: -0.49411764705882355
Cluster 3, Avg Temp: -13.877209302325586
Cluster 4, Avg Temp: -4.186274509803922
Cluster 5, Avg Temp: -16.301503759398482
Cluster 6, Avg Temp: -13.599999999999998
Cluster 7, Avg Temp: -9.753333333333334
Cluster 8, Avg Temp: -4.258333333333334
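For reference, per-cluster averages like the ones above could be computed along these lines, assuming pdf carries a Tm column and a Clus_Db column holding the DBSCAN labels (the column names are assumptions):

```python
import numpy as np

# Print the mean temperature of each cluster, skipping noise points (-1)
for clust_number in sorted(set(labels)):
    if clust_number == -1:
        continue
    cluster = pdf[pdf.Clus_Db == clust_number]
    print("Cluster {}, Avg Temp: {}".format(clust_number, np.mean(cluster.Tm)))
```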