
A brief explanation of various types of clustering, with projects on real-world examples using well-organized notebooks for each type of clustering.


Different Types of Clustering: Unsupervised Learning

Introduction

There are many models for clustering out there. In this notebook, we present one that is considered among the simplest: K-means. Despite its simplicity, K-means is widely used for clustering in many data science applications, and it is especially useful when you need to quickly discover insights from unlabeled data. In this notebook, you will learn how to use K-means for customer segmentation.

Some real-world applications of k-means:

  • Customer segmentation
  • Understand what the visitors of a website are trying to accomplish
  • Pattern recognition
  • Machine learning
  • Data compression

In this notebook we practice k-means clustering with two examples:

  • k-means on a randomly generated dataset (sketched below)
  • Using k-means for customer segmentation
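
As a quick preview, here is a minimal sketch of the first example, assuming scikit-learn is installed (the blob centers and k=4 are illustrative choices, not values taken from the notebook):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate a random 2-D dataset with four blob-shaped clusters (illustrative)
X, _ = make_blobs(n_samples=5000,
                  centers=[[4, 4], [-2, -1], [2, -3], [1, 1]],
                  cluster_std=0.9, random_state=0)

# Fit k-means with k=4; "k-means++" picks well-spread initial centroids, and
# n_init restarts the algorithm several times to avoid bad local minima
k_means = KMeans(init="k-means++", n_clusters=4, n_init=12)
k_means.fit(X)

print(k_means.labels_[:10])      # cluster index assigned to the first 10 points
print(k_means.cluster_centers_)  # coordinates of the four learned centroids
```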

Customer Segmentation with K-Means

Imagine that you have a customer dataset and need to apply customer segmentation to this historical data. Customer segmentation is the practice of partitioning a customer base into groups of individuals that have similar characteristics. It is a significant strategy, as a business can target these specific groups of customers and allocate marketing resources effectively. For example, one group might contain customers who are high-profit and low-risk, that is, more likely to purchase products or subscribe to a service; the business task is to retain those customers. Another group might include customers from non-profit organizations. And so on.

Now, let's look at the distribution of customers based on their Age and Income:

[Figure: customer distribution by Age and Income]

Now, let's look at the distribution of customers based on their Education, Age, and Income:

[Figure: customer distribution by Education, Age, and Income]
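
A hedged sketch of how such a segmentation might be produced; the file name Cust_Segmentation.csv and the column names Age, Edu, and Income are assumptions for illustration:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load the customer data (hypothetical file and column names)
df = pd.read_csv("Cust_Segmentation.csv").dropna(subset=["Age", "Edu", "Income"])

# Standardize the features so Income does not dominate the Euclidean distance
X = StandardScaler().fit_transform(df[["Age", "Edu", "Income"]].values)

# Segment the customers into three groups (k chosen for illustration)
k_means = KMeans(init="k-means++", n_clusters=3, n_init=12)
df["Cluster"] = k_means.fit_predict(X)

# Profile each segment by its average age, education, and income
print(df.groupby("Cluster")[["Age", "Edu", "Income"]].mean())
```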

Hierarchical Clustering - Agglomerative

We will be looking at a clustering technique known as Agglomerative Hierarchical Clustering. Remember that agglomerative is the bottom-up approach.

In this notebook, we will be looking at Agglomerative clustering, which is more popular than Divisive clustering.

We will also be using Complete Linkage as the linkage criterion.
NOTE: You can also try using Average Linkage wherever Complete Linkage is used to see the difference!
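
A minimal sketch of agglomerative clustering with complete linkage, assuming scikit-learn (the generated data and n_clusters=4 are illustrative):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Illustrative stand-in for the notebook's generated feature matrix
X1, _ = make_blobs(n_samples=50, centers=4, cluster_std=0.9, random_state=0)

# Bottom-up clustering; "complete" linkage measures the distance between two
# clusters as the distance between their two farthest points
agglom = AgglomerativeClustering(n_clusters=4, linkage="complete")
labels = agglom.fit_predict(X1)
print(labels)

# Swap in linkage="average" above to compare the two criteria
```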

Dendrogram Associated with the Agglomerative Hierarchical Clustering

Remember that a distance matrix contains the distance from each point to every other point of a dataset.
Use the function distance_matrix, which requires two inputs. Use the feature matrix X2 as both inputs and save the distance matrix to a variable called dist_matrix.

Remember that the distance values are symmetric, with a diagonal of 0's. This is one way of making sure your matrix is correct.
(Print out dist_matrix to make sure it's correct.)
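
For example (X2 below is a random stand-in for the notebook's feature matrix):

```python
import numpy as np
from scipy.spatial import distance_matrix

X2 = np.random.rand(10, 2)  # stand-in for the feature matrix from the notebook

# Entry [i, j] holds the Euclidean distance from point i to point j
dist_matrix = distance_matrix(X2, X2)

print(np.allclose(dist_matrix, dist_matrix.T))  # True: the matrix is symmetric
print(np.allclose(np.diag(dist_matrix), 0))     # True: zeros on the diagonal
print(dist_matrix)
```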

A hierarchical clustering is typically visualized as a dendrogram, as shown in the following cell. Each merge is represented by a horizontal line. The y-coordinate of the horizontal line is the distance at which the two clusters were merged, where each data point starts out as a singleton cluster. By moving up from the bottom layer to the top node, a dendrogram allows us to reconstruct the history of merges that resulted in the depicted clustering.

Next, we will save the dendrogram to a variable called dendro. In doing this, the dendrogram will also be displayed. Using the dendrogram function from scipy.cluster.hierarchy, pass in the parameter:

  • Z
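
A sketch of that step, continuing from dist_matrix above; squareform converts the square distance matrix into the condensed form that linkage expects:

```python
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform

# Z encodes the sequence of complete-linkage merges
Z = hierarchy.linkage(squareform(dist_matrix), method="complete")

# Drawing the dendrogram also returns a dict describing the plot
dendro = hierarchy.dendrogram(Z)
plt.show()
```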

[Figure: dendrogram of the agglomerative clustering]

Clustering on Vehicle dataset

Imagine that an automobile manufacturer has developed prototypes for a new vehicle. Before introducing the new model into its range, the manufacturer wants to determine which existing vehicles on the market are most similar to the prototypes; that is, how vehicles can be grouped, which group is most similar to the model, and therefore which models it will be competing against.

Our objective here is to use clustering methods to find the most distinctive clusters of vehicles. This will summarize the existing vehicles and help the manufacturer make decisions about the supply of new models.
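
One way this might be sketched with SciPy's hierarchical clustering; the file cars_clus.csv, the feature columns, and the cut into 5 groups are assumptions for illustration:

```python
import pandas as pd
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist
from sklearn.preprocessing import MinMaxScaler

# Hypothetical file and column names for the vehicle dataset
pdf = pd.read_csv("cars_clus.csv")
features = pdf[["engine_s", "horsepow", "wheelbas", "width", "length",
                "curb_wgt", "fuel_cap", "mpg"]].dropna()

# Scale every feature to [0, 1] so no single attribute dominates the distance
X = MinMaxScaler().fit_transform(features.values)

# Complete-linkage hierarchical clustering, cut into 5 groups
Z = hierarchy.linkage(pdist(X), method="complete")
clusters = hierarchy.fcluster(Z, t=5, criterion="maxclust")
print(clusters[:20])
```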

Final Output:

[Figure: final clustering of the vehicle dataset]

Most of the traditional clustering techniques, such as k-means, hierarchical, and fuzzy clustering, can be used to group data without supervision.

However, when applied to tasks with arbitrary-shaped clusters, or clusters within clusters, the traditional techniques might be unable to achieve good results: elements in the same cluster might not share enough similarity, or the performance may be poor. In contrast, density-based clustering locates regions of high density that are separated from one another by regions of low density. Density, in this context, is defined as the number of points within a specified radius.

Weather Station Clustering using DBSCAN & scikit-learn


DBSCAN is especially good at tasks like class identification in a spatial context. A wonderful attribute of the DBSCAN algorithm is that it can find clusters of arbitrary shape without being affected by noise. For example, the following example clusters the locations of weather stations in Canada. DBSCAN can be used here, for instance, to find the group of stations that show the same weather conditions. As you can see, it not only finds arbitrarily shaped clusters, it also finds the denser parts of the data by ignoring less-dense areas and noise.

Let's start playing with the data. We will be working according to the following workflow (sketched in code below):

  1. Loading data
  2. Overview of the data
  3. Data cleaning
  4. Data selection
  5. Clustering
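
Steps 1-4 might look like this sketch; the CSV file name is an assumption, and the column names follow the table in the next section:

```python
import pandas as pd

# 1-2. Load the weather-station data and take a first look (file name assumed)
pdf = pd.read_csv("weather-stations20140101-20141231.csv")
print(pdf.head())

# 3. Data cleaning: drop stations with no valid mean temperature
pdf = pdf[pd.notnull(pdf["Tm"])].reset_index(drop=True)

# 4. Data selection: keep the location and temperature fields
pdf = pdf[["Stn_Name", "Lat", "Long", "Tm", "Tx", "Tn"]]
print(pdf.shape)
```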

About the dataset

Environment Canada
Monthly Values for July 2015

| Name in the table | Meaning |
|---|---|
| Stn_Name | Station Name |
| Lat | Latitude (North+, degrees) |
| Long | Longitude (West-, degrees) |
| Prov | Province |
| Tm | Mean Temperature (°C) |
| DwTm | Days without Valid Mean Temperature |
| D | Mean Temperature difference from Normal (1981-2010) (°C) |
| Tx | Highest Monthly Maximum Temperature (°C) |
| DwTx | Days without Valid Maximum Temperature |
| Tn | Lowest Monthly Minimum Temperature (°C) |
| DwTn | Days without Valid Minimum Temperature |
| S | Snowfall (cm) |
| DwS | Days without Valid Snowfall |
| S%N | Percent of Normal (1981-2010) Snowfall |
| P | Total Precipitation (mm) |
| DwP | Days without Valid Precipitation |
| P%N | Percent of Normal (1981-2010) Precipitation |
| S_G | Snow on the ground at the end of the month (cm) |
| Pd | Number of days with Precipitation 1.0 mm or more |
| BS | Bright Sunshine (hours) |
| DwBS | Days without Valid Bright Sunshine |
| BS% | Percent of Normal (1981-2010) Bright Sunshine |
| HDD | Degree Days below 18 °C |
| CDD | Degree Days above 18 °C |
| Stn_No | Climate station identifier (first 3 digits indicate drainage basin, last 4 characters are for sorting alphabetically) |
| NA | Not Available |

Visualization

Visualization of the stations on a map using the basemap package. The matplotlib basemap toolkit is a library for plotting 2D data on maps in Python. Basemap does not do any plotting on its own, but provides the facilities to transform coordinates to map projections.

Please notice that the size of each data point represents the average maximum temperature at each station over the year:
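
A sketch of the map, continuing from the pdf DataFrame loaded above and assuming the (now archived) basemap toolkit is installed; the bounding box values are illustrative:

```python
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

# Rough bounding box around Canada (illustrative values)
llon, ulon, llat, ulat = -140.0, -50.0, 40.0, 65.0
pdf = pdf[(pdf["Long"] > llon) & (pdf["Long"] < ulon) &
          (pdf["Lat"] > llat) & (pdf["Lat"] < ulat)]

m = Basemap(projection="merc", resolution="l",
            llcrnrlon=llon, llcrnrlat=llat, urcrnrlon=ulon, urcrnrlat=ulat)
m.drawcoastlines()
m.drawcountries()

# Project lon/lat into map coordinates and size each point by Tx
xs, ys = m(pdf["Long"].values, pdf["Lat"].values)
sizes = ((pdf["Tx"] - pdf["Tx"].min() + 1.0) * 2).values  # larger = warmer
m.scatter(xs, ys, s=sizes, marker="o", alpha=0.8)
plt.show()
```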

[Figure: weather stations plotted on a map of Canada, sized by maximum temperature]

Clustering of stations based on their location, i.e., Lat & Long

DBSCAN from the sklearn library performs DBSCAN clustering from a vector array or a distance matrix. In our case, we pass it the NumPy array Clus_dataSet to find core samples of high density and expand clusters from them. The resulting per-cluster average temperatures are:

Cluster 0, Avg Temp: -5.538747553816051
Cluster 1, Avg Temp: 1.9526315789473685
Cluster 2, Avg Temp: -9.195652173913045
Cluster 3, Avg Temp: -15.300833333333333
Cluster 4, Avg Temp: -7.769047619047619

[Figure: stations colored by location-based DBSCAN cluster]
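
A sketch of the clustering step that produces output like the above, continuing from pdf; the eps and min_samples values are illustrative choices:

```python
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Standardize Lat/Long so eps means the same thing in both dimensions
Clus_dataSet = StandardScaler().fit_transform(pdf[["Lat", "Long"]].values)

# Label -1 marks noise points that belong to no cluster
db = DBSCAN(eps=0.15, min_samples=10).fit(Clus_dataSet)
pdf["Clus_Db"] = db.labels_

# Average mean temperature per cluster (noise excluded)
for clust in sorted(set(db.labels_) - {-1}):
    print(f"Cluster {clust}, Avg Temp: {pdf.loc[pdf.Clus_Db == clust, 'Tm'].mean()}")
```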

Visualization of clusters based on location and temperature:

Cluster 0, Avg Temp: 6.2211920529801334
Cluster 1, Avg Temp: 6.790000000000001
Cluster 2, Avg Temp: -0.49411764705882355
Cluster 3, Avg Temp: -13.877209302325586
Cluster 4, Avg Temp: -4.186274509803922
Cluster 5, Avg Temp: -16.301503759398482
Cluster 6, Avg Temp: -13.599999999999998
Cluster 7, Avg Temp: -9.753333333333334
Cluster 8, Avg Temp: -4.258333333333334

[Figure: stations colored by DBSCAN clusters based on location and temperature]

Thanks for reading! For more such content, visit my profile: Sk70249
