From 98a12e58f9834c36d9c5f866e0a60db9033119af Mon Sep 17 00:00:00 2001
From: Jen Looper
Date: Thu, 17 Jun 2021 21:20:44 -0400
Subject: [PATCH] a note about the data used

---
 5-Clustering/1-Visualize/README.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/5-Clustering/1-Visualize/README.md b/5-Clustering/1-Visualize/README.md
index 9014cb0b7d..f0fa973a78 100644
--- a/5-Clustering/1-Visualize/README.md
+++ b/5-Clustering/1-Visualize/README.md
@@ -1,6 +1,6 @@
 # Introduction to clustering
 
-Clustering is a type of [Unsupervised Learning](https://wikipedia.org/wiki/Unsupervised_learning) that presumes that a dataset is unlabelled. It uses various algorithms to sort through unlabeled data and provide groupings according to patterns it discerns in the data.
+Clustering is a type of [Unsupervised Learning](https://wikipedia.org/wiki/Unsupervised_learning) that presumes that a dataset is unlabelled or that its inputs are not matched with predefined outputs. It uses various algorithms to sort through unlabeled data and provide groupings according to patterns it discerns in the data.
 
 [![No One Like You by PSquare](https://img.youtube.com/vi/ty2advRiWJM/0.jpg)](https://youtu.be/ty2advRiWJM "No One Like You by PSquare")
 
@@ -211,6 +211,8 @@ df.describe()
 | 75% | 2017 | 242098.5 | 31 | 0.8295 | 0.403 | 0.87575 | 0.000234 | 0.164 | -3.331 | 0.177 | 125.03925 | 4 |
 | max | 2020 | 511738 | 73 | 0.966 | 0.954 | 0.995 | 0.91 | 0.811 | 0.582 | 0.514 | 206.007 | 5 |
 
+> 🤔 If we are working with clustering, an unsupervised method that does not require labeled data, why are we showing this data with labels? In the data exploration phase, they come in handy, but they are not necessary for the clustering algorithms to work. You could just as well remove the column headers and refer to the data by column number.
+
 Look at the general values of the data. Note that popularity can be '0', which show songs that have no ranking. Let's remove those shortly.
 
 Use a barplot to find out the most popular genres: