UNC_data_bootcamp_module_19
For this challenge, we need to use our knowledge of Python via a Jupyter Notebook and Unsupervised Learning to predict if cryptocurrencies are affected by 24-hour or 7-day price changes.
The challenge requires us to follow a series of steps broken down into six larger sections. The Challenge Instructions are outlined within each section, and the sections are to be completed in the following order:
- Prepare the Data
- Find the Best Value for k Using the Original Scaled DataFrame
- Cluster Cryptocurrencies with K-means Using the Original Scaled Data
- Optimize Clusters with Principal Component Analysis
- Find the Best Value for k Using the PCA Data
- Cluster Cryptocurrencies with K-means Using the PCA Data
To start off, I renamed the Crypto_Clustering_starter_code.ipynb file to Crypto_Clustering_SDT.ipynb. I first viewed the file crypto_market_data.csv separately to understand the data source better, then loaded it into a DataFrame. From there I was able to acquire the metrics and plot the data needed to complete the challenge per the instructions:
- Use the StandardScaler() module from scikit-learn to normalize the data from the CSV file.
- Create a DataFrame with the scaled data and set the "coin_id" index from the original DataFrame as the index for the new DataFrame.
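The two preparation steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the four coins and their values are made up, and the `read_csv` path in the comment is an assumption about where the CSV lives.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up stand-in for crypto_market_data.csv. In the notebook this would be
# loaded with something like (path is an assumption):
#   pd.read_csv("crypto_market_data.csv", index_col="coin_id")
df_market = pd.DataFrame(
    {
        "price_change_percentage_24h": [1.0, -0.5, 2.5, 0.0],
        "price_change_percentage_7d": [7.0, 3.0, -4.0, 10.0],
    },
    index=pd.Index(["coin_a", "coin_b", "coin_c", "coin_d"], name="coin_id"),
)

# StandardScaler rescales each column to mean 0 and unit variance.
scaled_values = StandardScaler().fit_transform(df_market)

# Rebuild a DataFrame, reusing the column names and the "coin_id" index.
df_scaled = pd.DataFrame(
    scaled_values, columns=df_market.columns, index=df_market.index
)
print(df_scaled.round(3))
```

Reusing `df_market.index` keeps each scaled row tied to its coin, which is what later lets the hover tool show `coin_id`.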
In this section I will use the elbow method to find the best value for k per the instructions from the challenge:
- Create a list with the number of k values from 1 to 11.
- Create an empty list to store the inertia values.
- Create a for loop to compute the inertia with each possible value of k.
- Create a dictionary with the data to plot the elbow curve.
- Plot a line chart with all the inertia values computed with the different values of k to visually identify the optimal value for k.
- Answer the following question in your notebook: What is the best value for k?
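As a rough sketch of the elbow-method loop described above, the example below runs on synthetic data (three artificial blobs, not the crypto data). The `n_init` and `random_state` settings are assumptions chosen only to make the run repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs (not the crypto data).
rng = np.random.default_rng(0)
centers = [(0, 0), (5, 5), (0, 5)]
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in centers])

k_values = list(range(1, 12))  # 1 through 11 inclusive
inertia = []

# Fit one model per k and record its inertia (within-cluster sum of squares).
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(data)
    inertia.append(model.inertia_)

# Dictionary, then DataFrame, holding the data for the elbow curve.
elbow_data = {"k": k_values, "inertia": inertia}
df_elbow = pd.DataFrame(elbow_data)
print(df_elbow)

# In a notebook with hvPlot installed, the curve could be drawn with:
#   import hvplot.pandas
#   df_elbow.hvplot.line(x="k", y="inertia", xticks=k_values)
```

The "elbow" is the k after which the inertia stops dropping sharply; for this synthetic data that happens at the number of blobs that were generated.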
For this section I will use the following steps per the challenge instructions to cluster the cryptocurrencies using the best value for k found for the original scaled data:
- Initialize the K-means model with the best value for k.
- Fit the K-means model using the original scaled DataFrame.
- Predict the clusters to group the cryptocurrencies using the original scaled DataFrame.
- Create a copy of the original data and add a new column with the predicted clusters.
- Create a scatter plot using hvPlot as follows:
- Set the x-axis as "price_change_percentage_24h" and the y-axis as "price_change_percentage_7d".
- Color the graph points with the labels found using K-means.
- Add the "coin_id" column in the hover_cols parameter to identify the cryptocurrency represented by each data point.
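The clustering steps above might look something like the sketch below. The data is synthetic, `best_k = 3` is only a placeholder, and `plot_clusters` and the `cluster` column name are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled DataFrame (not the real crypto data).
rng = np.random.default_rng(1)
df_scaled = pd.DataFrame(
    rng.normal(size=(12, 2)),
    columns=["price_change_percentage_24h", "price_change_percentage_7d"],
    index=pd.Index([f"coin_{i}" for i in range(12)], name="coin_id"),
)

best_k = 3  # placeholder; use the k read off your own elbow curve

# Initialize, fit, and predict with K-means.
model = KMeans(n_clusters=best_k, n_init=10, random_state=0)
model.fit(df_scaled)
labels = model.predict(df_scaled)

# Work on a copy so the scaled features stay untouched.
df_clustered = df_scaled.copy()
df_clustered["cluster"] = labels
print(df_clustered.head())


def plot_clusters(df, x, y):
    """Scatter plot colored by cluster; hvPlot is imported only when called."""
    import hvplot.pandas  # noqa: F401  (registers the .hvplot accessor)

    return df.hvplot.scatter(x=x, y=y, by="cluster", hover_cols=["coin_id"])
```

In a notebook, calling `plot_clusters(df_clustered, x, y)` with two column names that exist in the DataFrame returns the interactive plot. Passing the axes as arguments lets the same helper serve any DataFrame that has a `cluster` column.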
In this section I will further refine the clusters using Principal Component Analysis (PCA) per the challenge instructions:
- Using the original scaled DataFrame, perform a PCA and reduce the features to three principal components.
- Retrieve the explained variance to determine how much information can be attributed to each principal component and then answer the following question in your notebook:
- What is the total explained variance of the three principal components?
- Create a new DataFrame with the PCA data and set the "coin_id" index from the original DataFrame as the index for the new DataFrame.
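A minimal sketch of the PCA steps above, again on synthetic data. The five `feature_*` columns are made up, and the `PC1`, `PC2`, `PC3` column names are an assumption chosen to match the axis names used in the scatter plot instructions.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in for the scaled DataFrame (not the real crypto data).
rng = np.random.default_rng(2)
df_scaled = pd.DataFrame(
    rng.normal(size=(12, 5)),
    columns=[f"feature_{i}" for i in range(5)],
    index=pd.Index([f"coin_{i}" for i in range(12)], name="coin_id"),
)

# Reduce the features to three principal components.
pca = PCA(n_components=3)
components = pca.fit_transform(df_scaled)

# Share of the total variance captured by each component, and their sum.
ratios = pca.explained_variance_ratio_
total_explained = ratios.sum()
print(ratios, total_explained)

# New DataFrame with the PCA data, keeping the "coin_id" index.
df_pca = pd.DataFrame(
    components, columns=["PC1", "PC2", "PC3"], index=df_scaled.index
)
print(df_pca.head())
```

The total explained variance asked for in the question is the sum of the three ratios, which is why `total_explained` is computed from `explained_variance_ratio_` rather than from `explained_variance_`.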
Following the challenge instructions, I will again use the elbow method on the PCA data to find the best value for k as follows:
- Create a list with the number of k-values from 1 to 11.
- Create an empty list to store the inertia values.
- Create a for loop to compute the inertia with each possible value of k.
- Create a dictionary with the data to plot the Elbow curve.
- Plot a line chart with all the inertia values computed with the different values of k to visually identify the optimal value for k.
- Answer the following question in your notebook:
- What is the best value for k when using the PCA data?
- Does it differ from the best k value found using the original data?
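To help answer whether the best k differs between the two datasets, one option is to compute both inertia curves with the same helper and view them side by side. The sketch below uses synthetic data, and `inertia_curve` is a hypothetical helper name, not part of the starter code.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def inertia_curve(data, k_values):
    """Return the K-means inertia for each k in k_values."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_
        for k in k_values
    ]


# Synthetic stand-ins for the scaled data and its three-component PCA version.
rng = np.random.default_rng(3)
scaled = rng.normal(size=(30, 6))
pca_data = PCA(n_components=3).fit_transform(scaled)

k_values = list(range(1, 12))  # 1 through 11 inclusive
df_compare = pd.DataFrame(
    {
        "k": k_values,
        "inertia_scaled": inertia_curve(scaled, k_values),
        "inertia_pca": inertia_curve(pca_data, k_values),
    }
)
print(df_compare)
```

The PCA curve usually sits lower because the discarded components no longer contribute variance, so it is the position of the bend, not the height of the curve, that should be compared.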
Finally, I will complete the following steps per the challenge instructions to cluster the cryptocurrencies using the best value for k found for the PCA data:
- Initialize the K-means model with the best value for k.
- Fit the K-means model using the PCA data.
- Predict the clusters to group the cryptocurrencies using the PCA data.
- Create a copy of the DataFrame with the PCA data and add a new column to store the predicted clusters.
- Create a scatter plot using hvPlot as follows:
- Set the x-axis as "PC1" and the y-axis as "PC2".
- Color the graph points with the labels found using K-means.
- Add the "coin_id" column in the hover_cols parameter to identify the cryptocurrency represented by each data point.
- Answer the following question:
- What is the impact of using fewer features to cluster the data using K-Means?
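One way to explore the last question is to cluster the same points twice, once with all features and once with the PCA features, and cross-tabulate the two sets of labels. This is only an illustrative sketch on synthetic blobs; the cross-tabulation is a suggested check, not something the challenge instructions ask for.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic data: three well-separated blobs in six dimensions.
rng = np.random.default_rng(4)
centers = np.array([[0.0] * 6, [6.0] * 6, [-6.0] * 6])
features = np.vstack(
    [rng.normal(loc=c, scale=0.5, size=(15, 6)) for c in centers]
)
pca_features = PCA(n_components=3).fit_transform(features)

k = 3  # placeholder; use the k chosen from your own elbow curves

# Cluster once on all features and once on the reduced PCA features.
labels_all = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
labels_pca = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(pca_features)

# Rows: clusters from all features; columns: clusters from PCA features.
agreement = pd.crosstab(
    pd.Series(labels_all, name="all_features"),
    pd.Series(labels_pca, name="pca_features"),
)
print(agreement)
```

If each row of the table has a single non-zero cell, the two clusterings group the points identically up to label numbering, which suggests that little clustering information was lost by using fewer features.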
Module 19 Instructions
starter_code
- Crypto_Clustering_starter_code.ipynb
Resources
- crypto_market_data.csv
Special Thanks: (for Challenge overview discussions during BootCamp office hours)
- Jamie Miller
- Mounika Mamindla
- Lisa Shemanciik
(links to the websites will be provided where possible)
- pandas documentation
- hvplot documentation
- scikit-learn documentation
- YouTube (various videos)
- GitHub