illyanyc/unit13-ClusteringCrypto

Illya Nayshevsky, Ph.D. (LinkedIn: Illya Nayshevsky)


Columbia FinTech Bootcamp Assignment


Overview

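All tradable cryptocurrencies were clustered using the K-means model, and the resulting clusters were visualized and discussed. K-means clustering is a method of vector quantization in which n observations are partitioned into k clusters, with each observation belonging to the cluster with the nearest mean (Wikipedia). Despite the similar name, K-means is only loosely related to the k-nearest neighbors classifier: K-means is an unsupervised clustering method, while k-nearest neighbors is a supervised technique.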

The most common algorithm uses an iterative refinement technique, pictured below.

  1. k initial "means" (in this case k=3) are randomly generated within the data domain (shown in color).

  2. k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.

  3. The centroid of each of the k clusters becomes the new mean.

  4. Steps 2 and 3 are repeated until convergence has been reached.
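The four steps above can be sketched in plain Python. This is a minimal illustration, not the implementation used in this project (which relies on scikit-learn); the function names `squared_distance` and `kmeans`, the toy `points`, and the choice to seed the means from existing observations are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of tuples) into k groups; returns (means, labels)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct observations as the initial means
    # (an assumed variant of "random means within the data domain").
    means = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every observation to its nearest mean.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, means[j]))
            for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each mean to the centroid of its cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old mean if a cluster ends up empty
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return means, labels


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
means, labels = kmeans(points, k=2)
```

On this toy data the first three points should end up sharing one label and the last three the other, with each mean sitting at the centroid of its group.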

All images were obtained from Wikipedia

Requirements

A new conda environment and Jupyter Notebook/Lab should be used. Data visualization is done with Plotly and hvPlot.

conda install -c conda-forge jupyterlab

pip install pandas
pip install matplotlib
pip install -U scikit-learn

conda install hvplot
conda install -c plotly plotly
conda install -c plotly plotly_express

conda install -c conda-forge nodejs
conda update nodejs

For Jupyter Notebook support:

conda install "notebook>=5.3" "ipywidgets>=7.5"

For Jupyter Lab support:

conda install jupyterlab "ipywidgets>=7.5"

# JupyterLab renderer support
jupyter labextension install jupyterlab-plotly@4.14.3

# OPTIONAL: Jupyter widgets extension
jupyter labextension install @jupyter-widgets/jupyterlab-manager plotlywidget@4.14.3

Node.js v14.17.0 or higher is required to build Jupyter Lab Extensions.

The instructions on how to update Node.js can be found here: How to Easily Update Node.js to the Latest Version


Data

  1. Cryptocurrency data was obtained using the CryptoCompare API.

  2. The following columns within the dataset were used in the analysis:

['CoinName','Algorithm','IsTrading','ProofType','TotalCoinsMined','TotalCoinSupply']
  3. The dataset was filtered to keep only rows where:

    • IsTrading is True
    • ProofType is not "N/A"
    • No values are null
    • TotalCoinsMined > 0
  4. The CoinName column was isolated.

  5. Non-numerical values were encoded.

df_encoded = pd.get_dummies(df, columns=['Algorithm','ProofType'])
  6. All data was scaled.
from sklearn.preprocessing import StandardScaler, MinMaxScaler

df_scaled = StandardScaler().fit_transform(df_encoded)
  7. Dimensionality was reduced to 3 principal components.
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
crypto_pca = pca.fit_transform(df_scaled)

The filtered dataset, before encoding and scaling, looks as follows:

       Algorithm  IsTrading  ProofType  TotalCoinsMined  TotalCoinSupply
42     Scrypt     True       PoW/PoS    4.199995e+01     42
404    Scrypt     True       PoW/PoS    1.055185e+09     532000000
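The filtering and encoding steps listed above could look roughly like the following pandas sketch. The rows, coin names, and index labels are hypothetical toy values invented for illustration (the real data comes from the CryptoCompare API), and the exact filtering code used in the notebook may differ.

```python
import pandas as pd

# Hypothetical toy rows using the column names from this README.
df = pd.DataFrame(
    {
        "CoinName": ["CoinA", "CoinB", "CoinC", "CoinD", "CoinE"],
        "Algorithm": ["Scrypt", "SHA-256", "Scrypt", "X11", "SHA-256"],
        "IsTrading": [True, True, False, True, True],
        "ProofType": ["PoW/PoS", "PoW", "PoW", "N/A", "PoS"],
        "TotalCoinsMined": [41.99, 1.0e9, 5.0e6, 2.0e6, None],
        "TotalCoinSupply": [42.0, 5.32e8, 1.0e6, 0.0, 2.1e7],
    },
    index=["A", "B", "C", "D", "E"],
)

df = df[df["IsTrading"]]            # keep coins that are trading
df = df[df["ProofType"] != "N/A"]   # drop rows with an "N/A" proof type
df = df.dropna()                    # drop rows containing null values
df = df[df["TotalCoinsMined"] > 0]  # keep coins that have been mined

# Set the coin names aside so they are not encoded as features.
coin_names = df[["CoinName"]]
df = df.drop(columns="CoinName")

# One-hot encode the text columns, as in step 5.
df_encoded = pd.get_dummies(df, columns=["Algorithm", "ProofType"])
```

Only rows "A" and "B" survive all four filters here: "C" is not trading, "D" has an "N/A" proof type, and "E" has a null TotalCoinsMined.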

Clustering Using K-Means

Given a set of observations $(x_1, x_2, \dots, x_n)$, where each observation is a d-dimensional real vector, k-means partitions the n observations into k (≤ n) sets so as to minimize the within-cluster sum of squares (WCSS), i.e. the within-cluster variance (Wikipedia).
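Written out, the objective reads as follows. The symbols $S = \{S_1, \dots, S_k\}$ for the k sets and $\mu_i$ for the mean of the points in $S_i$ are notation assumed here for illustration; they do not appear elsewhere in this README.

```latex
\underset{S}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x
```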

Computational Methods

Given an initial set of k means $m_1^{(1)}, \dots, m_k^{(1)}$, the algorithm proceeds by alternating between two steps:

1. Assignment step: Assign each observation to the cluster with the nearest mean, that is, the mean with the least squared Euclidean distance. (Mathematically, this means partitioning the observations according to the Voronoi diagram generated by the means.) (Wikipedia)

2. Update step: Recalculate the means (centroids) of the observations assigned to each cluster. (Wikipedia)

The iterations of the k-means algorithm are visualized below:

Determining Best Value for k

The best value for k was determined by calculating inertia for each potential value of k.

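from sklearn.cluster import KMeans

# pcs_df is assumed to hold the three principal components computed above
# (its construction is not shown in this README)
inertia = []
k = list(range(1, 11))  # candidate values of k: 1 through 10

# Fit a K-Means model for each k and record its inertia
for i in k:
    km = KMeans(n_clusters=i, random_state=0)
    km.fit(pcs_df)
    inertia.append(km.inertia_)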
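For intuition, the inertia that scikit-learn reports is the sum of squared distances from each sample to its closest cluster center. A minimal plain-Python sketch of that quantity, using a hypothetical helper named `inertia_of` and made-up toy points and centers, might look like this:

```python
def inertia_of(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    total = 0.0
    for p in points:
        total += min(
            sum((x - c) ** 2 for x, c in zip(p, center)) for center in centers
        )
    return total


# Hypothetical toy data: two pairs of points far apart on the x axis.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

inertia_k1 = inertia_of(points, [(5.0, 1.0)])               # one center in the middle
inertia_k2 = inertia_of(points, [(0.0, 1.0), (10.0, 1.0)])  # one center per pair
```

Going from one center to two drops the inertia sharply on this data. Adding further centers can only shave off the little that remains, and that flattening is what the elbow in the curve reflects.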

From the resulting elbow curve diagram it was determined that k = 4 was optimal.

[Figure: elbow curve]

K-Means Clustering Result

The K-means model was then instantiated and fit with k=4.

# Initialize the K-Means model
model = KMeans(n_clusters=4, random_state=0)

# Fit the model
model.fit(df)

# Predict clusters
pred = model.predict(df)

The results were then visualized:

The table of total tradable cryptocurrencies shows that the clusters are spread out along the TotalCoinSupply / TotalCoinsMined slope, seemingly undifferentiated.

[Figure: TotalCoinSupply vs. TotalCoinsMined]

However, the 3D scatter plot of the principal components, where x, y, and z are Principal Components 1, 2, and 3 respectively, shows that the data is in fact separated into distinct clusters. Because each principal component is a combination of the original features, the clusters cannot be directly attributed to specific underlying variables, and the results cannot be plotted on raw data axes.

[Figure: 3D scatter plot of the principal components]
