GitHub - illyanyc/unit13-ClusteringCrypto

Columbia FinTech Bootcamp Assignment

Overview

All tradable cryptocurrencies were clustered using the K-means model, and the resulting clusters were visualized and discussed. K-means clustering is a method of vector quantization in which n observations are partitioned into k clusters, with each observation belonging to the cluster with the nearest mean (Wiki). Despite the similar name, K-means is only loosely related to the k-nearest neighbors classifier: K-means is an unsupervised clustering method, while k-nearest neighbors is a supervised one.

The most common algorithm uses an iterative refinement technique, pictured below.

1. k initial "means" (in this case k=3) are randomly generated within the data domain (shown in color).

2. k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.

3. The centroid of each of the k clusters becomes the new mean.

4. Steps 2 and 3 are repeated until convergence has been reached.
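The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation used later in this README; the function name `kmeans`, the toy data, and the choice to draw the initial means from the observations themselves are assumptions made for the example.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch following the four steps described above."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct observations as the initial means
    # (an assumption; any random points in the data domain would do).
    means = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 2: assign every observation to its nearest mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: the centroid of each cluster becomes the new mean;
        # an empty cluster keeps its previous mean.
        new_means = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
            for j in range(k)
        ])
        # Step 4: stop once the means no longer move.
        if np.allclose(new_means, means):
            break
        means = new_means
    return labels, means


# Toy data: three 2-D blobs of 50 points each (illustrative only).
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
labels, means = kmeans(X, k=3)
print(means)
```

The result depends on the random initial means, which is why library implementations usually run several initializations and keep the best one.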

All images were obtained from Wikipedia

Requirements

A new conda environment and Jupyter Notebook/Lab should be used. Data visualization is done with Plotly and hvPlot.

conda install -c conda-forge jupyterlab

pip install pandas
pip install matplotlib
pip install -U scikit-learn

conda install hvplot
conda install -c plotly plotly
conda install -c plotly plotly_express

conda install -c conda-forge nodejs
conda update nodejs

For Jupyter Notebook support:

conda install "notebook>=5.3" "ipywidgets>=7.5"

For Jupyter Lab support:

conda install jupyterlab "ipywidgets>=7.5"

# JupyterLab renderer support
jupyter labextension install jupyterlab-plotly@4.14.3

# OPTIONAL: Jupyter widgets extension
jupyter labextension install @jupyter-widgets/jupyterlab-manager plotlywidget@4.14.3

Node.js v14.17.0 or higher is required to build Jupyter Lab Extensions.

The instructions on how to update Node.js can be found here: How to Easily Update Node.js to the Latest Version

Data

Cryptocurrency data was obtained using the CryptoCompare API.
The following columns within the dataset were used in the analysis:

['CoinName','Algorithm','IsTrading','ProofType','TotalCoinsMined','TotalCoinSupply']

The dataset was filtered to keep only rows where:
- IsTrading is True
- ProofType is not "N/A"
- No values are null
- TotalCoinsMined > 0
The CoinName column was then isolated from the rest of the data.
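
The filtering steps listed above could look roughly like this in pandas. The rows in `raw` are hypothetical stand-ins for the CryptoCompare data, and the variable names `raw` and `coin_names` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical rows standing in for the CryptoCompare data.
raw = pd.DataFrame(
    {
        "CoinName": ["Coin A", "Coin B", "Coin C", "Coin D", "Coin E"],
        "Algorithm": ["Scrypt", "SHA-256", "X11", "Scrypt", "SHA-256"],
        "IsTrading": [True, False, True, True, True],
        "ProofType": ["PoW/PoS", "PoW", "N/A", "PoW", "PoS"],
        "TotalCoinsMined": [41.99995, 1.0e6, 5.0e5, None, 0.0],
        "TotalCoinSupply": [42.0, 2.1e7, 1.0e6, 5.0e5, 0.0],
    }
)

df = raw[raw["IsTrading"]]            # keep coins that are trading
df = df[df["ProofType"] != "N/A"]     # drop undefined proof types
df = df.dropna()                      # remove rows with null values
df = df[df["TotalCoinsMined"] > 0]    # keep coins that have been mined

coin_names = df[["CoinName"]]         # isolate the names for later labelling
df = df.drop(columns=["CoinName"])

print(coin_names)
```

In this toy frame only the first row passes every filter; each of the other rows fails exactly one of them.
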
Non-numerical values were encoded.

import pandas as pd

# One-hot encode the categorical columns
df_encoded = pd.get_dummies(df, columns=['Algorithm','ProofType'])

All data was scaled.

from sklearn.preprocessing import StandardScaler

df_scaled = StandardScaler().fit_transform(df_encoded)
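As a quick check of what encoding and scaling do, the sketch below runs both steps on a small hypothetical frame and confirms that every scaled column has zero mean and unit variance. The values are illustrative, and `dtype=float` is an addition to the call shown above so that the dummy columns are numeric regardless of the pandas version.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical filtered rows; values are illustrative only.
df = pd.DataFrame(
    {
        "Algorithm": ["Scrypt", "SHA-256", "Scrypt", "X11"],
        "ProofType": ["PoW/PoS", "PoW", "PoW", "PoS"],
        "TotalCoinsMined": [4.2e1, 1.8e7, 1.1e9, 9.0e6],
        "TotalCoinSupply": [4.2e1, 2.1e7, 5.3e8, 2.2e7],
    }
)

# One-hot encode the categorical columns, then standardize every column.
df_encoded = pd.get_dummies(df, columns=["Algorithm", "ProofType"], dtype=float)
df_scaled = StandardScaler().fit_transform(df_encoded)

print(df_encoded.columns.tolist())
print(df_scaled.mean(axis=0).round(6))  # each column is centered on 0
print(df_scaled.std(axis=0).round(6))   # and has unit variance
```

Note that `fit_transform` returns a NumPy array, so the column names from `df_encoded` have to be kept separately if they are needed later.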

Dimensionality was reduced to 3 principal components.

from sklearn.decomposition import PCA

pca = PCA(n_components=3)
crypto_pca = pca.fit_transform(df_scaled)
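One way to judge how much information three components retain is to look at `explained_variance_ratio_` on the fitted PCA object. The sketch below uses a synthetic random matrix in place of the scaled crypto features, so the printed ratios say nothing about the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the scaled feature matrix: 200 rows, 10 columns.
rng = np.random.default_rng(0)
df_scaled = rng.normal(size=(200, 10))

pca = PCA(n_components=3)
crypto_pca = pca.fit_transform(df_scaled)

print(crypto_pca.shape)                     # one row per observation, 3 columns
print(pca.explained_variance_ratio_)        # share of variance per component
print(pca.explained_variance_ratio_.sum())  # total share kept by 3 components
```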

The filtered dataset, shown here before encoding and scaling, has the following form:

	Algorithm	IsTrading	ProofType	TotalCoinsMined	TotalCoinSupply
42	Scrypt	True	PoW/PoS	4.199995e+01	42
404	Scrypt	True	PoW/PoS	1.055185e+09	532000000

Clustering Using K-Means

Given a set of observations (x1, x2, ..., xn), where each observation is a d-dimensional real vector, k-means partitions the n observations into k (≤ n) sets so as to minimize the within-cluster sum of squares (WCSS), i.e. the variance (Wiki).
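Written out, the objective described above takes the following standard form, where S = {S_1, ..., S_k} denotes the partition and \mu_i is the mean of the points in S_i (the symbols S and \mu_i are notation introduced here for the formula):

```latex
\underset{S}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^{2},
\qquad
\mu_i = \frac{1}{\lvert S_i \rvert} \sum_{x \in S_i} x
```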

Computational Methods

Given an initial set of k means $m_1^{(1)}, \ldots, m_k^{(1)}$, the algorithm proceeds by alternating between two steps:

1. Assignment step: Assign each observation to the cluster with the nearest mean, that is, the one with the least squared Euclidean distance. (Mathematically, this means partitioning the observations according to the Voronoi diagram generated by the means.) (Wiki)

2. Update step: Recalculate the means (centroids) of the observations assigned to each cluster. (Wiki)
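In symbols, the two steps can be written as follows, where t indexes the iteration and S_i^{(t)} is the set of observations assigned to cluster i at iteration t (notation assumed here, following the usual textbook presentation):

```latex
% Assignment step: each observation goes to the cluster whose mean is nearest
S_i^{(t)} = \left\{ x_p : \lVert x_p - m_i^{(t)} \rVert^{2} \le \lVert x_p - m_j^{(t)} \rVert^{2} \;\; \forall j,\ 1 \le j \le k \right\}

% Update step: each mean becomes the centroid of its assigned observations
m_i^{(t+1)} = \frac{1}{\lvert S_i^{(t)} \rvert} \sum_{x_j \in S_i^{(t)}} x_j
```

If an observation is equally close to two means, it is assigned to only one of them so that the sets remain a partition.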

The iterations of the k-means algorithm are visualized below:

Determining Best Value for k

The best value for k was determined by calculating inertia for each potential value of k.

from sklearn.cluster import KMeans

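inertia = []
k = list(range(1, 11))

# pcs_df is assumed to be a DataFrame holding the three principal components
# Fit a model for each candidate k and record its inertia
for i in k:
    km = KMeans(n_clusters=i, random_state=0)
    km.fit(pcs_df)
    inertia.append(km.inertia_)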

From the resulting elbow curve diagram it was determined that k = 4 was optimal.
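The elbow can also be read numerically by looking at how much the inertia (scikit-learn's name for the sum of squared distances of samples to their closest cluster center) falls with each extra cluster. The sketch below is a hedged illustration on synthetic data from `make_blobs`, not the crypto dataset, and `n_init=10` is set explicitly as an assumption so the behaviour does not depend on the library version.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 3-D data generated around four centers (not the crypto data).
X, _ = make_blobs(n_samples=400, centers=4, n_features=3,
                  cluster_std=1.0, random_state=0)

inertia = []
k_values = list(range(1, 11))
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(X)
    inertia.append(km.inertia_)

# Relative drop in inertia when going from k to k + 1 clusters.
for k, before, after in zip(k_values, inertia, inertia[1:]):
    drop = (before - after) / before
    print(f"k={k} -> k={k + 1}: inertia falls by {drop:.1%}")
```

The value of k after which the drops become small is the usual candidate for the elbow; the choice still involves judgment.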

K-Means Clustering Result

The K-means model was then instantiated and fit with k=4.

# Initialize the K-Means model
model = KMeans(n_clusters=4, random_state=0)

# Fit the model
model.fit(df)

# Predict clusters
pred = model.predict(df)
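Once predictions are available, they can be attached to the data and the cluster sizes inspected before plotting. This is a sketch on synthetic data; the names `pcs_df`, `clustered_df`, and the `Class` column are assumptions made for the example, and `fit_predict` is used as a shorthand for fitting and then predicting on the same data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the three principal components (not the crypto data).
rng = np.random.default_rng(0)
pcs_df = pd.DataFrame(rng.normal(size=(100, 3)),
                      columns=["PC 1", "PC 2", "PC 3"])

model = KMeans(n_clusters=4, n_init=10, random_state=0)
pred = model.fit_predict(pcs_df)   # fit and predict in one call

clustered_df = pcs_df.copy()
clustered_df["Class"] = pred       # hypothetical column name for the cluster label
print(clustered_df["Class"].value_counts().sort_index())
```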

The results were then visualized:

The table of total tradable cryptocurrencies shows that the clusters are spread out along the TotalCoinSupply / TotalCoinsMined slope, seemingly undifferentiated.

However, in the 3D scatter plot of the principal components, where x, y, and z are Principal Components 1, 2, and 3 respectively, it is evident that the data is in fact clustered properly. Because each principal component is a combination of the original features, it is difficult to relate the clusters back to the underlying data or to plot the results on raw data axes.
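One possible way to reduce this ambiguity is to inspect the component loadings, which show how strongly each original feature contributes to each principal component. The sketch below uses synthetic data and illustrative column names; it was not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in for the scaled features; column names are illustrative.
rng = np.random.default_rng(0)
features = ["TotalCoinsMined", "TotalCoinSupply",
            "Algorithm_Scrypt", "ProofType_PoW"]
X = rng.normal(size=(200, len(features)))

pca = PCA(n_components=3).fit(X)

# Each row of components_ holds the weights of the original features
# in one principal component.
loadings = pd.DataFrame(pca.components_, columns=features,
                        index=["PC 1", "PC 2", "PC 3"])
print(loadings.round(2))
```

Features with large absolute weights in a component are the ones that component mostly reflects.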
