Columbia FinTech Bootcamp Assignment
All tradable cryptocurrencies were clustered using the K-means model, and the resulting clusters were visualized and discussed. K-means clustering is a method of vector quantization in which n observations are partitioned into k clusters, with each observation belonging to the cluster with the nearest mean (Wiki). K-means is only loosely related to the k-nearest neighbors classifier: k-means is an unsupervised clustering method, while k-nearest neighbors is a supervised one.
The most common algorithm uses an iterative refinement technique, pictured below.
1. k initial "means" (in this case k=3) are randomly generated within the data domain (shown in color).
2. k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.
3. The centroid of each of the k clusters becomes the new mean.
4. Steps 2 and 3 are repeated until convergence has been reached.

All images were obtained from Wikipedia.
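
The four steps above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the scikit-learn implementation used later in the analysis. The function name `lloyd_kmeans` and its parameters are hypothetical, and the sketch assumes the initial means are drawn from the observations themselves rather than generated anywhere in the data domain.

```python
import numpy as np


def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Illustrative k-means loop (hypothetical helper, not from the original project)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct observations as the initial means
    # (an assumption of this sketch; other initializations exist).
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every observation to its nearest mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: the centroid of each cluster becomes the new mean
        # (an empty cluster keeps its previous mean).
        new_means = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
            for j in range(k)
        ])
        # Step 4: stop once the means no longer move.
        if np.allclose(new_means, means):
            break
        means = new_means
    return labels, means


# Two obvious groups of points, made up for illustration.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, means = lloyd_kmeans(X, k=2)
```

On this toy input the first two points should share one label and the last two the other, whichever observations happen to be chosen as the initial means.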
A new conda environment and Jupyter Notebook/Lab should be used. Data visualization is done with Plotly and hvPlot.

    conda install -c conda-forge jupyterlab
    pip install pandas
    pip install matplotlib
    pip install -U scikit-learn
    conda install hvplot
    conda install -c plotly plotly
    conda install -c plotly plotly_express
    conda install -c conda-forge nodejs
    conda update nodejs

For Jupyter Notebook support:

    conda install "notebook>=5.3" "ipywidgets>=7.5"

For Jupyter Lab support:

    conda install jupyterlab "ipywidgets>=7.5"
    # JupyterLab renderer support
    jupyter labextension install jupyterlab-plotly@4.14.3
    # OPTIONAL: Jupyter widgets extension
    jupyter labextension install @jupyter-widgets/jupyterlab-manager plotlywidget@4.14.3

Node.js v14.17.0 or higher is required to build Jupyter Lab Extensions.
The instructions on how to update Node.js can be found here: How to Easily Update Node.js to the Latest Version
- Cryptocurrency data was obtained using the CryptoCompare API.
- The following columns within the dataset were used in the analysis:

      ['CoinName','Algorithm','IsTrading','ProofType','TotalCoinsMined','TotalCoinSupply']

- The dataset was filtered by:
  - `IsTrading` is `True`
  - `ProofType` is not `"N/A"`
  - `Null` values were removed
  - `TotalCoinsMined` > 0
- The `CoinName` column was isolated.
- Non-numerical values were encoded.

      import pandas as pd
      df_encoded = pd.get_dummies(df, columns=['Algorithm','ProofType'])

- All data was scaled.

      from sklearn.preprocessing import StandardScaler, MinMaxScaler
      df_scaled = StandardScaler().fit_transform(df_encoded)

- Dimensionality was reduced to 3 principal components.

      from sklearn.decomposition import PCA
      pca = PCA(n_components=3)
      crypto_pca = pca.fit_transform(df_scaled)

The resulting dataset presents the following appearance:
| | Algorithm | IsTrading | ProofType | TotalCoinsMined | TotalCoinSupply |
|---|---|---|---|---|---|
| 42 | Scrypt | True | PoW/PoS | 4.199995e+01 | 42 |
| 404 | Scrypt | True | PoW/PoS | 1.055185e+09 | 532000000 |
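
The filtering steps listed earlier could look roughly like the following in pandas. This is a minimal sketch under stated assumptions: the rows are made-up sample data, not values returned by the CryptoCompare API, and the variable name `coin_names` is hypothetical.

```python
import pandas as pd

# Made-up sample rows standing in for the CryptoCompare data.
df = pd.DataFrame(
    {
        "CoinName": ["Coin A", "Coin B", "Coin C", "Coin D", "Coin E"],
        "Algorithm": ["Scrypt", "SHA-256", "X11", "Scrypt", None],
        "IsTrading": [True, False, True, True, True],
        "ProofType": ["PoW/PoS", "PoW", "N/A", "PoW", "PoS"],
        "TotalCoinsMined": [41.99, 1.0e6, 5.0e5, 0.0, 2.0e4],
        "TotalCoinSupply": [42.0, 2.1e7, 1.0e6, 500.0, 9.0e4],
    },
    index=["A", "B", "C", "D", "E"],
)

df = df[df["IsTrading"]]             # keep coins that are trading
df = df[df["ProofType"] != "N/A"]    # drop undefined proof types
df = df.dropna()                     # remove rows with null values
df = df[df["TotalCoinsMined"] > 0]   # keep coins with a positive mined amount

# Isolate CoinName so it can label results later, then drop it from the features.
coin_names = df[["CoinName"]]
df = df.drop(columns="CoinName")
```

In this sample only row `A` survives: `B` is not trading, `C` has an `"N/A"` proof type, `D` has no coins mined, and `E` contains a null value.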
Given a set of observations $(x_1, x_2, \ldots, x_n)$, where each observation is a d-dimensional real vector, k-means partitions the n observations into k (≤ n) sets so as to minimize the within-cluster sum of squares (WCSS), i.e. the variance. (Wiki)

Given an initial set of k means $m_1^{(1)}, \ldots, m_k^{(1)}$, the algorithm proceeds by alternating between two steps:
1. Assignment step: Assign each observation to the cluster with the nearest mean: that with the least squared Euclidean distance. (Mathematically, this means partitioning the observations according to the Voronoi diagram generated by the means.) (Wiki)
2. Update step: Recalculate means (centroids) for the observations assigned to each cluster. (Wiki)
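
The objective and the two steps can also be written out as equations. The notation here follows the standard formulation and is not spelled out in the text above: $S_i$ denotes the i-th cluster, $\mu_i$ the mean of the points in $S_i$, and $t$ the iteration index.

```latex
% Objective: minimize the within-cluster sum of squares (WCSS)
\underset{S}{\operatorname{arg\,min}} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2

% Assignment step: each observation joins the cluster of its nearest mean
S_i^{(t)} = \left\{ x_p : \lVert x_p - m_i^{(t)} \rVert^2 \le \lVert x_p - m_j^{(t)} \rVert^2 \;\; \forall j,\ 1 \le j \le k \right\}

% Update step: each mean becomes the centroid of its cluster
m_i^{(t+1)} = \frac{1}{\lvert S_i^{(t)} \rvert} \sum_{x_j \in S_i^{(t)}} x_j
```

Each observation is assigned to exactly one cluster, even when it is equally close to two means.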
The iterations of the k-means algorithm are visualized below:
The best value for k was determined by calculating inertia for each potential value of k.

    from sklearn.cluster import KMeans

    inertia = []
    k = list(range(1, 11))
    # pcs_df: the principal components (e.g. a DataFrame built from crypto_pca);
    # its construction is not shown here
    for i in k:
        km = KMeans(n_clusters=i, random_state=0)
        km.fit(pcs_df)
        inertia.append(km.inertia_)

From the resulting elbow curve diagram it was determined that k = 4 was optimal.
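
As a rough illustration of how an elbow curve can be tabulated and read, the sketch below runs the same loop on synthetic data. The data comes from scikit-learn's `make_blobs` and is not the cryptocurrency dataset; the names `df_elbow` and `pct_drop` are hypothetical. The commented hvPlot call at the end is one way the curve could be drawn, assuming hvPlot is installed.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 3-feature data with 4 blobs, standing in for the principal components.
X, _ = make_blobs(n_samples=300, centers=4, n_features=3, random_state=0)

inertia = []
k_values = list(range(1, 11))
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(X)
    inertia.append(km.inertia_)

df_elbow = pd.DataFrame({"k": k_values, "inertia": inertia})
# Percentage drop in inertia gained by adding one more cluster;
# the elbow is roughly where this drop becomes small.
df_elbow["pct_drop"] = -df_elbow["inertia"].pct_change() * 100

# With hvPlot installed, the curve could be drawn with:
# import hvplot.pandas
# df_elbow.hvplot.line(x="k", y="inertia", xticks=k_values, title="Elbow Curve")
```

Reading the elbow remains a judgment call; the percentage-drop column is only an aid to the visual inspection.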
The K-means model was then instantiated and fit with k=4.

    # Initialize the K-Means model
    model = KMeans(n_clusters=4, random_state=0)
    # Fit the model
    model.fit(df)
    # Predict clusters
    pred = model.predict(df)

The results were then visualized:
The table of total tradable cryptocurrencies shows that the clusters are spread out along the TotalCoinSupply / TotalCoinsMined slope, seemingly undifferentiated.
However, in the principal component 3D scatter plot, where x, y, and z are Principal Components 1, 2, and 3 respectively, the data is evidently clustered properly. Because each principal component is a combination of the original features, it is difficult to identify the underlying data from the components or to plot the results on raw data axes.
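
One way to get a partial view of what each principal component represents is to inspect the PCA loadings. The sketch below is illustrative only: it uses random numbers rather than the encoded cryptocurrency data, and the feature names `f1` to `f5` and the variables `loadings` and `explained` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random numeric features standing in for the encoded dataset.
rng = np.random.default_rng(0)
features = pd.DataFrame(
    rng.normal(size=(50, 5)),
    columns=["f1", "f2", "f3", "f4", "f5"],
)

scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=3).fit(scaled)

# Rows are components, columns are original features; a large absolute
# value means that feature contributes strongly to the component.
loadings = pd.DataFrame(
    pca.components_,
    index=["PC 1", "PC 2", "PC 3"],
    columns=features.columns,
)

# Share of the total variance captured by each component.
explained = pca.explained_variance_ratio_
```

With many one-hot encoded columns the loadings table gets wide, so sorting each row by absolute value is usually more readable than viewing it whole.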




