genieclust is very sensitive to dimensionality of the input matrix #71
Thanks! You've raised some interesting points! I would say that Genie does a pretty good job of identifying these 3 blobs, visible on that 2-dimensional projection of the input data set, when it works on another, 50-dimensional projection generated by PCA. However, that 2-dimensional projection is just one of many (uncountably many, in fact) possible ones. There is rarely "the one and only / true clustering", especially when we deal with high-dimensional data.

Keep in mind that Genie constructs a minimum spanning tree based on (by default) the Euclidean distances between each pair of points, so some data preprocessing (e.g., with PCA) is indeed quite frequently necessary. In particular, here is what the MST (computed in the original space w.r.t. the Euclidean distance) looks like when drawn on that particular 2D projection:

```python
import genieclust

# MST of the pairwise Euclidean distances in the original 1000-dimensional space;
# the second element of the returned pair holds the index pairs defining the edges
mst = genieclust.internal.mst_from_distance(X_hidim)

# draw the MST edges over the 2D UMAP embedding
genieclust.plots.plot_segments(mst[1], ad.obsm["X_umap"])
```

[UMAP plot with the MST edges overlaid]

It may be the case that these points are not that well separable in the 1000-dimensional space after all; hence, Genie will fail to separate these 3 blobs from each other. It just works with what it is fed. However, we can choose a different metric, such as the cosine one:

```python
# the same clustering, but w.r.t. the cosine distance instead of the Euclidean one
g = genieclust.Genie(n_clusters=3, affinity='cosine')
labels = g.fit_predict(X_hidim)
ad.obs['genie_labels'] = labels.astype(str)
sc.pl.umap(ad, color='genie_labels')
```

which yields:

[UMAP plot of the 3 cosine-based clusters]

Also, clustering w.r.t. the cosine similarity with noise-point detection turned on gives 4 clusters:

[UMAP plot of the resulting 4 clusters]
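The exact call behind that noise-detection run isn't shown above. Here is a minimal sketch, assuming (per the genieclust documentation) that a smoothing factor M > 1 is what enables the HDBSCAN*-like noise-point detection; the particular values of n_clusters and M below are guesses, not the settings actually used:

```python
# hypothetical reconstruction: the exact parameters were not given in this thread
g = genieclust.Genie(n_clusters=4, affinity='cosine',
                     M=25)  # M > 1 enables mutual-reachability-based noise detection
labels = g.fit_predict(X_hidim)  # noise points, if any, are labelled -1
ad.obs['genie_labels'] = labels.astype(str)
sc.pl.umap(ad, color='genie_labels')
```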
To sum up, Genie is a stand-alone method that assumes it is being run on a somehow pre-"carved" dataset. If it is applied to raw data, well, the results can be "unrefined" as well (or "novel", if by any chance we've just made a new discovery).

```python
from sklearn.decomposition import PCA

g = genieclust.Genie(n_clusters=3)
labels = g.fit_predict(PCA(20).fit_transform(X_hidim))  # <---- reduce dimensionality first
ad.obs['genie_labels'] = labels.astype(str)
sc.pl.umap(ad, color='genie_labels')
```
Hi,
I learned about this nice method recently and have been exploring it since then. I have a question about some weird results I am getting with some high-dimensional gene expression datasets.
Here I am clustering a matrix of size 2700 × 1000 using default parameters (e.g., `Genie(n_clusters=3)`) and visualizing the results in a 2D UMAP view. I think it is very clear that there are 3 big clusters in this dataset (which is also in concordance with known biology); however, Genie fails to cluster the dataset properly and assigns most points to a single cluster. When I reduce the dimensionality to 50 using PCA, the results make more sense.
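For concreteness, here is a minimal sketch of the two runs being compared (the full reproduction script mentioned at the end of this post is not reproduced here; `X_hidim` denotes the 2700 × 1000 matrix and `ad` the annotated scanpy object, as elsewhere in this thread, and the `genie_full`/`genie_pca50` column names are made up for this illustration):

```python
import genieclust
import scanpy as sc
from sklearn.decomposition import PCA

# run 1: Genie with default parameters on the full 2700 x 1000 matrix
g = genieclust.Genie(n_clusters=3)
ad.obs['genie_full'] = g.fit_predict(X_hidim).astype(str)

# run 2: the same call after reducing the data to 50 principal components
ad.obs['genie_pca50'] = g.fit_predict(PCA(50).fit_transform(X_hidim)).astype(str)

# side-by-side view on the precomputed 2D UMAP embedding
sc.pl.umap(ad, color=['genie_full', 'genie_pca50'])
```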
So I was wondering whether there is a better way to set the method's parameters so that the clustering results make sense regardless of the dimensionality of the input matrix. What do you think? Interestingly, the original dimensionality of gene expression datasets is around 20,000, which is roughly the number of protein-coding genes, so the initial 2700 × 1000 matrix has actually already been subsetted to 1k genes.

In addition, here I am plotting the graph-based clustering result obtained with the Louvain method (which is the default clustering method in our field):

[UMAP plot coloured by the Louvain cluster labels]
This result looks very clear and is also consistent with the biology; I think it would be cool to use it as some sort of ground truth.
Code for reproducing the plots is given below; it requires the scanpy and louvain Python packages.
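The attached script itself isn't shown above; a minimal sketch of the standard scanpy Louvain workflow it presumably follows (all settings here are scanpy defaults or guesses, not necessarily the poster's actual choices) would be:

```python
import scanpy as sc

# 50 principal components as the basis for the kNN graph (an assumed setting)
sc.pp.pca(ad, n_comps=50)

# build the neighbourhood graph that Louvain partitions
sc.pp.neighbors(ad)

# graph-based community detection; requires the `louvain` package
sc.tl.louvain(ad)

# 2D UMAP embedding coloured by the Louvain labels
sc.tl.umap(ad)
sc.pl.umap(ad, color='louvain')
```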