
visualize larger matrices #73

Open
cornhundred opened this issue Jun 25, 2020 · 1 comment

Comments

@cornhundred (Contributor)
cornhundred commented Jun 25, 2020

Is your feature request related to a problem? Please describe.
Clustergrammer2 can visualize matrices with ~10 million matrix cells, but slows down when there are too many columns or rows (>20,000). CyTOF data is very 'wide' in that it has ~50 dimensions but we would ideally like to visualize 50-100K columns if possible.

We are able to interactively visualize matrices with 50-100K columns using Clustergrammer2 since the visualization and interaction are handled by the GPU. However, we run into issues with interacting with the dendrogram using JavaScript on the CPU and with running the hierarchical clustering on the back-end:

  • hierarchical clustering takes too long (Python back-end)
  • front-end interactions with the linkage matrix get too slow (JavaScript dendrogram interactions)

In order to speed up the hierarchical clustering step we usually run a first round of K-means clustering and then hierarchically cluster the results (we could similarly slice the hierarchical linkage tree to reduce the resolution of the hierarchical clustering results).
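The two-stage approach above can be sketched with scikit-learn and SciPy. The data, cluster counts, and variable names here are illustrative stand-ins (scaled down from the real 100K points / 5K clusters), not Clustergrammer2's actual code:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from scipy.cluster.hierarchy import linkage

# Illustrative data: 20K "columns" x 50 CyTOF-like dimensions
# (the real use case would be ~100K columns)
rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 50))

# Stage 1: K-means reduces the points to a manageable number of centroids
n_coarse = 500  # ~5,000 in the real use case
km = MiniBatchKMeans(n_clusters=n_coarse, random_state=0, n_init=3).fit(X)

# Stage 2: hierarchically cluster the centroids instead of the raw points
Z = linkage(km.cluster_centers_, method='average')

# The linkage matrix has one merge row per internal node: n_coarse - 1 rows
print(Z.shape)  # (499, 4)
```

Only the centroids enter the O(n²) hierarchical step, so its cost depends on `n_coarse` rather than on the full number of columns.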

Describe the solution you'd like

If we want to visualize the data for peace of mind and are largely satisfied identifying clusters at a resolution of ~20 data points per cluster then we could try the following:

  • hierarchically cluster 100K data points (which will take a long time) and then trim the linkage matrix so that at most ~5,000 clusters remain (despite having ~100K data points)
  • run K-means first (e.g. 5K clusters) before hierarchical clustering, but render the original 100K data points and only allow linkage matrix interactions with the K-means clustering results

In either case we would have to keep a dictionary of which samples belong to which K-means cluster or truncated dendrogram cluster. We will also have to update the manual category code to handle this. These do not seem like very difficult problems to overcome.
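A minimal sketch of the first option, the truncated-dendrogram bookkeeping, using SciPy's `fcluster` to cut the tree at a maximum cluster count and build the sample-to-cluster dictionary (scales and names are illustrative, assuming nothing about Clustergrammer2's internals):

```python
from collections import defaultdict

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# A few thousand points stand in for the real ~100K
rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 50))

# Full hierarchical clustering (in practice this is the slow step)
Z = linkage(X, method='average')

# Cut the tree so at most max_clusters clusters remain (~5,000 in the proposal)
max_clusters = 50
labels = fcluster(Z, t=max_clusters, criterion='maxclust')

# Dictionary of which samples belong to which truncated-dendrogram cluster
cluster_members = defaultdict(list)
for sample_idx, cluster_id in enumerate(labels):
    cluster_members[cluster_id].append(sample_idx)
```

The same `cluster_members` dictionary would work for the second option, with K-means labels in place of `fcluster` output.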

Longer Term Possible Solutions

Other longer-term solutions might include:

  • move more of the logic into the front-end GPU if possible (which is difficult because you need to hack WebGL into doing useful calculations, e.g. https://gpu.rocks/#/) - writing linkage matrix traversal code in WebGL shaders would be difficult, and the traversal does not seem like a parallelizable task
  • move more of the logic to the back-end (via widget JS-PY communication) - this may not help much (assuming Python is not much faster than JavaScript) and will only be applicable when a Python kernel is running
@cornhundred (Contributor, Author)
cornhundred commented Jul 5, 2020

Looking into hdbscan, which is fast for medium dimensional space (50-100 dimensions) but slow for high dimensional space (>1000 dimensions)

lmcinnes/umap#25

UMAP as a pre-processing step for HDBSCAN https://umap-learn.readthedocs.io/en/latest/clustering.html

Louvain clustering https://scikit-network.readthedocs.io/en/latest/reference/hierarchy.html
