
visualize larger matrices #73

Open
cornhundred opened this issue Jun 25, 2020 · 1 comment

Comments

@cornhundred (Contributor)
cornhundred commented Jun 25, 2020

Is your feature request related to a problem? Please describe.
Clustergrammer2 can visualize matrices with ~10 million matrix cells, but slows down when there are too many columns or rows (>20,000). CyTOF data is very 'wide' in that it has ~50 dimensions but we would ideally like to visualize 50-100K columns if possible.

We are able to interactively visualize matrices with 50-100K columns using Clustergrammer2 since the visualization and interaction are handled by the GPU. However, we run into issues with interacting with the dendrogram using JavaScript on the CPU and with running the hierarchical clustering on the back-end:

  • hierarchical clustering takes too long (Python back-end)
  • front-end interactions with the linkage matrix get too slow (JavaScript dendrogram interactions)

In order to speed up the hierarchical clustering step we usually run a first round of K-means clustering and then hierarchically cluster the results (we could similarly slice the hierarchical linkage tree to reduce the resolution of the hierarchical clustering results).
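The two-stage approach above can be sketched with scikit-learn and SciPy. The data, cluster counts, and variable names here are illustrative stand-ins (scaled down from the real 100K points / 5K clusters), not Clustergrammer2's actual code:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from scipy.cluster.hierarchy import linkage

# Illustrative data: 20K "columns" x 50 CyTOF-like dimensions
# (the real use case would be ~100K columns)
rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 50))

# Stage 1: K-means reduces the points to a manageable number of centroids
n_coarse = 500  # ~5,000 in the real use case
km = MiniBatchKMeans(n_clusters=n_coarse, random_state=0, n_init=3).fit(X)

# Stage 2: hierarchically cluster the centroids instead of the raw points
Z = linkage(km.cluster_centers_, method='average')

# The linkage matrix has one merge row per internal node: n_coarse - 1 rows
print(Z.shape)  # (499, 4)
```

Only the centroids enter the O(n²) hierarchical step, so its cost depends on `n_coarse` rather than on the full number of columns.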

Describe the solution you'd like

If we want to visualize the data for peace of mind and are largely satisfied identifying clusters at a resolution of ~20 data points per cluster then we could try the following:

  • hierarchically cluster 100K data points (which will take a long time) and then trim the linkage matrix so that at most ~5,000 clusters remain (despite having ~100K data points)
  • run K-means first (e.g. 5K clusters) before hierarchical clustering, but render the original 100K data points and only allow linkage matrix interactions with the K-means clustering results

In either case we would have to keep a dictionary of which samples belong to which K-means cluster or truncated dendrogram cluster. We will also have to update the manual category code to handle this. These do not seem like very difficult problems to overcome.
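A minimal sketch of the first option, the truncated-dendrogram bookkeeping, using SciPy's `fcluster` to cut the tree at a maximum cluster count and build the sample-to-cluster dictionary (scales and names are illustrative, assuming nothing about Clustergrammer2's internals):

```python
from collections import defaultdict

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# A few thousand points stand in for the real ~100K
rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 50))

# Full hierarchical clustering (in practice this is the slow step)
Z = linkage(X, method='average')

# Cut the tree so at most max_clusters clusters remain (~5,000 in the proposal)
max_clusters = 50
labels = fcluster(Z, t=max_clusters, criterion='maxclust')

# Dictionary of which samples belong to which truncated-dendrogram cluster
cluster_members = defaultdict(list)
for sample_idx, cluster_id in enumerate(labels):
    cluster_members[cluster_id].append(sample_idx)
```

The same `cluster_members` dictionary would work for the second option, with K-means labels in place of `fcluster` output.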

Longer Term Possible Solutions

Other longer-term solutions might include:

  • move more of the logic into the front-end GPU if possible (which is difficult because you need to hack WebGL into doing useful calculations, e.g. https://gpu.rocks/#/) - writing linkage matrix traversal code in WebGL shaders would be difficult, and the traversal does not seem like a parallelizable task
  • move more of the logic to the back-end (via widget JS-PY communication) - this may not help much (assuming Python is not much faster than JavaScript) and will only be applicable when a Python kernel is running
@cornhundred (Contributor, Author)
cornhundred commented Jul 5, 2020

Looking into hdbscan, which is fast for medium dimensional space (50-100 dimensions) but slow for high dimensional space (>1000 dimensions)

lmcinnes/umap#25

UMAP as a pre-processing step for HDBSCAN https://umap-learn.readthedocs.io/en/latest/clustering.html

Louvain clustering https://scikit-network.readthedocs.io/en/latest/reference/hierarchy.html
