Quality Metrics for evaluating the inter-cluster reliability of Multidimensional Projections
We cannot blindly trust embedding results (i.e., the results of multidimensional projections (MDP) such as t-SNE, UMAP, or PCA). Because distortions inherently occur when reducing dimensionality, patterns that look meaningful in a projection can be untrustworthy and can disturb users' comprehension of the original data, leading to interpretation bias. Therefore, it is vital to measure the overall distortion using quantitative metrics or to visualize where and how the distortions occurred in the projection.
So, which aspects of MDP should we evaluate? There are numerous criteria for testing whether an MDP preserves the characteristics of the original high-dimensional data well. Here, we focus on inter-cluster reliability, which represents how well the projection depicts the inter-cluster structure (e.g., the number of clusters, outliers, and the distances between clusters). High inter-cluster reliability is important for MDP, as cluster analysis is one of the most critical tasks in MDP.
However, previous local metrics to evaluate MDP (e.g., Trustworthiness & Continuity, Mean Relative Rank Errors) focused on measuring the preservation of nearest neighbors or naively checked the maintenance of predefined clustering results or classes. These approaches cannot properly measure the reliability of the complex inter-cluster structure.
By repeatedly extracting a random cluster from one space and measuring how well the cluster stays still in the opposite space, Steadiness & Cohesiveness measure inter-cluster reliability. Note that Steadiness measures the extent to which clusters in the projected space form clusters in the original space, and Cohesiveness measures the opposite.
For more details, please refer to our paper.
If you have trouble using Steadiness & Cohesiveness in your project or research, feel free to contact us (hj@hcil.snu.ac.kr). We appreciate all requests about utilizing our metrics!!
Steadiness and Cohesiveness can be installed via pip:
pip install snc
from snc.snc import SNC
...
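# Distance parameters for the default "snn" strategy
parameter = { "k": 'sqrt', "alpha": 0.1 }
metrics = SNC(
    raw=raw_data,    # original high-dimensional data, shape (n_samples, n_dim)
    emb=emb_data,    # projected data, shape (n_samples, n_reduced_dim)
    iteration=300,
    dist_parameter=parameter
)
metrics.fit()        # preprocessing (e.g., distance matrix computation)
print(metrics.steadiness(), metrics.cohesiveness())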
If you installed Steadiness & Cohesiveness outside your project directory:
import sys
sys.path.append("/absolute/path/to/steadiness-cohesiveness")
from snc import SNC
...
Although Steadiness & Cohesiveness have a number of parameters, we recommend using the default setting (described in our paper) and passing only the original data as raw and the projected data as emb.
raw
: the original (raw) high-dimensional data used to generate the multidimensional projection. Should be a 2D array (or a 2D np array) with shape (n_samples, n_dim), where n_samples denotes the number of data points in the dataset and n_dim is the original dimensionality (number of features).
emb
: the projected (embedded) data of raw (i.e., the MDP result). Should be a 2D array (or a 2D np array) with shape (n_samples, n_reduced_dim), where n_reduced_dim denotes the dimensionality of the projection.
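The shape requirements above can be checked before the metric is constructed. The following is a minimal sketch, not part of the snc package: `check_snc_inputs` is a hypothetical helper name, and it only assumes that raw and emb are row-iterable 2D containers (nested lists or numpy arrays).

```python
def check_snc_inputs(raw, emb):
    """Return (n_samples, n_dim, n_reduced_dim) or raise ValueError.

    Hypothetical helper for illustration; it is not provided by snc.
    """
    def shape_2d(data, name):
        try:
            rows = [list(row) for row in data]
        except TypeError:
            raise ValueError(f"{name} must be a 2D array") from None
        if not rows:
            raise ValueError(f"{name} is empty")
        width = len(rows[0])
        if width == 0 or any(len(row) != width for row in rows):
            raise ValueError(f"{name} must be a rectangular 2D array")
        return len(rows), width

    n_raw, n_dim = shape_2d(raw, "raw")
    n_emb, n_reduced_dim = shape_2d(emb, "emb")
    # raw and emb must describe the same data points, row by row
    if n_raw != n_emb:
        raise ValueError(f"raw has {n_raw} rows but emb has {n_emb}")
    return n_raw, n_dim, n_reduced_dim


# Toy data: 4 points, 3 original dimensions, 2 projected dimensions
raw_data = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0], [9.0, 1.0, 2.0]]
emb_data = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
print(check_snc_inputs(raw_data, emb_data))
```

A mismatch in the number of rows between raw and emb is the case most worth catching early, since the two arrays are expected to describe the same points.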
Refer to the API description for more details about hyperparameter settings.
class SNC(
    raw,
    emb,
    iteration=150,
    walk_num_ratio=0.3,
    dist_strategy="snn",
    dist_parameter={},
    dist_function=None,
    cluster_strategy="dbscan",
    snn_knn_matrix=None
)
raw
: Array, shape=(n_samples, n_dim), dtype=float or int
- The original (raw) high-dimensional data used to generate the MDP result.
- n_samples: the number of data points in the dataset / n_dim: the original dimensionality
emb
: Array, shape=(n_samples, n_reduced_dim), dtype=float or int
- The projected (embedded) data of raw.
- n_reduced_dim: the dimensionality of the projection
iteration
: int, (optional, default: 150)
- The number of partial distortion computations (extracting a cluster => evaluating its maintenance in the opposite side).
- A higher iteration generates a more deterministic / reliable result, but computation time increases linearly with iteration.
- We recommend 150 iterations as a minimum.
walk_num_ratio
: float, (optional, default: 0.3)
- The amount of traversal performed to extract a cluster.
- For data with n_samples samples, the total number of traversal steps used to extract a cluster is n_samples * walk_num_ratio.
- The size of the extracted cluster grows as walk_num_ratio increases, but this does not affect the result significantly.
dist_strategy
: string, (optional, default: "snn")
- The way to compute distance.
- We currently support:
  - "snn": utilizes Shared Nearest Neighbor based dissimilarity
  - "euclidean"
  - "predefined": allows a user-defined distance function
  - "inject_snn": injects precomputed knn and snn info
- We highly recommend using the default distance strategy "snn".
- If you set dist_strategy as "predefined", you should also explicitly pass the way to compute distance as the dist_function parameter. The distance between clusters is automatically computed as average linkage.
dist_parameter
: dict, (optional, default: { "alpha": 0.1, "k": 'sqrt' })
- Parameters for distance computations
- If dist_strategy == "snn", the dist_parameter dictionary should hold:
  - "alpha": float, (optional, default: 0.1)
    - The hyperparameter which penalizes low similarity between data points / clusters.
    - A low "alpha" converts smaller similarities to higher dissimilarities (distances).
  - "k": int or string, (optional, default: 'sqrt')
    - The number of nearest neighbors for the k-Nearest Neighbor graph, which is the basis for computing SNN similarity.
    - If k == 'sqrt', k is set as the square root of the number of data points
- If dist_strategy == "euclidean", dist_parameter does nothing.
- If dist_strategy == "predefined", you can freely utilize dist_parameter in dist_function.
  - Note that unlike "snn" and "euclidean", the computation of "predefined" is not parallelized, and thus requires much more time to compute
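To make the roles of "k" and "alpha" more concrete, here is a small sketch under stated assumptions. Neither function is part of snc: `resolve_k` assumes the square root is truncated to an integer, and `to_dissimilarity` assumes the conversion dissimilarity = 1 / (similarity + alpha), which is consistent with the description above but is not confirmed by this document. Check the paper or the source code for the exact definitions.

```python
import math


def resolve_k(k, n_samples):
    """Turn the "k" entry of dist_parameter into an integer (hypothetical helper)."""
    if k == "sqrt":
        # Assumption: the square root of the number of points is truncated
        return max(1, math.isqrt(n_samples))
    return int(k)


def to_dissimilarity(similarity, alpha=0.1):
    """Assumed conversion: dissimilarity = 1 / (similarity + alpha)."""
    return 1.0 / (similarity + alpha)


print(resolve_k("sqrt", 10000))
for alpha in (0.05, 0.1, 0.5):
    # Compare how a zero similarity and a full similarity are mapped
    print(alpha, to_dissimilarity(0.0, alpha), to_dissimilarity(1.0, alpha))
```

Under this assumed conversion, lowering alpha mostly stretches the distances assigned to low similarities, while high similarities are barely affected.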
dist_function
: function, (optional, default: None)
- If you set dist_strategy as "predefined", you should pass the function to calculate distance as a parameter (otherwise the class raises an error).
- The function must take four arguments: two points a and b, their length n_dim, and dist_parameter, which is given by the user.
  - a and b will be 1D numpy arrays with size n_dim
  - n_dim will be an integer
  - The return value should be a single float value which denotes the distance between a and b.
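As an illustration of the signature described above, the following sketch defines a Minkowski distance that reads its order from dist_parameter. The function name `minkowski_distance` and the "p" key are assumptions made for this example; only the argument order (a, b, n_dim, dist_parameter) and the float return value come from the description above.

```python
def minkowski_distance(a, b, n_dim, dist_parameter):
    """User-defined distance following the (a, b, n_dim, dist_parameter) signature.

    "p" is a hypothetical key chosen for this example: p=2 gives the
    Euclidean distance and p=1 the Manhattan distance.
    """
    p = dist_parameter.get("p", 2)
    total = 0.0
    for i in range(n_dim):
        total += abs(float(a[i]) - float(b[i])) ** p
    # Return a single float, as the API description requires
    return float(total ** (1.0 / p))


print(minkowski_distance([0.0, 0.0], [3.0, 4.0], 2, {"p": 2}))
print(minkowski_distance([0.0, 0.0], [3.0, 4.0], 2, {"p": 1}))
```

Following the parameter descriptions above, such a function would be passed as dist_function=minkowski_distance together with dist_strategy="predefined" and, for example, dist_parameter={"p": 1}.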
cluster_strategy
: string, (optional, default: "dbscan")
- Reminder: Steadiness and Cohesiveness measure inter-cluster reliability by checking the maintenance of clusters from one space in the opposite space (refer to Why Steadiness and Cohesiveness).
- This is done by again "clustering" the cluster in the opposite side and measuring how much the cluster is split. cluster_strategy is a hyperparameter that determines the way to conduct such "clustering".
- We currently support:
  - "dbscan": based on a density-based clustering algorithm, mainly utilizing the HDBSCAN library.
  - "x-means": based on the X-Means clustering algorithm
  - "'K'-means": based on the K-Means clustering algorithm, where users can freely change the 'K' value by substituting it with an integer.
    - e.g., 15-means, 20-means, etc.
snn_knn_matrix
: dict, (optional, default: None)
- If you want to inject precomputed snn and knn, use this parameter
- To inject the parameter, you should set dist_strategy as "inject_snn"
- The dictionary should hold:
  - "raw_snn": Array, shape=(n_samples, n_samples), dtype=float, the snn matrix of raw data
  - "raw_knn": Array, shape=(n_samples, n_samples), dtype=float, the knn matrix of raw data
  - "emb_snn": Array, shape=(n_samples, n_samples), dtype=float, the snn matrix of embedded data
  - "emb_knn": Array, shape=(n_samples, n_samples), dtype=float, the knn matrix of embedded data
SNC.fit(record_vis_info=False)
Initializes Steadiness & Cohesiveness: performs preprocessing (e.g., distance matrix computation) and preparation for computing Steadiness and Cohesiveness.
record_vis_info
: bool, (optional, default: False)
- If True, the SNC object records the information needed to make the distortion visualization of Steadiness & Cohesiveness
- The method vis_info() can be called only when this parameter is set as True
- Recording this information incurs extra overhead!!
SNC.steadiness()
SNC.cohesiveness()
Perform the main computation of Steadiness and Cohesiveness and return the results. Note that this step accounts for a large proportion of the computation time.
SNC.vis_info(file_path=None, label=None, k=10)
Can be called only when the record_vis_info parameter is set as True (otherwise raises an error)
file_path
: string, (optional, default: None)
- If file_path is not given as an argument, returns the visualization info
- If file_path is given, the visualization info is saved in the file with the designated name (and path)
- If you only designate the directory (file_path ends with /), the info is saved as info.json inside the directory
label
: Array, (optional, default: None), shape=(n_samples), dtype=int
- 1D array which holds the label (class) information of the dataset
- If None, all points are considered to have an identical label "0"
k
: int, (optional, default: 10)
- The k value for constructing the kNN graph used in the visualization
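The file_path rules above can be sketched as follows. This is only an illustration of the described behavior, not the library's code: `resolve_info_path` and `save_or_return` are hypothetical names, and the dictionary is a placeholder rather than real vis_info output.

```python
import json
import os
import tempfile


def resolve_info_path(file_path):
    """Map a file_path argument to the file that would be written."""
    if file_path.endswith("/"):
        # Only a directory was designated: fall back to info.json inside it
        return file_path + "info.json"
    return file_path


def save_or_return(info, file_path=None):
    """Return info when no path is given, otherwise write it as JSON."""
    if file_path is None:
        return info
    with open(resolve_info_path(file_path), "w") as f:
        json.dump(info, f)
    return None


# Placeholder content, not the real structure produced by vis_info
placeholder_info = {"points": [], "note": "placeholder"}
with tempfile.TemporaryDirectory() as tmp_dir:
    save_or_return(placeholder_info, tmp_dir + "/")
    written = sorted(os.listdir(tmp_dir))
print(written)
```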
This section provides some examples to show how Steadiness and Cohesiveness respond to the projections with diverse qualities and characteristics. For more detailed experiments and evaluations, please refer to our paper.
UMAP has two important hyperparameters: n_neighbors (nearest neighbors) and min_dist. Here, we test how Steadiness and Cohesiveness vary against UMAP embeddings with increasing n_neighbors. n_neighbors denotes the number of nearest neighbors used to formulate the graph representing the original structure of the data. Low n_neighbors values make UMAP focus more on local structure, while high values do the opposite.
The Spheres and Mammoth datasets are used for the test. The Spheres dataset consists of eleven spheres living in a 101-dimensional space: ten spheres, each consisting of 250 points, enclosed by another larger sphere with 2,500 points. The Mammoth dataset consists of 5,000 points constituting the 3D structure of a mammoth skeleton.
The UMAP projections of the Mammoth (upper row) and Spheres (bottom row) datasets with increasing n_neighbors are as follows:
Let's first examine the projections carefully. For the Mammoth dataset, projections with larger n_neighbors preserve the skeleton structure. For the Spheres dataset, you can see that the points from the outer sphere (blue points) escape from the cluster mainly formed by the inner spheres as n_neighbors grows. Therefore, we can intuitively expect that the projections with larger n_neighbors values better preserve the original inter-cluster structure for both datasets.
Now it's time to conduct a test!! We applied Steadiness & Cohesiveness (with the default hyperparameter setting) to the projections. Previous local metrics (Trustworthiness & Continuity, Mean Relative Rank Errors) with k=10 and global metrics (Stress, DTM) were also applied for comparison. For global metrics, we report 1 minus the metric value, so that low-quality projections receive lower values.
As a result, we can see that Steadiness captured well the increase in projection quality caused by the increasing n_neighbors values, for both Mammoth (left) and Spheres (right).
However, Cohesiveness increased only for the Mammoth dataset.
Still, other metrics, except DTM, failed to capture the increase in projection quality.
Then how about min_dist? This time, we generated projections of the Fashion-MNIST dataset with increasing min_dist.
As with n_neighbors, low min_dist values pack points together (focusing on local structure), and high values do the opposite.
The projections and their evaluation results are as follows.
We previously noted that large min_dist values make projections focus more on the global structure. Thus, the decrease in Trustworthiness and MRRE [Missing] is quite natural, as they focus on small local structures around each point. The surprising thing here is that Cohesiveness increases as min_dist increases. This result indicates that the classes of the Fashion-MNIST dataset are not as well separated as the projections with a low min_dist value suggest.
According to our case study (refer to the paper), it is common to perceive that projections with well-divided clusters better reflect the inter-cluster structure; this result shows that such a common perception could lead to a misinterpretation of the inter-cluster structure.
MDP metrics must capture obvious quality degradation or improvement. To test our metrics' ability to capture such changes, we conducted two tests utilizing PCA. In the first experiment, we generated 2D PCA projections of the MNIST dataset by utilizing principal component pairs with decreasing ranks (from (1st, 2nd) to (21st, 22nd)). The projections built from lower-ranked principal components should receive lower scores, as they cannot preserve the variance of the dataset well. For the second experiment, we varied the number of principal components from 2 to 22. Obviously, the quality of the projections should increase when they can utilize more principal components (i.e., lie in a higher dimension).
Unsurprisingly, all metrics, including Steadiness and Cohesiveness, captured the quality change well for both the first (left) and the second (right) experiment.
By visualizing the result of Steadiness and Cohesiveness through the reliability map, we get more insight into how the inter-cluster structure is distorted in MDP. You only need to inject the visualization info file generated by the vis_info method.
Please check the reliability map repository and follow the instructions to visualize Steadiness and Cohesiveness in your web browser.
The reliability map also supports interactions to show Missing Groups — please enjoy it!!
If you have used Steadiness & Cohesiveness for your project and wish to reference it, please cite our TVCG paper.
H. Jeon, H.-K. Ko, J. Jo, Y. Kim, and J. Seo, “Measuring and explaining the inter-cluster reliability of multidimensional projections,” IEEE Transactions on Visualization and Computer Graphics (TVCG, Proc. VIS), 2021. to appear.
@article{jeon21tvcg,
author={Jeon, Hyeon and Ko, Hyung-Kwon and Jo, Jaemin and Kim, Youngtaek and Seo, Jinwook},
journal={IEEE Transactions on Visualization and Computer Graphics (TVCG, Proc. VIS)},
title={Measuring and Explaining the Inter-Cluster Reliability of Multidimensional Projections},
year={2021},
note={to appear.}
}
Hyeon Jeon, Hyung-Kwon Ko, Jaemin Jo, Youngtaek Kim, and Jinwook Seo.
This software is mainly developed / maintained by Human-computer Interaction Laboratory @ Seoul National University.