General regularization selection method for sparse undirected graphical models with (distinct) cluster structure based on gap statistic.
For now, use the R-script "gap_com.R".
Gap-com measures the difference between the estimated number of graph clusters and expected number of graph clusters. The expected number of graph clusters can be derived either by
(i) using so called reference distribution (permuted data set).
(ii) using so called reference graph (Erdos-Renyi model, aka. random graph model) where the probability for drawing an edge between two arbitrary vertices is defined from the permuted data.
A generic R implementation of gap-com is straightforward. Assume that the sparse network estimation method corresponds to a function f depending on a parameter lambda which controls the graph sparsity and f returns a list of sparse estimates with increasing sparsity level (the sparsest model first and the densest last), Y is an n x p data matrix, detect.cluster is the community/cluster detection algorithm, sparsity returns the sparsity level of a graph and B is the number of permuted data sets,
> Graphs = f(Y, lambda)
> nlambda = length(lambda)
> k = rep(0, nlambda)
> for(i in 1:nlambda){
+ Clusters = detect.cluster(Graps[i]) # community detection
+ k[i] = length(Clusters) # nmb of estimated clusters
+}
> Expk = matrix(0, B, nlambda)
> for(b in 1:B){
+ if(usePermutation){
+ YNULL = Y[sample(1:nrow(Y)), ]
+ RefGraphs = f(YNULL, lambda)
+ for(i in 1:nlambda){
+ RefClusters = detect.cluster(RefGraphs[i]) # community detection
+ Expk[b, i] = length(RefClusters) # nmb of estimated clusters from reference data
+ }
+ }
+ if(useReferenceGraph){
+ if(b == 1){
+ YNULL = Y[sample(1:nrow(Y)), ]
+ DummyGraphs = f(YNULL, lambda)
+ }
+ for(i in 1:nlambda){
+ RefGraphs = erdos.renyi.game(p, sparsity(DummyGraphs[i]) , type="gnp") # see igraph
+ RefClusters = detect.cluster(RefGraphs[i]) # community detection
+ Expk[b, i] = length(RefClusters) # nmb of estimated clusters from reference graph
+ }
+ }
+}
> Expk = colMeans(Expk) # The expected nmb of clusters under the (i) reference distribution or (ii) reference graph
> Gap_lambda = Expk - k
> GapIndex = which.max(Gap_lambda)
> opt.lambda = lambda[GapIndex]
which returns gap-statistic values (Gap_lambda), the index of the largest regularization parameter which maximizes the gap-statistics (GapIndex) and the largest value of the regularization parameter which maximizes gap-statistics (opt.lambda).
library(huge)
library(igraph)
library(ggplot2)
source("gap_com.R")
##########################################################
# Initialize seed number:
#seed = Sys.time()
#seed = as.integer(seed)
#seed = seed %% 100000
seed = 29734
set.seed(seed)
##########################################################
L = huge.generator(d = 200, n = 500, graph = "cluster", g = 7)
Y = L$data
nlambda = 50
HugeSolutionPath = huge(Y, method = "ct", nlambda = nlambda)
gapPermuteLambda = gap_com(HugeSolutionPath, verbose = T, Plot = T, B = 50, method = "permute_sample") # reference distribution (permuted data matrix rows)
gapERLambda = gap_com(HugeSolutionPath, verbose = T, Plot = T, B = 50, method = "er_sample") # Erdos-Renyi model
huge.plot(L$theta)
title("Ground truth")
In this example, the graphs selected using permutations or the reference graph resampling are not exactly the same but very close to each other. The community structre of the two graphs is identical,
gapPermuteLambda$opt.index
[1] 29
gapERLambda$opt.index
[1] 30
GGapPermute = graph.adjacency(HugeSolutionPath$path[[gapPermuteLambda$opt.index]], mode="undirected")
GGapER = graph.adjacency(HugeSolutionPath$path[[gapERLambda$opt.index]], mode="undirected")
igraph::graph.isomorphic(GGapPermute, GGapER)
[1] FALSE
GapPermuteCommunities = igraph::walktrap.community(GGapPermute)
GapERCommunities = igraph::walktrap.community(GGapER)
compare(GapERCommunities, GapPermuteCommunities, method="adjusted.rand") # closer to one = closer to each other
[1] 1
huge.plot(HugeSolutionPath$path[[gapPermuteLambda$opt.index]])
title("gap-com, Permuted data (pairwise correlation hard thresholding)")
huge.plot(HugeSolutionPath$path[[gapERLambda$opt.index]])
title("gap-com, ER sample (pairwise correlation hard thresholding)")
The identified communities are practically identical to the ground truth clusters,
TrueG = graph.adjacency(L$theta, mode = "undirected", diag = F)
TrueCommunities = igraph::walktrap.community(TrueG)
compare(GapERCommunities, TrueCommunities, method="adjusted.rand") # close to one = better
[1] 1
We provide an implementation which uses parallel computing. We recommend to use it when the number of variables is large,
L = huge.generator(d = 1000, n = 500, graph = "cluster", g = 7)
Y = L$data
HugeSolutionPath = huge(Y, method = "ct", nlambda = nlambda)
# Without parallel:
system.time(GapLambdaER <- gap_com(HugeSolutionPath, B = 50, method = "er_sample"))
user system elapsed
87.27 1.07 88.34
# With parallel. Use the script "gap_com_parallel.R":
library(foreach)
library(parallel)
library(doParallel)
source("gap_com_parallel.R")
registerDoParallel(cores=6)
system.time(GapLambdaERPar <- gap_com_parallel(HugeSolutionPath, B = 50, method = "er_sample"))
user system elapsed
7.17 0.60 24.42
Here the number of clusters is used as a metric in the gap-com statistic which is in line with the original gap statistic but it is totally possible to use some other metric depending on application at hand. Other interesting metrics could be (not in any particular order) the modularity of the graph with respect to some division, transitivity, average(...) hub scores etc.
Gap-com statistic is described in:
Kuismin, M., Dodangeh, F., & Sillanpää, M. J. (2022). Gap-com: general model selection criterion for sparse undirected gene networks with nontrivial community structure, G3: Genes, Genomes, Genetics, 12(2), Article jkab437. https://doi.org/10.1093/g3journal/jkab437
The .zip file "CodeCollection" is a collection of R scripts which can be used to recreate Figures and analysis represented in the manuscript.