Skip to content

puzzlef/louvain-communities-openmp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

40 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Multi-threaded OpenMP-based Louvain algorithm for community detection.

The Louvain method is a greedy modularity optimization based agglomerative algorithm by Blondel et al. that efficiently identifies high-quality communities in a graph. With O(n \log n) average time complexity (n being the number of vertices in the graph), it consists of two phases: local-moving and aggregation. In the local-moving phase, each vertex greedily selects the neighboring community with the highest modularity increase. The aggregation phase merges vertices within a community into a super-vertex. This iterative process produces a dendrogram hierarchy of community memberships, with the top-level representing the final output.



Optimizations

OpenMP schedule

It appears schedule(dynamic, 2048) is the best choice.

See code, output, or sheets.

Limiting number of iterations

It appears allowing a maximum of 20 iterations is ok.

See code, output, or sheets.

Adjusting tolerance drop rate (threshold scaling)

Instead of using a fixed tolerance across all passes of the Louvain algorithm, we can start with a high tolerance and then gradually reduce it. This is called threshold scaling, and it helps minimize runtime as the first pass is usually the most expensive. It appears a drop rate of 10 is ok.

See code, output, or sheets.

Adjusting (initial) tolerance

As mentioned above, we can start with an initial high tolerance. It appears a tolerance of 0.01 is suitable.

See code, output, or sheets.

Adjusting aggregation tolerance

This controls when to consider communities to have converged based on the number of community merges. That is if too few communties merged this pass, we should stop here. Or, |Vsuper-vertices|/|V| >= aggegation tolerance => Converged.

It appears an aggregation tolerance of 0.8 might be the best choice.

See code, output, or sheets.

Vertex pruning

When a vertex changes its community, its marks its neighbors to be processed. Once a vertex has been processed, it is marked as not to be processed. It helps minimize unnecessary computation.

It appears with vertex pruning we get better timings.

See code, output, or sheets.

Finding community vertices for aggregation phase

I originally used a simple vector2d for storing vertices belonging to each community. But this requires memory allocation during the algorithm, which is expensive. Using a prefix sum along with a preallocated CSR helps to to avoid repeated mallocs and frees.

It appears using prefix sum is quite faster than using vector2d.

See code, output, or sheets.

Storing aggregated communities (super-vertex graph)

I also originally used a vector2d based dynamic graph data structure for storing the super-vertex graph. Again, this requires memory allocation during the algorithm. Using two preallocated CSRs along with parallel prefix sum can help here.

It appears using preallocated CSRs is faster.

See code, output, or sheets.

Hashtable design for local-moving/aggregation phases

One can use std::maps (C++ inbuilt map) as the hastable for Louvain. But this has poor performance. So i use a key-list and a full-size values vector (collision-free) we can dramatically improve performance. However if the memory addresses of the hastables are nearby, even if each thread uses its own hash table performance is not so high possibly due to false cache-sharing (Close-KV). However if i ensure memory address are farther away, the perf. improves (Far-KV).

It seems Far-KV has the best performance.

See code, output, or sheets.



Main results

We combine the above optimizations and observe the performance of OpenMP-based Louvain on a number of graphs. Observe that Static Louvain completes in 6.2 s on sk-2005 (a graph with 3.8 billion edges).

See code, output, or sheets.

Larger number of iterations/passes are required for graphs with lower average degree (road networks such as asia_osm and europe_osm, and protein k-mer graphs such as kmer_A2a and kmer_V1r).

Also note how graphs with poor community structure (such as com-LiveJournal, com-Orkut) have a larger time / (n log n) factor.

Finally note how most of the runtime of the algorithm is spent in the local-moving phase. Further, most of the runtime is spent in the first pass of the algorithm, which is the most expensive pass due to the size of the original graph (later passes work on super-vertex graphs).

The input data used for this experiment is available from the SuiteSparse Matrix Collection. This experiment was done with guidance from Prof. Kishore Kothapalli and Prof. Dip Sankar Banerjee.



References




ORG