38 changes: 19 additions & 19 deletions docs/hive/README.md
@@ -1,6 +1,6 @@
# HIVE Year 1 Report: Executive Summary

This report is located online at the following URL: <https://gunrock.github.io/docs/hive_year1_summary.html>.
!> This report is located online at the following URL: <https://gunrock.github.io/docs/#/hive/>.

Herein UC Davis presents the three deliverables that it promised to deliver in Year 1:

@@ -12,19 +12,19 @@ Specific notes on applications and scaling follow:


## Application Classification
**[Application Classification](https://gunrock.github.io/docs/hive/hive_application_classification.html)**
**[Application Classification](hive/hive_application_classification)**
Application classification involves a number of dense-matrix operations, which did not make it an obvious candidate for implementation in Gunrock. However, our GPU implementation using the CUDA CUB library shows substantial speedups (10-50x) over the multi-threaded OpenMP implementations.

However, there are two neighbor reduce operations that may benefit from the kind of load balancing implemented in Gunrock. Thus, it would be useful to either expose lightweight wrappers of high-performance Gunrock primitives for easy integration into outside projects _or_ come up with a workflow inside of Gunrock that makes programming applications with lots of non-graph operations straightforward.
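
As a point of reference, the sketch below (plain NumPy on a made-up CSR graph, not Gunrock code) shows the shape of such a neighbor reduce: a per-vertex quantity is gathered from each vertex's neighbors and then summed per segment. On the GPU, the ragged segment lengths are exactly where Gunrock-style load balancing pays off.

```python
import numpy as np

# Toy CSR graph with 4 vertices; all values here are illustrative only.
indptr  = np.array([0, 2, 4, 5, 6])       # row offsets
indices = np.array([1, 2, 0, 3, 0, 1])    # neighbor lists
feature = np.array([1.0, 2.0, 3.0, 4.0])  # one value per vertex

# Gather each vertex's neighbor values, then do a segmented sum per vertex.
gathered = feature[indices]
reduced  = np.add.reduceat(gathered, indptr[:-1])
reduced[np.diff(indptr) == 0] = 0.0       # reduceat quirk: zero out vertices with no neighbors
print(reduced)                            # [5. 5. 1. 2.]
```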

## Geolocation
**[Geolocation](https://gunrock.github.io/docs/hive/hive_geolocation.html)**
**[Geolocation](hive/hive_geolocation)**
Geolocation or geotagging is an interesting parallel problem, because it is among the few that exhibit the dynamic parallelism pattern within the compute. The pattern is as follows: there is parallel compute across nodes; each node has some serial work, and within that serial work there are several parallel math operations. Even without leveraging dynamic parallelism within CUDA (kernel launches within a kernel), Geolocation performs well on the GPU because it mainly requires simple math operations rather than complicated memory movement schemes.

However, the challenge within the application is load balancing this simple compute, such that each processor has roughly the same amount of work. Currently, in Gunrock, we map Geolocation using the `ForAll()` compute operator with optimizations to exit early (performing less work and fewer reads). Even without addressing the load-balancing issue with a complicated balancing scheme, on the HIVE datasets we achieve a 100x speedup with respect to the CPU reference code (implemented using C++ and OpenMP) and a ~533x speedup with respect to the GTUSC implementation. We improve upon the algorithm by avoiding a global gather and a global synchronize, and by using 3x less memory than the GTUSC reference implementation.

## GraphSAGE
**[GraphSAGE](https://gunrock.github.io/docs/hive/hive_graphSage.html)**
**[GraphSAGE](hive/hive_graphSage)**
The vertex embedding part of the GraphSAGE algorithm is implemented in the
Gunrock framework using custom CUDA kernels that exploit block-level
parallelism to reduce running time. For the embedding part alone, the GPU
@@ -40,7 +40,7 @@ Testing on the complete workflow for prediction accuracy and running speed will
be more meaningful.

## GraphSearch
**[GraphSearch](https://gunrock.github.io/docs/hive/hive_graphsearch.html)**
**[GraphSearch](hive/hive_graphsearch)**
Graph search is a relatively minor modification to Gunrock's random walk application, and was straightforward to implement. Though random walks are a "worst case scenario" for GPU memory bandwidth, we still achieve 3--5x speedup over a modified version of the OpenMP reference implementation.

The original OpenMP reference implementation actually ran slower with more threads -- we fixed the bugs, but the benchmarking experience highlights the need for performant and hardened CPU baselines.
@@ -50,7 +50,7 @@ Until recently, Gunrock did not support parallelism _within_ the lambda functions
In an end-to-end graph search application, we'd need to implement the scoring function as well as the graph walk component. For performance, we'd likely want to implement the scoring function on the GPU as well, which makes this a good example of a "Gunrock+X" app, where we'd need to integrate the high-performance graph processing component with arbitrary user code.

## Community Detection (Louvain)
**[Community Detection (Louvain)](https://gunrock.github.io/docs/hive/hive_louvain.html)**
**[Community Detection (Louvain)](hive/hive_louvain)**
The Gunrock implementation uses sort and segmented reduce to implement the
Louvain algorithm, rather than the commonly used hash-table mapping. The GPU
implementation is ~1.5x faster than the OpenMP implementation, and also
@@ -63,56 +63,56 @@ implementation should have moderate scalability across multiple GPUs in an
DGX-1.
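
A minimal sketch of the sort + segmented-reduce pattern mentioned above (made-up edge list, not the Gunrock kernels): key each edge by (source vertex, destination's community), sort the keys, and reduce equal keys to obtain the total edge weight from every vertex into every neighboring community, which is the quantity the Louvain modularity-gain step consumes.

```python
import numpy as np

src       = np.array([0, 0, 1, 1, 2, 2])           # edge sources (toy graph)
dst       = np.array([1, 2, 0, 2, 0, 1])           # edge destinations
weight    = np.array([1., 2., 1., 3., 2., 3.])
community = np.array([0, 0, 1])                    # current community of each vertex

num_comms = community.max() + 1
keys  = src * num_comms + community[dst]            # composite (vertex, community) key
order = np.argsort(keys, kind="stable")
k, w  = keys[order], weight[order]

starts = np.flatnonzero(np.r_[True, k[1:] != k[:-1]])  # segment boundaries after the sort
totals = np.add.reduceat(w, starts)                     # segmented reduce over equal keys
vert, comm = np.divmod(k[starts], num_comms)
for v, c, t in zip(vert, comm, totals):
    print(f"vertex {v} -> community {c}: weight {t}")
```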

## Local Graph Clustering (LGC)
**[Local Graph Clustering (LGC)](https://gunrock.github.io/docs/hive/hive_pr_nibble.html)**
**[Local Graph Clustering (LGC)](hive/hive_pr_nibble)**
This variant of local graph clustering (L1-regularized PageRank via FISTA) is a natural fit for Gunrock's frontier-based programming paradigm. We observe speedups of 2-3 orders of magnitude over the HIVE reference implementation.

The reference implementation of the algorithm was not explicitly written as `advance`/`filter`/`compute` operations, but we were able to quickly determine how to map the operations by using [a lightweight Python implementation of the Gunrock programming API](https://github.com/gunrock/pygunrock/blob/master/apps/pr_nibble.py) as a development environment. Thus, LGC was a good exercise in implementing a non-trivial end-to-end application in Gunrock from scratch.

## Graph Projections
**[Graph Projections](https://gunrock.github.io/docs/hive/hive_proj.html)**
**[Graph Projections](hive/hive_proj)**
Because graph projections have a natural representation in terms of sparse matrix operations, this workflow gave us an opportunity to compare ease of implementation and performance between Gunrock and another UC Davis project, GPU [GraphBLAS](https://github.com/owensgroup/GraphBLAS).

Overall, we found that Gunrock was more flexible and more performant than GraphBLAS, likely due to better load balancing. However, in this case, the GraphBLAS application was substantially easier to program than Gunrock, and it also allowed us to take advantage of some more sophisticated memory allocation methods available in the GraphBLAS cuSPARSE backend. These findings suggest that adding certain commonly used API functions to Gunrock could be a fruitful direction for further work.

## Scan Statistics
**[Scan Statistics](https://gunrock.github.io/docs/hive/hive_scan_statistics.html)**
**[Scan Statistics](hive/hive_scan_statistics)**
Scan statistics applied to static graphs fits perfectly into the Gunrock framework. Using a combination of `ForAll` and Intersection operations, we achieve up to a 45.4x speedup over the parallel OpenMP CPU reference on the small Enron graph (provided as part of the HIVE workflows) and up to a 580x speedup on larger graphs that feature enough computation to saturate the throughput of the GPU.

## Seeded Graph Matching (SGM)
**[Seeded Graph Matching (SGM)](https://gunrock.github.io/docs/hive/hive_sgm.html)**
**[Seeded Graph Matching (SGM)](hive/hive_sgm)**
SGM is a fruitful workflow to optimize, because the existing implementations were not written with performance in mind. By making minor modifications to the algorithm that allow the use of sparse data structures, we enable scaling to larger datasets than previously possible.

The application involves solving a linear assignment problem (LSAP) as a subproblem. Solving these problems on the GPU is an active area of research -- though papers have been written describing high-performance parallel LSAP solvers, reference implementations are not available. We implement a GPU LSAP solver via Bertsekas' auction algorithm, and make it available as a [standalone library](https://github.com/bkj/cbert).
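
For illustration, the following is a dense, CPU-only sketch of Bertsekas' forward auction algorithm for the assignment subproblem. It is not the GPU solver from the linked library; the `auction_lap` name and the `benefit` matrix are made up for the example.

```python
import numpy as np

def auction_lap(benefit, eps=None):
    """Maximize sum of benefit[i, assigned[i]] over perfect matchings."""
    n = benefit.shape[0]
    eps = eps if eps is not None else 1.0 / (n + 1)  # eps < 1/n gives optimality for integer benefits
    price = np.zeros(n)
    owner = -np.ones(n, dtype=int)       # owner[j] = person currently holding object j
    assigned = -np.ones(n, dtype=int)    # assigned[i] = object held by person i
    unassigned = list(range(n))

    while unassigned:
        i = unassigned.pop()
        value = benefit[i] - price                   # net value of every object for bidder i
        j = int(np.argmax(value))
        best = value[j]
        value[j] = -np.inf
        second = value.max()
        price[j] += best - second + eps              # raise the price by the bid increment
        if owner[j] != -1:                           # evict the previous owner of object j
            assigned[owner[j]] = -1
            unassigned.append(owner[j])
        owner[j], assigned[i] = i, j
    return assigned, price

benefit = np.array([[4., 1., 3.],
                    [2., 0., 5.],
                    [3., 2., 2.]])
assignment, _ = auction_lap(benefit)
print(assignment)   # [0 2 1]: person i is matched to object assignment[i]
```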

SGM is an approximate algorithm that minimizes graph adjacency disagreements via the Frank-Wolfe algorithm. Certain uses of the auction algorithm can introduce additional approximation in the gradients of the Frank-Wolfe iterations. An interesting direction for future work would be a rigorous study of the effects of this kind of approximation on a variety of different graph topologies. Understanding those dynamics could allow further scaling beyond what our current implementations can handle.

## Sparse Fused Lasso
**[Sparse Fused Lasso](https://gunrock.github.io/docs/hive/hive_sparse_graph_trend_filtering.html)**
**[Sparse Fused Lasso](hive/hive_sparse_graph_trend_filtering)**
The SFL problem is mainly divided into two parts: computing residual graphs from maxflow and renormalizing the weights of the vertices. Maxflow is performed with the parallelizable lock-free push-relabel algorithm. For renormalization, each vertex has to compute which communities it belongs to and normalize its weight with the other vertices in the same community. SFL iterates on the maxflow and renormalization kernels, with a global synchronization in between them, until convergence. Current analysis shows that maxflow is the bottleneck of the whole workflow, with over 90% of the runtime spent in the maxflow kernels.

GPU SFL runs ~2x slower than the CPU benchmark on the largest dataset provided. On smaller datasets, GPU SFL is much slower because there just isn't enough work to fill up a GPU and leverage the compute we have available. Analysis of the runs on the larger datasets shows that the parametric maxflow on the CPU converges much faster than our parallel push-relabel maxflow algorithm on the GPU. Investigating the parallelization of parametric maxflow is an interesting research challenge.

## Vertex Nomination
**[Vertex Nomination](https://gunrock.github.io/docs/hive/hive_vn.html)**
**[Vertex Nomination](hive/hive_vn)**
The term "vertex nomination" covers a variety of different node ranking schemes that fuse "content" and "context" information. The HIVE reference code implements a "multiple-source shortest path" context scoring function, but uses a very suboptimal algorithm. By using a more efficient algorithm, our serial CPU implementation achieves 1-2 orders of magnitude speedup over the HIVE implementation and our GPU implementation achieves another 1-2 orders of magnitude on top of that. Implementation was straightforward, involving only a small modification to the existing Gunrock SSSP app.

## Scaling analysis for HIVE applications
**[Scaling analysis for HIVE applications](https://gunrock.github.io/docs/hive/hive_scaling.html)**
**[Scaling analysis for HIVE applications](hive/hive_scaling)**
Scaling summary:

| Application | Computation to communication ratio | Scalability | Impl. difficulty |
|---------------------------------|-------------------------------------------------|----------------|------------------|
| Louvain | $E/p : 2V$ | Okay | Hard |
| Graph SAGE | $\sim CF : \min(C, 2p) \cdot 4$ | Good | Easy |
| Random walk | Duplicated graph: infinity \linebreak Distributed graph: $1 : 24$ | Perfect \linebreak Very poor | Trivial \linebreak Easy |
| Random walk | Duplicated graph: infinity <br> Distributed graph: $1 : 24$ | Perfect <br> Very poor | Trivial <br> Easy |
| Graph search: Uniform | $1 : 24$ | Very poor | Easy |
| Graph search: Greedy | Straightforward: $d : 24$ \linebreak Pre-visit: $1:24$ | Poor \linebreak Very poor | Easy \linebreak Easy |
| Graph search: Stochastic greedy | Straightforward: $d : 24$ \linebreak Pre-visit: $\log(d) : 24$ | Poor \linebreak Very poor | Easy \linebreak Easy |
| Geolocation | Explicit movement: $25E/p : 4V$ \linebreak UVM or peer access: $25 : 1$ | Okay \linebreak Good | Easy \linebreak Easy |
| Graph search: Greedy | Straightforward: $d : 24$ <br> Pre-visit: $1:24$ | Poor <br> Very poor | Easy <br> Easy |
| Graph search: Stochastic greedy | Straightforward: $d : 24$ <br> Pre-visit: $\log(d) : 24$ | Poor <br> Very poor | Easy <br> Easy |
| Geolocation | Explicit movement: $25E/p : 4V$ <br> UVM or peer access: $25 : 1$ | Okay <br> Good | Easy <br> Easy |
| Vertex nomination | $E : 8V \cdot \min(d, p)$ | Okay | Easy |
| Scan statistics | Duplicated graph: infinity \linebreak Distributed graph: $\sim (d+a \cdot \log(d)):12$ | Perfect \linebreak Okay | Trivial \linebreak Easy |
| Scan statistics | Duplicated graph: infinity <br> Distributed graph: $\sim (d+a \cdot \log(d)):12$ | Perfect <br> Okay | Trivial <br> Easy |
| Sparse fused lasso | $\sim a:8$ | Less than okay | Hard |
| Graph projection | Duplicated graph : infinity \linebreak Distributed graph : $dE/p + E' : 6E'$ | Perfect \linebreak Okay | Easy \linebreak Easy |
| Graph projection | Duplicated graph : infinity <br> Distributed graph : $dE/p + E' : 6E'$ | Perfect <br> Okay | Easy <br> Easy |
| Local graph clustering | $(6 + d)/p : 4$ | Good | Easy |
| Seeded graph matching | | | |
| Application classification | | | |
12 changes: 0 additions & 12 deletions docs/hive/hive_geolocation.md
@@ -1,15 +1,3 @@
---
title: Geolocation (HIVE)

toc_footers:
- <a href='https://github.com/gunrock/gunrock'>Gunrock&colon; GPU Graph Analytics</a>
- Gunrock &copy; 2018 The Regents of the University of California.

search: true

full_length: true
---

# Geolocation

Infers user locations using the locations (latitude, longitude) of friends through spatial label propagation. Given a graph `G`, geolocation examines each vertex `v`'s neighbors and computes the spatial median of the neighbors' location list. The output is a list of predicted locations for all vertices with unknown locations.
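
A minimal sketch of the per-vertex update, assuming planar coordinates and the Weiszfeld iteration for the spatial (geometric) median; the actual workflow operates on latitude/longitude pairs, and the data below is made up.

```python
import numpy as np

def geometric_median(points, iters=100, tol=1e-7):
    y = points.mean(axis=0)                       # start from the centroid
    for _ in range(iters):
        d = np.linalg.norm(points - y, axis=1)
        d = np.where(d < tol, tol, d)             # avoid division by zero at a data point
        w = 1.0 / d
        y_new = (points * w[:, None]).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < tol:
            break
        y = y_new
    return y

neighbor_locations = np.array([[38.5, -121.7], [38.6, -121.5], [40.0, -120.0]])
print(geometric_median(neighbor_locations))       # predicted location for the vertex
```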
12 changes: 0 additions & 12 deletions docs/hive/hive_graphSage.md
@@ -1,15 +1,3 @@
---
title: HIVE workflow report for GraphSAGE GPU implementation

toc_footers:
- <a href='https://github.com/gunrock/gunrock'>Gunrock&colon; GPU Graph Analytics</a>
- Gunrock &copy; 2018 The Regents of the University of California.

search: true

full_length: true
---

# GraphSAGE

GraphSAGE is a way to fit graphs into a neural network: instead of getting the
12 changes: 0 additions & 12 deletions docs/hive/hive_graphsearch.md
@@ -1,15 +1,3 @@
---
title: Graph Search (HIVE)

toc_footers:
- <a href='https://github.com/gunrock/gunrock'>Gunrock&colon; GPU Graph Analytics</a>
- Gunrock &copy; 2018 The Regents of the University of California.

search: true

full_length: true
---

# GraphSearch

The graph search (GS) workflow is a walk-based method that searches a graph for nodes that score highly on some arbitrary indicator of interest.
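
The sketch below illustrates one step of such a walk on a made-up graph with an arbitrary per-node score, with uniform, greedy, and stochastic-greedy neighbor selection; the function and policy names are hypothetical, not the reference implementation's API.

```python
import random

def graph_search_step(adj, scores, node, policy="uniform"):
    nbrs = adj[node]
    if not nbrs:
        return node                                  # dead end: stay put
    if policy == "uniform":
        return random.choice(nbrs)
    if policy == "greedy":
        return max(nbrs, key=lambda v: scores[v])
    if policy == "stochastic_greedy":                # sample proportionally to score
        return random.choices(nbrs, weights=[scores[v] for v in nbrs])[0]
    raise ValueError(policy)

adj    = {0: [1, 2], 1: [0, 2, 3], 2: [0, 3], 3: [1]}
scores = {0: 0.1, 1: 0.9, 2: 0.4, 3: 0.7}
node = 0
for _ in range(5):                                   # one walker, five steps
    node = graph_search_step(adj, scores, node, policy="greedy")
print("walk ended at node", node)
```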
12 changes: 0 additions & 12 deletions docs/hive/hive_louvain.md
@@ -1,15 +1,3 @@
---
title: HIVE workflow report for Louvain GPU implementation

toc_footers:
- <a href='https://github.com/gunrock/gunrock'>Gunrock&colon; GPU Graph Analytics</a>
- Gunrock &copy; 2018 The Regents of the University of California.

search: true

full_length: true
---

# Community Detection (Louvain)

Community detection in graphs means grouping vertices together, so that those vertices
12 changes: 0 additions & 12 deletions docs/hive/hive_pr_nibble.md
@@ -1,15 +1,3 @@
---
title: Local Graph Clustering (HIVE)

toc_footers:
- <a href='https://github.com/gunrock/gunrock'>Gunrock&colon; GPU Graph Analytics</a>
- Gunrock &copy; 2018 The Regents of the University of California.

search: true

full_length: true
---

# Local Graph Clustering (LGC)

From [Andersen et al.](https://projecteuclid.org/euclid.im/1243430567):
12 changes: 0 additions & 12 deletions docs/hive/hive_proj.md
@@ -1,15 +1,3 @@
---
title: Graph Projections (HIVE)

toc_footers:
- <a href='https://github.com/gunrock/gunrock'>Gunrock&colon; GPU Graph Analytics</a>
- Gunrock &copy; 2018 The Regents of the University of California.

search: true

full_length: true
---

# Graph Projections

Given a (directed) graph `G`, graph projection outputs a graph `H` such that `H` contains edge `(u, v)` iff `G` contains edges `(w, u)` and `(w, v)` for some node `w`. That is, graph projection creates a new graph where nodes are connected iff they are neighbors of the same node in the original graph. Typically, the edge weights of `H` are computed via some (simple) function of the corresponding edge weights of `G`.
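
In sparse-matrix terms (one natural way to view the definition above, shown on an assumed toy graph): if `A` is the adjacency matrix of `G` with `A[w, u] = 1` iff edge `(w, u)` exists, then `H = A^T A` counts, for each pair `(u, v)`, the common predecessors `w`, which is one simple choice of edge weight.

```python
import numpy as np
from scipy.sparse import csr_matrix

rows = np.array([0, 0, 1, 1])          # edges (0,2), (0,3), (1,2), (1,3)
cols = np.array([2, 3, 2, 3])
A = csr_matrix((np.ones(4), (rows, cols)), shape=(4, 4))

H = (A.T @ A).tolil()
H.setdiag(0)                           # drop self-loops, as projections usually do
print(H.toarray())                     # nodes 2 and 3 are connected with weight 2
```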
18 changes: 4 additions & 14 deletions docs/hive/hive_scaling.md
@@ -1,14 +1,3 @@
---
title: Scaling analysis for HIVE applications

toc_footers:
- <a href='https://github.com/gunrock/gunrock'>Gunrock&colon; GPU Graph Analytics</a>
- Gunrock &copy; 2018 The Regents of the University of California.

search: true

full_length: true
---
# Scaling analysis for HIVE applications

The purpose of this study is to understand how the HIVE v0
@@ -27,8 +16,9 @@ partitioning schemes, thus may have different scaling results.
### DGX-1

The DGX-1 with P100 GPUs has 4 NVLink lanes per GPU, connected as
follows. ![DGX1-NVLink](../attachments/scaling/NVLink-DGX1.png "DGX1
NVLink Topology")
follows.

![DGX1-NVLink](_media/attachments/scaling/NVLink-DGX1.png)

Each of the NVLink links runs at 20 GBps per direction, higher than
PCIe 3.0 x16 (16 GBps for the whole GPU). But the topology is not
@@ -107,7 +97,7 @@ driver 410 shows considerably better throughputs.
The DGX-2 system has a very different NVLink topology: the GPUs are
connected by NVSwitches, and all-to-all peer access is available.

![DGX2-NVLink](../attachments/scaling/NVLink-DGX2.png "DGX2 NVLink Topology").
![DGX2-NVLink](_media/attachments/scaling/NVLink-DGX2.png)

At the time of this report, the DGX-2 is barely available, and not
available to us. What we have locally at UC Davis are two Quadro GV100 GPUs
12 changes: 0 additions & 12 deletions docs/hive/hive_scan_statistics.md
@@ -1,15 +1,3 @@
---
title: Template for HIVE Scan Statistics workflow report

toc_footers:
- <a href='https://github.com/gunrock/gunrock'>Gunrock&colon; GPU Graph Analytics</a>
- Gunrock &copy; 2018 The Regents of the University of California.

search: true

full_length: true
---

# Scan Statistics

Scan statistics, as described in [Priebe et al.](http://www.cis.jhu.edu/~parky/CEP-Publications/PCMP-CMOT2005.pdf), is a generic method that computes a statistic over the neighborhood of each node in the graph and looks for anomalies in those statistics. In this workflow, we implement a specific version of scan statistics where we compute the number of edges in the subgraph induced by the one-hop neighborhood of each node $u$ in the graph. It turns out that this statistic is equal to the number of triangles in which node $u$ participates plus the degree of $u$. Thus, we are able to implement scan statistics by making relatively minor modifications to our existing Gunrock triangle counting (TC) application.
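
A small cross-check of that identity on a made-up graph, using networkx for brevity (the Gunrock implementation instead reuses the triangle-counting app):

```python
import networkx as nx

G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)])

triangles = nx.triangles(G)                        # triangles each node participates in
scan_stat = {u: triangles[u] + G.degree(u) for u in G}

# Direct definition: edges inside the subgraph induced by the closed one-hop neighborhood.
for u in G:
    closed_nbhd = set(G[u]) | {u}
    assert G.subgraph(closed_nbhd).number_of_edges() == scan_stat[u]

print(scan_stat)   # node 2 has the largest statistic in this toy graph
```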
12 changes: 0 additions & 12 deletions docs/hive/hive_sgm.md
@@ -1,15 +1,3 @@
---
title: Seeded Graph Matching (HIVE)

toc_footers:
- <a href='https://github.com/gunrock/gunrock'>Gunrock&colon; GPU Graph Analytics</a>
- Gunrock &copy; 2018 The Regents of the University of California.

search: true

full_length: true
---

# Seeded Graph Matching (SGM)

From [Fishkind et al.](https://arxiv.org/pdf/1209.0367.pdf):