https://arxiv.org/pdf/2010.05337
This is very different from the earlier large-model training papers. The difficulty of distributed GNN training is that the graph is huge and the vertices have complex dependencies on each other, whereas in traditional training the samples are independent of one another. So the challenge is how to do distributed mini-batch training efficiently when the samples are not independent.
The solution doesn't look very complicated. The trainer first issues an RPC asking a sampler to sample and return a sampled subgraph; the trainer then fetches the corresponding node features from the KV store that holds them, and runs a data-parallel training step. (See the architecture figure in the paper.)
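A rough sketch of what the trainer side looks like, based on DGL's distributed API as I remember it (exact class names and signatures vary across DGL versions; `build_model()` and the field names `feat` / `label` are placeholders):

```python
import dgl
import torch as th
import torch.nn.functional as F

def build_model():
    # Placeholder: any DGL GNN works here, e.g. a two-layer GraphSAGE.
    from dgl.nn import SAGEConv
    class SAGE(th.nn.Module):
        def __init__(self, in_feats=100, hid=128, n_classes=47):
            super().__init__()
            self.l1 = SAGEConv(in_feats, hid, 'mean')
            self.l2 = SAGEConv(hid, n_classes, 'mean')
        def forward(self, blocks, x):
            h = F.relu(self.l1(blocks[0], x))
            return self.l2(blocks[1], h)
    return SAGE()

# Assumes the graph was partitioned offline and the graph servers / samplers
# were launched with DGL's launch script; ip_config.txt lists the machines.
dgl.distributed.initialize(ip_config='ip_config.txt')
th.distributed.init_process_group(backend='gloo')

g = dgl.distributed.DistGraph('my_graph')            # connects to the partition servers
train_nids = dgl.distributed.node_split(g.ndata['train_mask'], g.get_partition_book())

sampler = dgl.dataloading.MultiLayerNeighborSampler([10, 25])
dataloader = dgl.dataloading.DistNodeDataLoader(     # sampling requests go out via RPC
    g, train_nids, sampler, batch_size=1024, shuffle=True)

model = th.nn.parallel.DistributedDataParallel(build_model())
opt = th.optim.Adam(model.parameters(), lr=3e-3)

for input_nodes, seeds, blocks in dataloader:
    x = g.ndata['feat'][input_nodes]   # pulls input features from the distributed KV store
    y = g.ndata['label'][seeds]
    loss = F.cross_entropy(model(blocks, x), y)
    opt.zero_grad()
    loss.backward()                    # gradients synchronized across trainers (data parallel)
    opt.step()
```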
The graph is first partitioned into multiple subgraphs stored across the machines, and the vertex/edge features are partitioned along with it. Each machine runs a graph sampler responsible for sampling over the subgraph stored on that machine.
I see, so sampling is essentially a local operation.
Graph Partitioning

The goal of the partitioning algorithm is to minimize the number of cross-partition edges, and it is done once, offline. For every cross-partition edge, the vertices at both ends are copied, so each edge exists exactly once in the system while vertices may be replicated. The replicated vertices are called HALO vertices; the rest are core vertices. (There is an illustration of this in the paper.)
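A toy illustration of the bookkeeping (my own sketch, not the paper's code): assign every edge to exactly one partition, keep a partition's own vertices as core, and replicate the foreign endpoint of each cut edge as a HALO vertex.

```python
from collections import defaultdict

def build_partitions(edges, part_of):
    """edges: list of (u, v); part_of: dict vertex -> partition id.
    Each edge is placed in the partition of its destination (an arbitrary but
    fixed rule here), so every edge exists exactly once; endpoints owned by
    another partition are replicated locally as HALO vertices."""
    part_edges = defaultdict(list)
    core = defaultdict(set)
    halo = defaultdict(set)
    for v, p in part_of.items():
        core[p].add(v)
    for u, v in edges:
        p = part_of[v]                 # owner partition of this edge
        part_edges[p].append((u, v))
        if part_of[u] != p:            # cut edge: replicate the foreign endpoint
            halo[p].add(u)
    return part_edges, core, halo

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)]
part_of = {0: 0, 1: 0, 2: 1, 3: 1}
part_edges, core, halo = build_partitions(edges, part_of)
print(core[1], halo[1])   # core vertices {2, 3}; vertex 1 is replicated as a HALO vertex
```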
One problem with partitioning the graph is load balancing. They formulate it as a multi-constraint partitioning problem; I won't go through the formal statement here.
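For reference, the standard multi-constraint edge-cut formulation looks roughly like this (my paraphrase of the usual METIS-style objective, not the paper's exact statement): each vertex carries several balance weights w_c(v), e.g. one for vertex count, one for its degree, one for whether it is a training vertex, and every weight must be balanced across the k parts while the edge cut is minimized.

```latex
\min_{\pi}\ \bigl|\{(u,v)\in E:\ \pi(u)\neq\pi(v)\}\bigr|
\quad\text{s.t.}\quad
\sum_{v:\,\pi(v)=i} w_c(v)\ \le\ \frac{1+\epsilon}{k}\sum_{v\in V} w_c(v)
\qquad \forall\, i\in\{1,\dots,k\},\ \forall\, c
```

where \pi(v) denotes the partition that vertex v is assigned to.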
After the graph is partitioned, the vertex features and edge features are partitioned along with it, but the features of HALO vertices are not duplicated. As a result, no vertex or edge features are stored more than once.
Distributed KV-Store

Within a machine it uses shared memory for IPC; cross-machine accesses still go over the network.
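A rough sketch of the lookup pattern (illustrative only, not DistDGL's implementation): keys owned by the local partition are read straight out of a shared-memory feature matrix, and the rest go over an RPC to the owning machine, batched into one request per remote partition.

```python
import numpy as np

class ToyKVStoreClient:
    """Toy distributed KV-store client (illustrative only).
    local_feats is assumed to sit in shared memory, so trainers and the local
    server on the same machine read it without copies; remote_pull stands in
    for a real RPC to the partition that owns the requested keys."""
    def __init__(self, my_part, owner_of, local_ids, local_feats, remote_pull):
        self.my_part = my_part                                   # partition id of this machine
        self.owner_of = owner_of                                 # global node id -> owning partition
        self.row_of = {g: i for i, g in enumerate(local_ids)}    # global id -> local row
        self.local_feats = local_feats                           # shared-memory feature matrix
        self.remote_pull = remote_pull                           # fn(part, ids) -> features, over RPC

    def pull(self, ids):
        ids = np.asarray(ids)
        owners = self.owner_of[ids]
        out = np.empty((len(ids), self.local_feats.shape[1]), dtype=self.local_feats.dtype)
        local = owners == self.my_part
        rows = [self.row_of[g] for g in ids[local]]
        out[local] = self.local_feats[rows]                      # shared-memory fast path
        for p in np.unique(owners[~local]):                      # one RPC per remote partition
            sel = owners == p
            out[sel] = self.remote_pull(p, ids[sel])
        return out
```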
Distributed Sampler

The trainer requests the sampler via RPC. Sampling can overlap with the trainer's computation. Neat. This requires the RPC to be asynchronous.
Sampling should only be applied to core vertices.
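One simple way to get that overlap (my own sketch, not DistDGL's code): prefetch the next mini-batch's subgraph while the current step trains; a worker thread stands in for the async RPC here, and the seed batches are assumed to contain only this machine's core vertices.

```python
from concurrent.futures import ThreadPoolExecutor

def train_with_prefetch(seed_batches, sample_rpc, pull_features, train_step):
    """seed_batches: iterable of mini-batches of *core* vertex ids on this machine.
    sample_rpc(seeds) is a blocking call to the sampler; running it in a worker
    thread approximates an async RPC, so sampling of batch i+1 overlaps with
    the training computation of batch i."""
    pool = ThreadPoolExecutor(max_workers=1)
    it = iter(seed_batches)
    try:
        pending = pool.submit(sample_rpc, next(it))
    except StopIteration:
        return
    for next_seeds in it:
        block = pending.result()                       # wait for the prefetched subgraph
        pending = pool.submit(sample_rpc, next_seeds)  # kick off sampling for the next batch
        train_step(block, pull_features(block))        # overlaps with that sampling
    last = pending.result()
    train_step(last, pull_features(last))
    pool.shutdown()
```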
Actually I have a question here: the cross-partition structure of the graph seems hard to learn from, because a sample can extend at most one node beyond the partition (a HALO vertex).
Mini-batch Trainer
OK, that mostly makes sense. Though I don't quite understand why the samples aren't balance-assigned to machines ahead of time?
Linear scalability
Convergence is not affected.
They also ran an ablation study on METIS. It seems load balancing really matters.
There is also a later v2 of the paper: https://arxiv.org/pdf/2112.15345.pdf