https://arxiv.org/pdf/2010.05337
This is very different from the earlier large-model training papers. The difficulty of distributed GNN training is that the graph is huge and the vertices have complex dependencies on each other, whereas in traditional training the samples are independent of one another. So the challenge is how to do distributed mini-batch training efficiently when the samples are not independent.
The solution doesn't look very complicated. The trainer first issues an RPC asking a sampler to sample and return a sampled subgraph; the trainer then fetches the corresponding node features from the KV store that holds them, and runs a data-parallel training step. (See the architecture figure in the paper.)
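A rough sketch of what the trainer side looks like, based on DGL's distributed API as I remember it (exact class names and signatures vary across DGL versions; `build_model()` and the field names `feat` / `label` are placeholders):

```python
import dgl
import torch as th
import torch.nn.functional as F

def build_model():
    # Placeholder: any DGL GNN works here, e.g. a two-layer GraphSAGE.
    from dgl.nn import SAGEConv
    class SAGE(th.nn.Module):
        def __init__(self, in_feats=100, hid=128, n_classes=47):
            super().__init__()
            self.l1 = SAGEConv(in_feats, hid, 'mean')
            self.l2 = SAGEConv(hid, n_classes, 'mean')
        def forward(self, blocks, x):
            h = F.relu(self.l1(blocks[0], x))
            return self.l2(blocks[1], h)
    return SAGE()

# Assumes the graph was partitioned offline and the graph servers / samplers
# were launched with DGL's launch script; ip_config.txt lists the machines.
dgl.distributed.initialize(ip_config='ip_config.txt')
th.distributed.init_process_group(backend='gloo')

g = dgl.distributed.DistGraph('my_graph')            # connects to the partition servers
train_nids = dgl.distributed.node_split(g.ndata['train_mask'], g.get_partition_book())

sampler = dgl.dataloading.MultiLayerNeighborSampler([10, 25])
dataloader = dgl.dataloading.DistNodeDataLoader(     # sampling requests go out via RPC
    g, train_nids, sampler, batch_size=1024, shuffle=True)

model = th.nn.parallel.DistributedDataParallel(build_model())
opt = th.optim.Adam(model.parameters(), lr=3e-3)

for input_nodes, seeds, blocks in dataloader:
    x = g.ndata['feat'][input_nodes]   # pulls input features from the distributed KV store
    y = g.ndata['label'][seeds]
    loss = F.cross_entropy(model(blocks, x), y)
    opt.zero_grad()
    loss.backward()                    # gradients synchronized across trainers (data parallel)
    opt.step()
```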
The graph is first partitioned into multiple subgraphs stored across the machines, and the vertex/edge features are partitioned along with it. Each machine runs a graph sampler responsible for sampling over the subgraph stored on that machine.
I see, so sampling is essentially a local operation.
Graph Partitioning

The goal of the partitioning algorithm is to minimize the number of cross-partition edges, and it is done once, offline. For every cross-partition edge, the vertices at both ends are copied, so each edge exists exactly once in the system while vertices may be replicated. The replicated vertices are called HALO vertices; the rest are core vertices. (There is an illustration of this in the paper.)
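A toy illustration of the bookkeeping (my own sketch, not the paper's code): assign every edge to exactly one partition, keep a partition's own vertices as core, and replicate the foreign endpoint of each cut edge as a HALO vertex.

```python
from collections import defaultdict

def build_partitions(edges, part_of):
    """edges: list of (u, v); part_of: dict vertex -> partition id.
    Each edge is placed in the partition of its destination (an arbitrary but
    fixed rule here), so every edge exists exactly once; endpoints owned by
    another partition are replicated locally as HALO vertices."""
    part_edges = defaultdict(list)
    core = defaultdict(set)
    halo = defaultdict(set)
    for v, p in part_of.items():
        core[p].add(v)
    for u, v in edges:
        p = part_of[v]                 # owner partition of this edge
        part_edges[p].append((u, v))
        if part_of[u] != p:            # cut edge: replicate the foreign endpoint
            halo[p].add(u)
    return part_edges, core, halo

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)]
part_of = {0: 0, 1: 0, 2: 1, 3: 1}
part_edges, core, halo = build_partitions(edges, part_of)
print(core[1], halo[1])   # core vertices {2, 3}; vertex 1 is replicated as a HALO vertex
```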
One problem with partitioning the graph is load balancing. They formulate it as a multi-constraint partitioning problem; I won't go through the formal statement here.
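For reference, the standard multi-constraint edge-cut formulation looks roughly like this (my paraphrase of the usual METIS-style objective, not the paper's exact statement): each vertex carries several balance weights w_c(v), e.g. one for vertex count, one for its degree, one for whether it is a training vertex, and every weight must be balanced across the k parts while the edge cut is minimized.

```latex
\min_{\pi}\ \bigl|\{(u,v)\in E:\ \pi(u)\neq\pi(v)\}\bigr|
\quad\text{s.t.}\quad
\sum_{v:\,\pi(v)=i} w_c(v)\ \le\ \frac{1+\epsilon}{k}\sum_{v\in V} w_c(v)
\qquad \forall\, i\in\{1,\dots,k\},\ \forall\, c
```

where \pi(v) denotes the partition that vertex v is assigned to.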
After the graph is partitioned, the vertex features and edge features are partitioned along with it, but the features of HALO vertices are not duplicated. As a result, no vertex or edge features are stored more than once.
Distributed KV-Store

Within a machine it uses shared memory for IPC; cross-machine accesses still go over the network.
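A rough sketch of the lookup pattern (illustrative only, not DistDGL's implementation): keys owned by the local partition are read straight out of a shared-memory feature matrix, and the rest go over an RPC to the owning machine, batched into one request per remote partition.

```python
import numpy as np

class ToyKVStoreClient:
    """Toy distributed KV-store client (illustrative only).
    local_feats is assumed to sit in shared memory, so trainers and the local
    server on the same machine read it without copies; remote_pull stands in
    for a real RPC to the partition that owns the requested keys."""
    def __init__(self, my_part, owner_of, local_ids, local_feats, remote_pull):
        self.my_part = my_part                                   # partition id of this machine
        self.owner_of = owner_of                                 # global node id -> owning partition
        self.row_of = {g: i for i, g in enumerate(local_ids)}    # global id -> local row
        self.local_feats = local_feats                           # shared-memory feature matrix
        self.remote_pull = remote_pull                           # fn(part, ids) -> features, over RPC

    def pull(self, ids):
        ids = np.asarray(ids)
        owners = self.owner_of[ids]
        out = np.empty((len(ids), self.local_feats.shape[1]), dtype=self.local_feats.dtype)
        local = owners == self.my_part
        rows = [self.row_of[g] for g in ids[local]]
        out[local] = self.local_feats[rows]                      # shared-memory fast path
        for p in np.unique(owners[~local]):                      # one RPC per remote partition
            sel = owners == p
            out[sel] = self.remote_pull(p, ids[sel])
        return out
```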
Distributed Sampler

The trainer requests the sampler via RPC. Sampling can overlap with the trainer's computation. Neat. This requires the RPC to be asynchronous.
Sampling should only be applied to core vertices.
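One simple way to get that overlap (my own sketch, not DistDGL's code): prefetch the next mini-batch's subgraph while the current step trains; a worker thread stands in for the async RPC here, and the seed batches are assumed to contain only this machine's core vertices.

```python
from concurrent.futures import ThreadPoolExecutor

def train_with_prefetch(seed_batches, sample_rpc, pull_features, train_step):
    """seed_batches: iterable of mini-batches of *core* vertex ids on this machine.
    sample_rpc(seeds) is a blocking call to the sampler; running it in a worker
    thread approximates an async RPC, so sampling of batch i+1 overlaps with
    the training computation of batch i."""
    pool = ThreadPoolExecutor(max_workers=1)
    it = iter(seed_batches)
    try:
        pending = pool.submit(sample_rpc, next(it))
    except StopIteration:
        return
    for next_seeds in it:
        block = pending.result()                       # wait for the prefetched subgraph
        pending = pool.submit(sample_rpc, next_seeds)  # kick off sampling for the next batch
        train_step(block, pull_features(block))        # overlaps with that sampling
    last = pending.result()
    train_step(last, pull_features(last))
    pool.shutdown()
```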
Actually I have a question here: the cross-partition structure of the graph seems hard to learn from, because a sample can extend at most one node beyond the partition (a HALO vertex).
Mini-batch Trainer
OK, that mostly makes sense. Though I don't quite understand why the samples aren't balance-assigned to machines ahead of time?
Linear scalability
Convergence is not affected.
They also ran an ablation study on METIS. It seems load balancing really matters.
There is also a later v2 of the paper: https://arxiv.org/pdf/2112.15345.pdf