Add partitioning for distributed training #7502

ZhengHongming888 · 2023-06-04T00:23:53Z

This PR is one part of the distributed training work for PyG.

The new class (partitioner.py) implements the following:

* The partition algorithm is based on PyG's ClusterData.
* In each partition, LocalGraphStore/LocalFeatureStore are used to initialize the graph and the node/edge feature data.
* Each partition also contains the partition book, i.e. the map from node/edge IDs to partition IDs.
* Each of the above is saved as .pt files, organized into folders covering graph/node_feat/edge_feat/labels/node_map/edge_map.
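To make the partition book idea concrete, here is a minimal sketch in plain Python. It assumes the book is a flat sequence where position `i` holds the partition ID of global node `i`; the PR stores this as a `.pt` file, so the list, the sample values, and the helper names `nodes_by_partition` and `is_cut_edge` are illustrative assumptions, not part of the PR.

```python
from collections import defaultdict

# Hypothetical partition book: node_map[i] is the partition ID of global node i.
node_map = [0, 1, 0, 1, 1, 0]


def nodes_by_partition(node_map):
    """Group global node IDs by the partition that owns them."""
    parts = defaultdict(list)
    for node_id, part_id in enumerate(node_map):
        parts[part_id].append(node_id)
    return dict(parts)


def is_cut_edge(node_map, src, dst):
    """An edge crosses partitions when its endpoints have different owners."""
    return node_map[src] != node_map[dst]


parts = nodes_by_partition(node_map)
```

A lookup like this is what lets a worker decide whether a node's features are local or have to be fetched from another partition.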

The partition folders are laid out as follows:

* homo graph
  output_dir/
  |-- META.json
  |-- node_map.pt
  |-- edge_map.pt
  |-- part0/
  |   |-- graph.pt
  |   |-- node_feats.pt
  |   |-- edge_feats.pt
  |-- part1/
      |-- graph.pt
      |-- node_feats.pt
      |-- edge_feats.pt

* hetero graph
  output_dir/
  |-- META.json
  |-- node_map/
  |   |-- ntype1.pt
  |   |-- ntype2.pt
  |-- edge_map/
  |   |-- etype1.pt
  |   |-- etype2.pt
  |-- part0/
  |   |-- graph.pt
  |   |-- node_feats.pt
  |   |-- edge_feats.pt
  |-- part1/
      |-- graph.pt
      |-- node_feats.pt
      |-- edge_feats.pt
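The homogeneous layout above can be enumerated with a few lines of Python, for example to check that a partition run produced every expected file. This is only a sketch: the function name `expected_homo_files` is hypothetical, and the file names are taken from the tree in this description rather than from the merged code.

```python
from pathlib import PurePosixPath
from typing import List


def expected_homo_files(root: str, num_parts: int) -> List[str]:
    """List the files the homogeneous layout is described as containing."""
    base = PurePosixPath(root)
    # Top-level metadata and the global node/edge partition maps.
    files = [base / "META.json", base / "node_map.pt", base / "edge_map.pt"]
    # One folder per partition, each with graph and feature files.
    for i in range(num_parts):
        part = base / f"part{i}"
        files += [part / "graph.pt", part / "node_feats.pt", part / "edge_feats.pt"]
    return [str(f) for f in files]


paths = expected_homo_files("output_dir", 2)
```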

We also provide two example scripts under the examples/distributed folder that generate homo/hetero graph partitions from ogbn-products/ogbn-mag.

One unit test under the test folder verifies the partition algorithm using FakeDataset/FakeHeteroDataset.

Please let us know if you have any comments. Thanks!


codecov · 2023-06-04T00:30:35Z

Codecov Report

Merging #7502 (f8c98e4) into master (5c72f33) will decrease coverage by 0.29%.
The diff coverage is 99.10%.

❗ Current head f8c98e4 differs from pull request most recent head 89f8eb8. Consider uploading reports for the commit 89f8eb8 to get more accurate results

@@            Coverage Diff             @@
##           master    #7502      +/-   ##
==========================================
- Coverage   91.74%   91.45%   -0.29%     
==========================================
  Files         450      451       +1     
  Lines       25161    25270     +109     
==========================================
+ Hits        23084    23111      +27     
- Misses       2077     2159      +82

Impacted Files	Coverage Δ
torch_geometric/distributed/partition.py	`99.09% <99.09%> (ø)`
torch_geometric/distributed/__init__.py	`100.00% <100.00%> (ø)`

... and 17 files with indirect coverage changes
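The figures in the report can be cross-checked with a little arithmetic. The sketch below recomputes the percentages from the hits and misses in the table. Truncating to two decimals (rather than rounding) is an inference from the numbers shown here, not documented Codecov behavior, and the helper names are illustrative.

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Coverage in percent, truncated (not rounded) to two decimals."""
    return (hits * 10000 // lines) / 100


# Values copied from the coverage diff above.
master = {"hits": 23084, "misses": 2077}
head = {"hits": 23111, "misses": 2159}


def total_lines(report: dict) -> int:
    return report["hits"] + report["misses"]


master_pct = coverage_pct(master["hits"], total_lines(master))
head_pct = coverage_pct(head["hits"], total_lines(head))

new_lines = total_lines(head) - total_lines(master)
new_misses = head["misses"] - master["misses"]
```

Of the 109 added lines, 82 are counted as misses even though the diff coverage is 99.10%, which suggests that most of the drop comes from the indirect coverage changes rather than from the new partition code itself.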


kaixuanliu · 2023-06-07T03:41:18Z

Hi @rusty1s, I added a unit test for graph partitioning, but it fails with `ImportError: 'ClusterData' requires either 'pyg-lib' or 'torch-sparse'`. What should I do to avoid this?

akihironitta · 2023-06-07T10:11:46Z

@kaixuanliu You can use the `withPackage` decorator to skip the test case when those optional packages are not present in the environment:

pytorch_geometric/test/utils/test_scatter.py, lines 25 to 27 in 0f0e0da:

    @withPackage('torch_scatter')
    @pytest.mark.parametrize('reduce', ['sum', 'add', 'mean', 'min', 'max'])
    def test_scatter(reduce, device):
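For readers unfamiliar with this pattern, the following is a minimal standard-library sketch of a skip-if-missing decorator. It is not PyG's `withPackage` implementation. The name `requires_any` and its "at least one package" semantics are assumptions chosen to mirror the "either 'pyg-lib' or 'torch-sparse'" wording of the error.

```python
import importlib.util
import unittest


def has_package(name: str) -> bool:
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


def requires_any(*names: str):
    """Skip the decorated test unless at least one of `names` is installed.

    Hypothetical helper for illustration only.
    """
    available = any(has_package(name) for name in names)
    reason = f"requires one of: {', '.join(names)}"
    return unittest.skipIf(not available, reason)


@requires_any("pyg_lib", "torch_sparse")
def test_partition_sketch():
    # A real test would build the partitions here.
    return "ran"
```

Calling a skipped test raises `unittest.SkipTest`, so the missing dependency shows up as a skip instead of an `ImportError` failure.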


rusty1s

Thank you. I cleaned this up a bit. Especially, Partitioner no longer stores LocalFeatureStore and LocalGraphStore instances, as it is bad practice to pickle arbitrary Python objects. Instead, it saves Python dictionaries now, so please construct LocalFeatureStore and LocalGraphStore instances from this when loading the data from disk.

Otherwise, looks good. I would be in favor of just adding the homogeneous code path though. I am not really 100% confident the heterogeneous code path is correct.
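The save-a-dictionary, rebuild-on-load pattern described in this review can be sketched as follows. Everything here is a stand-in: `FeatureStoreSketch` is a hypothetical class rather than `LocalFeatureStore`, the standard-library `pickle` module replaces whatever the PR uses to write its `.pt` files, and the file name is illustrative.

```python
import os
import pickle
import tempfile


class FeatureStoreSketch:
    """Hypothetical stand-in for a feature store; not the PyG class."""

    def __init__(self):
        self._feats = {}

    def put(self, name, values):
        self._feats[name] = list(values)

    def get(self, name):
        return self._feats[name]

    def to_dict(self):
        # Only built-in types end up on disk, never the store object itself.
        return {"feats": {k: list(v) for k, v in self._feats.items()}}

    @classmethod
    def from_dict(cls, payload):
        store = cls()
        for name, values in payload["feats"].items():
            store.put(name, values)
        return store


def save_partition(store, path):
    with open(path, "wb") as f:
        pickle.dump(store.to_dict(), f)


def load_partition(path):
    with open(path, "rb") as f:
        payload = pickle.load(f)
    # The store is constructed from the dictionary at load time.
    return FeatureStoreSketch.from_dict(payload), payload


store = FeatureStoreSketch()
store.put("x", [1.0, 2.0, 3.0])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "node_feats.pkl")
    save_partition(store, path)
    restored, raw = load_partition(path)
```

Because the file holds only dictionaries and lists, it can still be read if the store class is later renamed or refactored, which is the practical benefit of not pickling the object directly.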

kaixuanliu · 2023-06-16T14:35:03Z

> Thank you. I cleaned this up a bit. Especially, Partitioner no longer stores LocalFeatureStore and LocalGraphStore instances, as it is bad practice to pickle arbitrary Python objects. Instead, it saves Python dictionaries now, so please construct LocalFeatureStore and LocalGraphStore instances from this when loading the data from disk.
>
> Otherwise, looks good. I would be in favor of just adding the homogeneous code path though. I am not really 100% confident the heterogeneous code path is correct.

Thanks Matthias! Replacing LocalFeatureStore/LocalGraphStore with plain Python dictionaries is better practice, since it avoids depending on a customized data structure. For heterogeneous graph partitioning, we have checked the partition output for the ogbn-mag dataset, and we will further validate its correctness as we develop distributed training for heterogeneous graphs.

ZhengHongming888 and others added 6 commits June 2, 2023 09:15

Merge pull request #1 from kaixuanliu/master

f4b6a55

add partition part support for distributed training

add partition part support for distributed training

129d6eb

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

Merge pull request #2 from kaixuanliu/master

ce92fc1

fix bug for put_edge_id

fix bug for put_edge_id

03403bf

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

add example for graph partition; change ogb dataset to fakedataset in…

50238e7

… test case Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

Merge pull request #3 from kaixuanliu/master

dbc2e8f

add example for graph partition; change ogb dataset to fakedataset in…

ZhengHongming888 requested review from wsad1 and rusty1s as code owners June 4, 2023 00:23

rusty1s assigned ZhengHongming888 Jun 4, 2023

rusty1s added the "feature", "0 - Priority P0", and "distributed" labels Jun 4, 2023

rusty1s changed the title ~~Add Partition for distributed training~~ Add partitioning for distributed training Jun 4, 2023

kaixuanliu and others added 2 commits June 7, 2023 19:39

add CHANGELOG; fix unit test error; add train_idx and test_idx partit…

6342dcd

…ion in example Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

Merge pull request #4 from kaixuanliu/dist_partition

b6484ce

add CHANGELOG; fix unit test error; add train_idx and test_idx partit…

rusty1s reviewed Jun 12, 2023

Two review comments on examples/distributed/partition_graph.py (both outdated and resolved).

ZhengHongming888 and others added 11 commits June 13, 2023 19:43

Merge branch 'master' into dist_partition

5402762

adapt to new implementation of LocalFeatureStore

a0ce3db

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

640bcd8

for more information, see https://pre-commit.ci

change back to original design

d805992

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

small change to homo graph

ed20b1b

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

Merge pull request #5 from kaixuanliu/dist_partition

5ea24ca

adapt to new implementation of LocalFeatureStore

Merge branch 'master' into dist_partition

545de3c

update

66ff4d1

update

530df3a

update

94e30b2

delete partition example temporarily

db889fe

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

ZhengHongming888 and others added 3 commits June 15, 2023 20:57

Merge pull request #6 from kaixuanliu/dist_partition

8032400

delete partition example temporarily

update

18edf19

update

f8c98e4

rusty1s approved these changes Jun 16, 2023


update

89f8eb8

rusty1s enabled auto-merge (squash) June 16, 2023 14:09

rusty1s merged commit 8858f50 into pyg-team:master Jun 16, 2023
