Add partitioning for distributed training #7502

Merged
merged 23 commits into from
Jun 16, 2023


@ZhengHongming888 (Contributor) commented Jun 4, 2023

This PR is one part of the overall distributed training work for PyG.

The class in partitioner.py implements the following:

  1. A partitioning algorithm based on PyG's ClusterData.
  2. In each partition, LocalGraphStore/LocalFeatureStore are used to initialize the graph and the node/edge feature data.
  3. Each partition also contains the partition book, i.e. the map from node/edge IDs to partition IDs.
  4. Each of the above is saved as .pt files, organized into folders covering graph, node_feat, edge_feat, labels, node_map, and edge_map.

The partition folders are laid out as follows:
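
The partition book in item 3 can be pictured with a small, dependency-free sketch. The function names below are hypothetical, plain lists stand in for tensors, and the rule that an edge belongs to the partition of its destination node is an assumption made for illustration, not necessarily what this PR does.

```python
def build_partition_book(node_to_part, edges):
    """Build node and edge maps from a node-to-partition assignment.

    node_to_part: list where index = global node id, value = partition id.
    edges: list of (src, dst) pairs where index = global edge id.
    """
    node_map = list(node_to_part)
    # Assumption: an edge is owned by the partition of its destination node.
    edge_map = [node_map[dst] for _, dst in edges]
    return node_map, edge_map


def ids_in_partition(id_map, part_id):
    """Return the global ids that the map assigns to the given partition."""
    return [gid for gid, part in enumerate(id_map) if part == part_id]


# Four nodes split into two partitions, connected in a directed ring.
node_map, edge_map = build_partition_book(
    [0, 0, 1, 1],
    [(0, 1), (1, 2), (2, 3), (3, 0)],
)
```

With these maps, any worker can look up which partition owns a given global node or edge id without loading the other partitions.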

  • homo graph
    output_dir/
    |-- META.json
    |-- node_map.pt
    |-- edge_map.pt
    |-- part0/
    |   |-- graph.pt
    |   |-- node_feats.pt
    |   |-- edge_feats.pt
    |-- part1/
        |-- graph.pt
        |-- node_feats.pt
        |-- edge_feats.pt

  • hetero graph
    output_dir/
    |-- META.json
    |-- node_map/
    |   |-- ntype1.pt
    |   |-- ntype2.pt
    |-- edge_map/
    |   |-- etype1.pt
    |   |-- etype2.pt
    |-- part0/
    |   |-- graph.pt
    |   |-- node_feats.pt
    |   |-- edge_feats.pt
    |-- part1/
        |-- graph.pt
        |-- node_feats.pt
        |-- edge_feats.pt
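
To make the two layouts above concrete, here is a minimal sketch that enumerates the file paths they imply. The helper name `expected_layout` is hypothetical and is not part of this PR. The file names are copied from the listings, and the type names are the same placeholders used there.

```python
from pathlib import PurePosixPath

# Per-partition files, as shown in the listings above.
PART_FILES = ('graph.pt', 'node_feats.pt', 'edge_feats.pt')


def expected_layout(output_dir, num_parts, node_types=None, edge_types=None):
    """List the files a partition directory is expected to contain.

    With node_types=None the homogeneous layout is produced; otherwise the
    heterogeneous layout with one map file per node/edge type.
    """
    root = PurePosixPath(output_dir)
    paths = [root / 'META.json']
    if node_types is None:
        paths += [root / 'node_map.pt', root / 'edge_map.pt']
    else:
        paths += [root / 'node_map' / f'{t}.pt' for t in node_types]
        paths += [root / 'edge_map' / f'{t}.pt' for t in (edge_types or [])]
    for part_id in range(num_parts):
        paths += [root / f'part{part_id}' / name for name in PART_FILES]
    return [str(p) for p in paths]


homo = expected_layout('output_dir', num_parts=2)
hetero = expected_layout('output_dir', 2, ['ntype1', 'ntype2'],
                         ['etype1', 'etype2'])
```

A list like this could be compared against the files actually present on disk as a quick sanity check after partitioning.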

We also provide two example scripts under the example/distributed folder that generate homogeneous and heterogeneous graph partitions from ogbn-products and ogbn-mag, respectively.

A unit test under the /test folder verifies the partitioning algorithm using FakeDataset/FakeHeteroDataset.

Please let us know if you have any comments. Thanks!

ZhengHongming888 and others added 6 commits June 2, 2023 09:15
add partition part support for distributed training
add example for graph partition; change ogb dataset to fakedataset in test case
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

codecov bot commented Jun 4, 2023

Codecov Report

Merging #7502 (f8c98e4) into master (5c72f33) will decrease coverage by 0.29%.
The diff coverage is 99.10%.

❗ Current head f8c98e4 differs from pull request most recent head 89f8eb8. Consider uploading reports for the commit 89f8eb8 to get more accurate results

@@            Coverage Diff             @@
##           master    #7502      +/-   ##
==========================================
- Coverage   91.74%   91.45%   -0.29%     
==========================================
  Files         450      451       +1     
  Lines       25161    25270     +109     
==========================================
+ Hits        23084    23111      +27     
- Misses       2077     2159      +82     
| Impacted Files | Coverage Δ |
|---|---|
| torch_geometric/distributed/partition.py | 99.09% <99.09%> (ø) |
| torch_geometric/distributed/__init__.py | 100.00% <100.00%> (ø) |

... and 17 files with indirect coverage changes


@rusty1s changed the title from "Add Partition for distributed training" to "Add partitioning for distributed training" on Jun 4, 2023

@kaixuanliu (Contributor) commented

Hi @rusty1s, I added a unit test for graph partitioning, but it fails with the following error: ImportError: 'ClusterData' requires either 'pyg-lib' or 'torch-sparse'. What should I do to avoid this?

@akihironitta (Member) commented

@kaixuanliu You can apply the withPackage decorator to skip the test case when those optional packages are not present in the environment:

import pytest
from torch_geometric.testing import withPackage

@withPackage('torch_scatter')
@pytest.mark.parametrize('reduce', ['sum', 'add', 'mean', 'min', 'max'])
def test_scatter(reduce, device):
    ...  # test body omitted
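
For readers unfamiliar with this pattern, the idea behind such a decorator can be sketched with the standard library alone. This is not PyG's implementation: `has_any_package` and the test class are hypothetical names, and the "either package is enough" rule simply mirrors the wording of the ImportError quoted above.

```python
import importlib.util
import unittest


def has_any_package(*names):
    """Return True if at least one of the named modules can be found."""
    for name in names:
        try:
            if importlib.util.find_spec(name) is not None:
                return True
        except (ImportError, ValueError):
            # find_spec can raise for broken or half-initialized modules.
            continue
    return False


# Hypothetical test case: skipped unless one optional backend is installed.
@unittest.skipUnless(
    has_any_package('pyg_lib', 'torch_sparse'),
    "requires either 'pyg-lib' or 'torch-sparse'",
)
class TestPartitioning(unittest.TestCase):
    def test_backend_available(self):
        # A real test would build a small graph and run the partitioner.
        self.assertTrue(has_any_package('pyg_lib', 'torch_sparse'))
```

Skipping instead of failing keeps the test suite usable in environments where the optional dependencies are not installed.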

kaixuanliu and others added 2 commits June 7, 2023 19:39
add CHANGELOG; fix unit test error; add train_idx and test_idx partition in example
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

ZhengHongming888 and others added 11 commits June 13, 2023 19:43
adapt to new implementation of LocalFeatureStore
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

@rusty1s (Member) left a comment

Thank you. I cleaned this up a bit. In particular, Partitioner no longer stores LocalFeatureStore and LocalGraphStore instances, as it is bad practice to pickle arbitrary Python objects. Instead, it now saves Python dictionaries, so please construct LocalFeatureStore and LocalGraphStore instances from these when loading the data from disk.

Otherwise, looks good. I would be in favor of just adding the homogeneous code path though. I am not really 100% confident the heterogeneous code path is correct.
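
The point about saving dictionaries rather than pickled objects can be illustrated with a small sketch. Everything here is hypothetical: `TinyGraphStore` is a stand-in for a store class, the dictionary keys are assumptions, and JSON is used only to keep the example dependency-free (the PR itself writes .pt files).

```python
import json
import os
import tempfile
from dataclasses import dataclass


@dataclass
class TinyGraphStore:
    """Hypothetical stand-in for a graph store built at load time."""
    row: list
    col: list
    num_nodes: int

    @classmethod
    def from_dict(cls, data):
        # Rebuild the object from plain data instead of unpickling it.
        return cls(list(data['row']), list(data['col']),
                   int(data['num_nodes']))


def save_partition(path, row, col, num_nodes):
    """Persist only plain data, never the store object itself."""
    with open(path, 'w') as f:
        json.dump({'row': row, 'col': col, 'num_nodes': num_nodes}, f)


def load_partition(path):
    """Read the plain data back and construct the store from it."""
    with open(path) as f:
        return TinyGraphStore.from_dict(json.load(f))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'graph.json')
    save_partition(path, [0, 1], [1, 2], 3)
    store = load_partition(path)
```

Because the file holds only plain data, the store class can change or move between modules without invalidating partitions already written to disk.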

@rusty1s enabled auto-merge (squash) June 16, 2023 14:09
@rusty1s merged commit 8858f50 into pyg-team:master Jun 16, 2023

@kaixuanliu (Contributor) commented

> Thank you. I cleaned this up a bit. In particular, Partitioner no longer stores LocalFeatureStore and LocalGraphStore instances, as it is bad practice to pickle arbitrary Python objects. Instead, it now saves Python dictionaries, so please construct LocalFeatureStore and LocalGraphStore instances from these when loading the data from disk.
>
> Otherwise, looks good. I would be in favor of just adding the homogeneous code path though. I am not really 100% confident the heterogeneous code path is correct.

Thanks Matthias! Replacing LocalFeatureStore/LocalGraphStore with Python dictionaries is the better practice, since it does not require a custom data structure. For heterogeneous graph partitioning, we have checked the partition output for the ogbn-mag dataset, and we will further validate its correctness during later development of heterogeneous distributed training.
