
Feat/logical nccl send recv #8318

Merged: 21 commits from feat/logical_nccl_send_recv into master on Jun 1, 2022

Conversation

@strint (Contributor) commented May 26, 2022:

Insert send/recv in the nccl logical pass to enable memory reuse for special SBP combinations.

@@ -366,7 +386,7 @@ bool TryBuildNcclLogicalOpConf(OperatorConf* ret, const OpNode* src_node, const
std::shared_ptr<Shape> dst_reduced_hierarchy = dst_reduced_parallel_desc->hierarchy();

   if ((*src_reduced_hierarchy) == (*dst_reduced_hierarchy)
-      && src_reduced_nd_sbp == dst_reduced_nd_sbp) {
+      && (*src_reduced_nd_sbp) == (*dst_reduced_nd_sbp)) {
@strint (author):

Fix the NdSbp equality check: compare the dereferenced values rather than the shared_ptr handles.

Reviewer (Contributor):

Surprising that this bug was never hit before 😂
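For illustration, a standalone sketch (not OneFlow code) of the pitfall: operator== on two std::shared_ptr compares the stored pointers, not the pointed-to values, so two distinct NdSbp objects with equal contents compare unequal unless they are dereferenced first.

#include <cassert>
#include <memory>
#include <string>

int main() {
  // Two distinct objects with equal contents stand in for the two NdSbp values.
  auto a = std::make_shared<std::string>("(B, S0)");
  auto b = std::make_shared<std::string>("(B, S0)");
  assert(!(a == b));  // shared_ptr operator==: compares pointers -> not equal
  assert(*a == *b);   // value comparison after dereferencing -> equal
  return 0;
}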

if (!got_nccl && ParseBooleanFromEnv("LOGICAL_SR", false)) {
got_nccl = TryBuildNcclBy2DHierarchyOthers(ret, *src_reduced_nd_sbp, *dst_reduced_nd_sbp,
src_reduced_hierarchy, lbn, scope_symbol_id,
logical_blob_desc);
@strint (author):

Insert nccl logical send recv

// Go through all the ranks while transferring between two nd_sbps with no PartialSum under
// the same placement.
// NOTE: We need to make sure there is no partial sum in the sbp of the producer or consumer.
void DfsTraverseRanks4NdSbp(
@strint (author) commented May 27, 2022:
Borrowed from #7936 by @guo-ran
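For orientation, a minimal standalone sketch of the traversal idea, using a hypothetical DfsTraverse helper; the real DfsTraverseRanks4NdSbp also threads the nd_sbp state and parallel ids through the recursion.

#include <cstdint>
#include <functional>
#include <vector>

// Visit every rank coordinate of an nd device hierarchy in DFS order,
// e.g. hierarchy {2, 2} yields (0,0), (0,1), (1,0), (1,1).
void DfsTraverse(const std::vector<int64_t>& hierarchy, size_t depth,
                 std::vector<int64_t>& coord,
                 const std::function<void(const std::vector<int64_t>&)>& visit) {
  if (depth == hierarchy.size()) {
    visit(coord);  // coord is one rank's nd index in the device mesh
    return;
  }
  for (int64_t i = 0; i < hierarchy[depth]; ++i) {
    coord[depth] = i;
    DfsTraverse(hierarchy, depth + 1, coord, visit);
  }
}

// Usage:
//   std::vector<int64_t> coord(2);
//   DfsTraverse({2, 2}, 0, coord,
//               [](const std::vector<int64_t>& c) { /* plan send/recv for rank c */ });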

@strint strint requested a review from daquexian as a code owner May 27, 2022 09:14
eager_out = x.to_global(sbp=dst_nd_sbp, placement=placement)
test_case.assertTrue(np.array_equal(eager_out.numpy(), x.numpy()))

# bad case of graph: S with P
@strint (author):

These are the known bad cases in graph mode.

// Note: when src_nd_sbp has partial_sum, an out_size buffer is needed to copy and add into out.
buf_count += out_shape->elem_cnt();
}
return buf_count;
Reviewer (Contributor):

This is wrong: it should be multiplied by GetSizeOfDataType(data_type).

@strint (author):

Fixed. Both the unit tests and the integration tests pass now.
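For reference, a sketch of the corrected sizing, with SizeOfDataType as a stand-in for OneFlow's GetSizeOfDataType: the tmp buffer is registered in bytes, so the accumulated element count must be scaled by the element width.

#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for OneFlow's DataType/GetSizeOfDataType pair.
enum class DataType { kFloat, kDouble };
size_t SizeOfDataType(DataType dt) { return dt == DataType::kDouble ? 8 : 4; }

// Convert an accumulated element count into a byte size for the tmp buffer.
size_t TmpBufferSizeInBytes(int64_t buf_elem_cnt, DataType dt) {
  return static_cast<size_t>(buf_elem_cnt) * SizeOfDataType(dt);  // bytes, not elements
}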

@@ -493,6 +496,20 @@ Maybe<void> BoxingCollector::AskSbpCombination(const NdSbp& sbp_producer, const
if (ParseBooleanFromEnv("ONEFLOW_BOXING_DISABLE_MIDDLE_NODE_AND_CHECK", false)) {
return Maybe<void>::Ok();
}
// If compute_cost==false + 2D sbp + same placement + nccl logical + not (p->b),
@strint (author):

Under these conditions the middle node does not kick in; the transfer is handed off to nccl logical send/recv.

@strint (author):

When src contains P and dst contains B, we tried using the middle node instead of send/recv, but some send/recv test cases then failed; disabling nccl use compute stream in the send/recv tests reproduces the problem.

So (p->b) still goes through send/recv here. The remaining issue will be fixed in a follow-up PR.

Reviewer (Contributor):

To spell out the remaining issues, there are two:

  1. The transfer cost goes from O(n+m) to O(nm), where n is the product of the hierarchy components on the upstream P node and m is the product on the downstream B node.
    For example: [2, 2]: (P, S0) -> [2, 2]: (B, B) goes from O(2+4) to O(2*4); with few devices the difference is small.
  2. The final B is not bit-identical across devices; there is a relative error on the order of 1e-15.

@@ -5329,6 +5329,24 @@ def OneFlow__ncclLogicalS2sOp : OneFlow_BaseOp<"_nccl_logical_s2s", [NoSideEffec
let has_nd_sbp_infer_fn = 1;
}

def OneFlow__ncclLogicalSendRecvOp : OneFlow_BaseOp<"_nccl_logical_send_recv", [NoSideEffect, NoGrad, DeclareOpInterfaceMethods<UserOpCompatibleInterface>]> {
@strint (author) commented May 31, 2022:

nccl logical send recv op

bool AlwaysComputeWhenAllOutputsEmpty() const override { return false; }
};

void NcclLogicalSendRecv::Compute(user_op::KernelComputeContext* ctx, user_op::OpKernelState* state,
@strint (author):

nccl logical send recv kernel
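At its core such a kernel issues grouped NCCL point-to-point calls. A minimal sketch of that standard pattern, using only the plain NCCL API; the buffer packing/unpacking and the per-peer counts (which the real kernel derives from the nd_sbp) are assumed to be prepared by the caller, and error checking is elided.

#include <cuda_runtime.h>
#include <nccl.h>
#include <vector>

// Grouped send/recv: ncclGroupStart/End fuses the per-peer operations so
// they can progress concurrently without deadlocking on issue order.
void GroupedSendRecv(const std::vector<const void*>& send_bufs,
                     const std::vector<size_t>& send_cnts,
                     const std::vector<void*>& recv_bufs,
                     const std::vector<size_t>& recv_cnts,
                     ncclDataType_t dtype, ncclComm_t comm, cudaStream_t stream) {
  const int num_peers = static_cast<int>(send_bufs.size());
  ncclGroupStart();
  for (int peer = 0; peer < num_peers; ++peer) {
    if (send_cnts[peer] > 0) { ncclSend(send_bufs[peer], send_cnts[peer], dtype, peer, comm, stream); }
    if (recv_cnts[peer] > 0) { ncclRecv(recv_bufs[peer], recv_cnts[peer], dtype, peer, comm, stream); }
  }
  ncclGroupEnd();
}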


# check eager boxing
eager_out = x.to_global(sbp=dst_nd_sbp, placement=placement)
test_case.assertTrue(np.array_equal(eager_out.numpy(), x.numpy()))
@strint (author) commented May 31, 2022:

Make sure eager mode is correct first, because the graph output below calls eager numpy(), which implicitly performs a dst_nd_sbp -> B transfer.

test_case.assertTrue(np.array_equal(eager_out.numpy(), x.numpy()))

# check graph boxing
flow.boxing.nccl.enable_use_compute_stream(True)
@strint (author):

Turn on the nccl logical (compute stream) switch.

#if flow.env.get_rank() == 0:
# print("src sbp ", src_nd_sbp, ", dst sbp ", dst_nd_sbp)

test_case.assertTrue(np.array_equal(out_np, in_np))
@strint (author):

graph check

@github-actions (bot):
Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot May 31, 2022 15:51
@github-actions (bot):
Speed stats:

@github-actions (bot):
Static analysis with clang failed. PR label automerge has been removed

@strint strint requested review from oneflow-ci-bot and removed request for oneflow-ci-bot May 31, 2022 18:17
@github-actions (bot):
Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot May 31, 2022 19:21
@github-actions (bot):
View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8318/

@github-actions (bot):
CI failed when running job: cuda-module. PR label automerge has been removed

@github-actions (bot):
Speed stats:

@strint (author) commented Jun 1, 2022:

CI failed when running job: cuda-module. PR label automerge has been removed

https://github.com/Oneflow-Inc/oneflow/runs/6678073040?check_suite_focus=true

test_case = <test_module.TestModule testMethod=test_module_setattr>

    @flow.unittest.skip_unless_1n1d()
    def test_module_setattr(test_case):
        class CustomModule(flow.nn.Module):
            def __init__(self, param1, param2):
                super().__init__()
                self.param1 = param1
                self.param2 = param2
    
        param0 = flow.nn.Parameter(flow.Tensor(2, 3))
        param1 = flow.nn.Parameter(flow.Tensor(2, 3))
        param2 = CustomModule(param0, param1)
        m = CustomModule(param1, param2)
        params = list(m.parameters())
        test_case.assertEqual(len(params), 2)
    
        test_case.assertTrue(
            np.allclose(params[0].numpy(), param1.numpy(), atol=1e-4, rtol=1e-4)
        )
        test_case.assertTrue(
>           np.allclose(params[1].numpy(), param0.numpy(), atol=1e-4, rtol=1e-4)
        )
E       AssertionError: False is not true

An unrelated unit test failure that I could not reproduce locally. Re-running CI first.

@strint strint requested review from oneflow-ci-bot and removed request for oneflow-ci-bot June 1, 2022 02:33
@github-actions (bot) commented Jun 1, 2022:

Speed stats:
GPU Name: NVIDIA GeForce GTX 1080 

❌ OneFlow resnet50 time: 130.5ms (= 13050.8ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 147.0ms (= 14700.5ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.13 (= 147.0ms / 130.5ms)

OneFlow resnet50 time: 76.9ms (= 7692.9ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 86.8ms (= 8684.3ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.13 (= 86.8ms / 76.9ms)

OneFlow resnet50 time: 54.7ms (= 10931.0ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 56.7ms (= 11335.9ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.04 (= 56.7ms / 54.7ms)

OneFlow resnet50 time: 41.2ms (= 8244.8ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 43.6ms (= 8713.9ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.06 (= 43.6ms / 41.2ms)

OneFlow resnet50 time: 37.8ms (= 7560.0ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 37.3ms (= 7458.3ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 0.99 (= 37.3ms / 37.8ms)

OneFlow swin dataloader time: 0.242s (= 48.458s / 200, num_workers=1)
PyTorch swin dataloader time: 0.152s (= 30.375s / 200, num_workers=1)
Relative speed: 0.627 (= 0.152s / 0.242s)

OneFlow swin dataloader time: 0.067s (= 13.369s / 200, num_workers=4)
PyTorch swin dataloader time: 0.042s (= 8.393s / 200, num_workers=4)
Relative speed: 0.628 (= 0.042s / 0.067s)

OneFlow swin dataloader time: 0.035s (= 7.040s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.481s / 200, num_workers=8)
Relative speed: 0.636 (= 0.022s / 0.035s)

❌ OneFlow resnet50 time: 146.4ms (= 14640.5ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 173.4ms (= 17342.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.18 (= 173.4ms / 146.4ms)

OneFlow resnet50 time: 96.9ms (= 9690.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 112.5ms (= 11250.2ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 112.5ms / 96.9ms)

OneFlow resnet50 time: 71.1ms (= 14230.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 88.6ms (= 17712.6ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.24 (= 88.6ms / 71.1ms)

OneFlow resnet50 time: 60.4ms (= 12088.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 74.6ms (= 14920.4ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.23 (= 74.6ms / 60.4ms)

OneFlow resnet50 time: 54.4ms (= 10886.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 71.5ms (= 14296.9ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.31 (= 71.5ms / 54.4ms)

@mergify mergify bot merged commit 0e42dca into master Jun 1, 2022
@mergify mergify bot deleted the feat/logical_nccl_send_recv branch June 1, 2022 03:38
bool has_independent_stream_;
std::string stream_name_;
std::unique_ptr<ParallelDesc> parallel_desc_;
mutable std::unique_ptr<Comm> comm_;
Reviewer (Contributor):

Why is this mutable?

@strint (author):

ncclComm_t comm() const { return GetOrCreateComm().comm; }

As I recall, a const accessor like this one mutates comm_.
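For context, a generic sketch of the lazy-init-under-const pattern being described (illustrative names, not the actual OneFlow classes): because the accessor is const, the cached member must be mutable for the first call to be able to create the communicator.

#include <memory>

struct Comm { /* wraps an ncclComm_t in the real code */ };

class KernelCommState {
 public:
  // const accessor that lazily creates the communicator on first use.
  const Comm& GetOrCreateComm() const {
    if (!comm_) { comm_ = std::make_unique<Comm>(); }  // mutation inside a const method
    return *comm_;
  }

 private:
  mutable std::unique_ptr<Comm> comm_;  // mutable so the const accessor can fill it
};

Making the accessor non-const, as agreed below, removes the need for mutable.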

Reviewer (Contributor):

In that case, does it even need to be a const accessor? 😂

Compare with the other nccl logical kernels:

@strint (author):

Yes; after checking, the const can be removed.
