Fix bug of Multi-Client src tick output order #6221

chengtbf · 2021-09-09T14:26:56Z

修复一个非常重要的 nn.Graph 多卡运行的 BUG：

每个 rank 都会有自己的 WaitAndSendIds 和 CallbackNotify Op/TaskNode 作为该 rank 的唯一 src 和唯一 sink；同时每个 rank 的 VM 都只会往本 rank 才有的 input/output 、WaitAndSendIds 和 CallbackNotify 发送 JobInstance 作为 RunLazyJob instruction 的实际执行结果，参考：

oneflow/oneflow/core/eager/lazy_job_instruction_type.cpp

Line 110 in 7382e4a

buffer_mgr->Get(GetSourceTickBufferName(job_name))->Send(job_instance);

但是由于我们 Single-Client 迁移到 Multi-Client，整个计算图变成了多 source 的（每个 rank 都自己独立的 src 和 sink），以至于我们在逻辑图上即使连上了正确的控制边，物理图上每个 rank 的 input output 并不完全由自己 rank 的 src 触发，以至于在多卡运行时（尤其是流水并行非对称，且输入输出也不对称），计算图中的本 rank 的 output 已经执行了，本 rank 的虚拟机还没有给 src 发 JobInstance ，此时就会报错 OutputKernel ：buffer 为空的 BUG。

oneflow/oneflow/core/kernel/output_kernel.cpp

Line 45 in 7382e4a

CHECK_NE(buffer_status, kBufferStatusEmpty);

本 PR 通过在 TaskGraph 上，给每个 rank 的 src tick 和 input/output 连控制边的方式强制使 output kernel 至少需要依赖本 rank 的 src tick 才能执行。

同时修复了 input/ouput callback 比 callback notify 晚执行的 BUG

leaves-zwx · 2021-09-10T01:11:08Z

这是一种实现思路，在 task graph 层级处理 src op 的控制连边问题。但感觉按照 xinqi 之前的思路，在 op graph 层级 autotick 机制解决这个问题有一定的好处，如果能把相似的功能都用同一种形式表达是能增加维护性和鲁棒性的。负责控制子图的驱动的一直是 tick 机制，这次 multi-client 引入的新问题如果在 tick 的范畴下解决是最好的，另外在 task graph 去控制 tick 连边其实属于特例，会增加以后的维护成本。

目前 tick 按照功能层级逐级相连，一共4种 tick: SourceTickOp -> TickOp -> SrcSubsetTickOp -> DeviceTickOp，sink 部分则是 DeviceTickOp -> DstSubsetTickOp -> TickOp -> SinkTickOp，还有一个 AccTickOp 负责控制不同频的驱动

SourceTickOp/SinkTickOp: 标识 critical section 的起始结束
TickOp: 插在 SourceTickOp 和 SrcSubsetTickOp 之间，作用我不太清楚
SrcSubsetTickOp: 节点(machine)上的起始结束
DeviceTickOp: 同一个 placement group 的起始

multi-client 的问题感觉是引入了新的 src op，之前 src op 的判别都是通过 REGISTER_AUTO_TICK 注册的，属于静态判断，而在 multi-client 下的 src op 应该是动态的，无法静态判断，感觉应该可以通过搜索找到同一个 placement group 下的所有 src op，然后把它们与 DeviceTickOp 连上控制边，来达到控制不同 rank 上的 op 驱动的目的。

chengtbf · 2021-09-10T06:07:31Z

multi-client 的问题感觉是引入了新的 src op，之前 src op 的判别都是通过 REGISTER_AUTO_TICK 注册的，属于静态判断，而在 multi-client 下的 src op 应该是动态的，无法静态判断，感觉应该可以通过搜索找到同一个 placement group 下的所有 src op，然后把它们与 DeviceTickOp 连上控制边，来达到控制不同 rank 上的 op 驱动的目的。

我推演了一下，好像不能在 OpGraph 上连控制边 😂 ，因为可能会没有效果，见：

oneflow/oneflow/core/graph/task_graph.cpp

Line 399 in 82b903b

void ForEachOpGraphNecessaryCtrlEdge(

我们在 OpGraph 上连的控制边 (src -> dst)，如果在 OpGraph 上可达的话（表示有先后的数据依赖），是不会插入控制边的？而 output 和 input，其实是被 src tick 可达的吧，只是不是被本 rank 上的 src tick 可达（逻辑图可达，物理图其实不可达）

leaves-zwx · 2021-09-10T09:31:10Z

我们在 OpGraph 上连的控制边 (src -> dst)，如果在 OpGraph 上可达的话（表示有先后的数据依赖），是不会插入控制边的？而 output 和 input，其实是被 src tick 可达的吧，只是不是被本 rank 上的 src tick 可达（逻辑图可达，物理图其实不可达）

Got 了，看来暂时没法在 op graph 层面操作这件事情。我仔细想了下，你这种方法应该还是比较安全的，后期不会产生什么隐患。至于机制的统一性，我们可以在实现分布式编译的功能的时候再去设计下这个机制。

…tify

chengtbf · 2021-09-10T18:56:20Z

增加了测试用例，这个分支 Ready 了

python/oneflow/test/graph/test_graph_pipeline_delay.py

strint

看了下plan，和实现匹配，我那个pr的4卡测试合并这个执行也正常

strint · 2021-09-11T02:44:24Z

python/oneflow/test/graph/test_graph_pipeline_delay.py

+            loss = pp_out.mean()
+            loss.backward()
+            y = x + y
+            free_out = y.to_consistent(P1, B)


这个没有上下文的估计看不出在做什么，这里加注释说明下应该会更清楚

github-actions · 2021-09-11T03:42:34Z

Speed stats:

GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 127.4ms (= 6370.3ms / 50, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 138.9ms (= 6945.9ms / 50, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.09 (= 138.9ms / 127.4ms)

OneFlow resnet50 time: 74.2ms (= 3708.6ms / 50, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 84.0ms (= 4199.9ms / 50, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.13 (= 84.0ms / 74.2ms)

OneFlow resnet50 time: 49.7ms (= 2485.3ms / 50, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 57.4ms (= 2868.5ms / 50, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.15 (= 57.4ms / 49.7ms)

OneFlow resnet50 time: 39.8ms (= 1991.0ms / 50, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 51.7ms (= 2585.2ms / 50, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.30 (= 51.7ms / 39.8ms)

OneFlow resnet50 time: 42.1ms (= 2103.6ms / 50, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 43.7ms (= 2185.8ms / 50, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.04 (= 43.7ms / 42.1ms)

OneFlow resnet50 time: 150.5ms (= 7527.2ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 158.9ms (= 7945.8ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.06 (= 158.9ms / 150.5ms)

OneFlow resnet50 time: 100.8ms (= 5041.4ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 101.5ms (= 5074.7ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.01 (= 101.5ms / 100.8ms)

OneFlow resnet50 time: 81.5ms (= 4077.3ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 87.0ms (= 4351.8ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.07 (= 87.0ms / 81.5ms)

OneFlow resnet50 time: 76.2ms (= 3808.9ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 87.5ms (= 4376.7ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 87.5ms / 76.2ms)

OneFlow resnet50 time: 67.4ms (= 3369.9ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.8ms (= 3489.0ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.04 (= 69.8ms / 67.4ms)

github-actions · 2021-09-11T04:40:12Z

Speed stats:

GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 128.4ms (= 6418.3ms / 50, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 138.8ms (= 6938.3ms / 50, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.08 (= 138.8ms / 128.4ms)

OneFlow resnet50 time: 74.7ms (= 3735.6ms / 50, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 85.3ms (= 4264.1ms / 50, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.14 (= 85.3ms / 74.7ms)

OneFlow resnet50 time: 48.0ms (= 2399.2ms / 50, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 61.1ms (= 3056.3ms / 50, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.27 (= 61.1ms / 48.0ms)

OneFlow resnet50 time: 39.5ms (= 1977.4ms / 50, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 44.8ms (= 2240.6ms / 50, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.13 (= 44.8ms / 39.5ms)

OneFlow resnet50 time: 40.3ms (= 2016.0ms / 50, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 40.5ms (= 2022.5ms / 50, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.00 (= 40.5ms / 40.3ms)

OneFlow resnet50 time: 152.8ms (= 7638.9ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 160.9ms (= 8045.4ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.05 (= 160.9ms / 152.8ms)

OneFlow resnet50 time: 97.8ms (= 4887.7ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 105.8ms (= 5287.8ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.08 (= 105.8ms / 97.8ms)

OneFlow resnet50 time: 76.2ms (= 3810.3ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 73.4ms (= 3671.4ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.96 (= 73.4ms / 76.2ms)

OneFlow resnet50 time: 69.4ms (= 3467.9ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 63.8ms (= 3189.2ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.92 (= 63.8ms / 69.4ms)

OneFlow resnet50 time: 67.4ms (= 3370.3ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 62.2ms (= 3108.9ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.92 (= 62.2ms / 67.4ms)

Fix bug of Multi-Client src tick output order

96c5a18

chengtbf added bottleneck blocking another feature/PR automerge bug system interface labels Sep 9, 2021

chengtbf requested review from lixinqi, strint, leaves-zwx and oneflow-ci-bot September 9, 2021 14:26

Merge branch 'master' into dev_cc_fix_src_output_order

a96a643

Merge branch 'master' into dev_cc_fix_src_output_order

54e2d01

strint mentioned this pull request Sep 10, 2021

Fea/pipeline in graph #6105

Merged

3 tasks

chengtbf added 3 commits September 11, 2021 01:19

Merge branch 'master' into dev_cc_fix_src_output_order

464ccef

Add input/output ctrl edge to DstSubTick for order io and callback_no…

4f62da6

…tify

add test scripts

c008b3b

remove note

32aa2c3

strint reviewed Sep 11, 2021

View reviewed changes

python/oneflow/test/graph/test_graph_pipeline_delay.py Show resolved Hide resolved

strint approved these changes Sep 11, 2021

View reviewed changes

chengtbf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot September 11, 2021 02:41

auto format by CI

8e9edd9

strint reviewed Sep 11, 2021

View reviewed changes

leaves-zwx approved these changes Sep 11, 2021

View reviewed changes

add note of sleep

daecbbf

chengtbf removed the request for review from oneflow-ci-bot September 11, 2021 02:58

chengtbf requested a review from oneflow-ci-bot September 11, 2021 02:58

auto format by CI

b063a04

chengtbf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot September 11, 2021 03:13

oneflow-ci-bot removed their request for review September 11, 2021 04:12

chengtbf requested a review from oneflow-ci-bot September 11, 2021 04:24

chengtbf added automerge and removed bottleneck blocking another feature/PR automerge labels Sep 11, 2021

oneflow-ci-bot merged commit 788bd1a into master Sep 11, 2021

oneflow-ci-bot deleted the dev_cc_fix_src_output_order branch September 11, 2021 04:59

chengtbf mentioned this pull request Sep 12, 2021

[Draft ONLY] nn.Graph asymmetric input/output debug runnable #6236

Closed

3 tasks

chengtbf mentioned this pull request Nov 26, 2021

fix pipeline delay ctrl edge between src subset tick and output #6881

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix bug of Multi-Client src tick output order #6221

Fix bug of Multi-Client src tick output order #6221

chengtbf commented Sep 9, 2021 •

edited

Loading

leaves-zwx commented Sep 10, 2021

chengtbf commented Sep 10, 2021 •

edited

Loading

leaves-zwx commented Sep 10, 2021

chengtbf commented Sep 10, 2021

strint left a comment

strint Sep 11, 2021

chengtbf Sep 11, 2021

github-actions bot commented Sep 11, 2021

github-actions bot commented Sep 11, 2021

Fix bug of Multi-Client src tick output order #6221

Fix bug of Multi-Client src tick output order #6221

Conversation

chengtbf commented Sep 9, 2021 • edited Loading

leaves-zwx commented Sep 10, 2021

chengtbf commented Sep 10, 2021 • edited Loading

leaves-zwx commented Sep 10, 2021

chengtbf commented Sep 10, 2021

strint left a comment

Choose a reason for hiding this comment

strint Sep 11, 2021

Choose a reason for hiding this comment

chengtbf Sep 11, 2021

Choose a reason for hiding this comment

github-actions bot commented Sep 11, 2021

github-actions bot commented Sep 11, 2021

chengtbf commented Sep 9, 2021 •

edited

Loading

chengtbf commented Sep 10, 2021 •

edited

Loading