
[Feat.] NNGraph new eager tensor for new variable created in JobPass #6091

Merged
merged 3 commits into master on Aug 29, 2021

Conversation

@chengtbf (Contributor) commented on Aug 28, 2021

Important feature and bug fix:

NNGraph now supports, for each Variable Conf created inside a JobPass:

  • constructing a new EagerTensor,
  • initializing it,
  • storing that tensor in the SessionContext,
  • and binding it to the corresponding Regst after the Runtime starts.

Verified locally that AMP training works correctly on a single GPU and on multiple GPUs (2 GPUs).

This PR fixes the issue where the DynamicLossScale initial value in AMP training was 0 (it should be 2^30).

Previously, no tensors were created or initialized for the variables introduced in these JobPasses. It just so happened that the initializers of almost all such variables (e.g. BN's model.bn1.weight-momentum) are constant 0, so actual training was not affected. This PR fixes that bug.
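
For reference, a minimal self-contained C++ sketch of this flow (every type and name below, e.g. VariableConf, SessionCtx, Regst and the example variable name, is a hypothetical stand-in, not the actual OneFlow implementation):

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct VariableConf {            // stand-in for a variable op conf created by a JobPass
  std::string name;
  int64_t elem_cnt = 1;
  double constant_value = 0.0;   // e.g. 2^30 for the dynamic loss scale
};

struct EagerTensor { std::vector<double> data; };

struct SessionCtx {              // stand-in for SessionContext: owns the tensors
  std::map<std::string, std::shared_ptr<EagerTensor>> var_name2tensor;
};

struct Regst { const double* mem_ptr = nullptr; };  // stand-in for a Regst

// 1) + 2) construct a new eager tensor and initialize it from the conf
std::shared_ptr<EagerTensor> NewAndInitTensor(const VariableConf& conf) {
  auto tensor = std::make_shared<EagerTensor>();
  tensor->data.assign(conf.elem_cnt, conf.constant_value);
  return tensor;
}

// 3) store the tensor in the session so it outlives graph compilation
void StoreInSession(SessionCtx* session, const std::string& var_name,
                    std::shared_ptr<EagerTensor> tensor) {
  session->var_name2tensor[var_name] = std::move(tensor);
}

// 4) after the runtime starts, bind the tensor's memory to the variable's Regst
void BindToRegst(const SessionCtx& session, const std::string& var_name,
                 Regst* regst) {
  regst->mem_ptr = session.var_name2tensor.at(var_name)->data.data();
}

int main() {
  // hypothetical example variable name, initialized to 2^30
  VariableConf loss_scale{"dynamic_loss_scale", 1, 1073741824.0};
  SessionCtx session;
  StoreInSession(&session, loss_scale.name, NewAndInitTensor(loss_scale));
  Regst regst;
  BindToRegst(session, loss_scale.name, &regst);
  std::cout << "initial loss scale = " << *regst.mem_ptr << std::endl;
  return 0;
}
```

Compiling and running this prints `initial loss scale = 1.07374e+09`, matching the expected 2^30 seen in the local test log below.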

TODO:

  • nn.Graph should support save/load.
  • nn.Graph should also record the tensors of these created variables (e.g. loss scale, good step counter) in its own state dict, and support saving and loading those variables.

@github-actions (Contributor)

CI failed, removing label automerge

@github-actions (Contributor)

Speed stats:
GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 128.1ms (= 6404.0ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 143.4ms (= 7170.2ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.12 (= 143.4ms / 128.1ms)

OneFlow resnet50 time: 74.7ms (= 3737.4ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 82.7ms (= 4135.7ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.11 (= 82.7ms / 74.7ms)

OneFlow resnet50 time: 48.4ms (= 2421.3ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 55.0ms (= 2748.7ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.14 (= 55.0ms / 48.4ms)

OneFlow resnet50 time: 42.7ms (= 2132.9ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 47.7ms (= 2383.4ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.12 (= 47.7ms / 42.7ms)

OneFlow resnet50 time: 40.5ms (= 2026.8ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 43.3ms (= 2163.1ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.07 (= 43.3ms / 40.5ms)

OneFlow resnet50 time: 144.3ms (= 7216.3ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 150.0ms (= 7497.9ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.04 (= 150.0ms / 144.3ms)

OneFlow resnet50 time: 94.0ms (= 4701.2ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 92.6ms (= 4628.5ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 0.98 (= 92.6ms / 94.0ms)

OneFlow resnet50 time: 67.0ms (= 3348.4ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 62.4ms (= 3117.8ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 0.93 (= 62.4ms / 67.0ms)

OneFlow resnet50 time: 62.7ms (= 3133.5ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 47.8ms (= 2392.2ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 0.76 (= 47.8ms / 62.7ms)

OneFlow resnet50 time: 55.5ms (= 2776.7ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 49.1ms (= 2452.8ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 0.88 (= 49.1ms / 55.5ms)

@oneflow-ci-bot oneflow-ci-bot removed their request for review August 28, 2021 19:40
@github-actions (Contributor)

Speed stats:
GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 128.0ms (= 6400.5ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 140.4ms (= 7019.3ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.10 (= 140.4ms / 128.0ms)

OneFlow resnet50 time: 74.5ms (= 3723.9ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 85.6ms (= 4278.8ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.15 (= 85.6ms / 74.5ms)

OneFlow resnet50 time: 47.6ms (= 2378.0ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 58.4ms (= 2918.7ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.23 (= 58.4ms / 47.6ms)

OneFlow resnet50 time: 45.4ms (= 2267.6ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 50.9ms (= 2545.1ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.12 (= 50.9ms / 45.4ms)

OneFlow resnet50 time: 42.3ms (= 2113.7ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 44.3ms (= 2215.2ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.05 (= 44.3ms / 42.3ms)

OneFlow resnet50 time: 142.8ms (= 7139.0ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 150.3ms (= 7514.0ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.05 (= 150.3ms / 142.8ms)

OneFlow resnet50 time: 94.7ms (= 4733.4ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 94.6ms (= 4732.0ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.00 (= 94.6ms / 94.7ms)

OneFlow resnet50 time: 67.4ms (= 3367.5ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 62.3ms (= 3116.7ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 0.93 (= 62.3ms / 67.4ms)

OneFlow resnet50 time: 60.9ms (= 3046.1ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 51.4ms (= 2569.6ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 0.84 (= 51.4ms / 60.9ms)

OneFlow resnet50 time: 60.8ms (= 3038.8ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 50.8ms (= 2540.0ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 0.84 (= 50.8ms / 60.8ms)

@chengtbf (Contributor, Author) commented:

Local test log:

-------------------- end of arguments ---------------------
***** Model Init *****
***** Model Init Finish, time escapled: 0.12776 s *****
 cclog: loss_scale = 1.07374e+09
rank: 1, epoch: 0, iter: 1, job: train, loss: 7.06198, top1: 0.00000, 
rank: 0, epoch: 0, iter: 1, job: train, loss: 7.06198, top1: 0.00000, 
 cclog: loss_scale = 5.36871e+08
rank: 1, epoch: 0, iter: 2, job: train, loss: 6.98842, top1: 0.00000, 
rank: 0, epoch: 0, iter: 2, job: train, loss: 6.98842, top1: 0.00000, 
 cclog: loss_scale = 2.68435e+08
rank: 1, epoch: 0, iter: 3, job: train, loss: 6.98012, top1: 0.00000, 
rank: 0, epoch: 0, iter: 3, job: train, loss: 6.98012, top1: 0.00000, 
 cclog: loss_scale = 1.34218e+08
rank: 1, epoch: 0, iter: 4, job: train, loss: 7.07354, top1: 0.00000, 
rank: 0, epoch: 0, iter: 4, job: train, loss: 7.07354, top1: 0.00000, 
 cclog: loss_scale = 6.71089e+07
rank: 1, epoch: 0, iter: 5, job: train, loss: 7.06426, top1: 0.00000, 
rank: 0, epoch: 0, iter: 5, job: train, loss: 7.06426, top1: 0.00000, 
 cclog: loss_scale = 3.35544e+07
rank: 1, epoch: 0, iter: 6, job: train, loss: 7.06677, top1: 0.00000, 
rank: 0, epoch: 0, iter: 6, job: train, loss: 7.06677, top1: 0.00000, 
 cclog: loss_scale = 1.67772e+07
rank: 1, epoch: 0, iter: 7, job: train, loss: 7.08927, top1: 0.00000, 
rank: 0, epoch: 0, iter: 7, job: train, loss: 7.08927, top1: 0.00000, 
 cclog: loss_scale = 8.38861e+06
rank: 1, epoch: 0, iter: 8, job: train, loss: 7.03656, top1: 0.00000, 
rank: 0, epoch: 0, iter: 8, job: train, loss: 7.03656, top1: 0.00000, 
 cclog: loss_scale = 4.1943e+06
rank: 1, epoch: 0, iter: 9, job: train, loss: 7.13277, top1: 0.00000, 
rank: 0, epoch: 0, iter: 9, job: train, loss: 7.13277, top1: 0.00000, 
 cclog: loss_scale = 2.09715e+06
rank: 1, epoch: 0, iter: 10, job: train, loss: 7.06356, top1: 0.00000, 
rank: 0, epoch: 0, iter: 10, job: train, loss: 7.06356, top1: 0.00000, 
 cclog: loss_scale = 1.04858e+06
rank: 1, epoch: 0, iter: 11, job: train, loss: 7.08941, top1: 0.00000, 
rank: 0, epoch: 0, iter: 11, job: train, loss: 7.08941, top1: 0.00000, 
 cclog: loss_scale = 524288
rank: 1, epoch: 0, iter: 12, job: train, loss: 7.00824, top1: 0.00000, 
rank: 0, epoch: 0, iter: 12, job: train, loss: 7.00824, top1: 0.00000, 
 cclog: loss_scale = 262144
rank: 1, epoch: 0, iter: 13, job: train, loss: 7.03979, top1: 0.00000, 
rank: 0, epoch: 0, iter: 13, job: train, loss: 7.03979, top1: 0.00000, 
 cclog: loss_scale = 131072
rank: 1, epoch: 0, iter: 14, job: train, loss: 7.01718, top1: 0.00000, 
rank: 0, epoch: 0, iter: 14, job: train, loss: 7.01718, top1: 0.00000, 
 cclog: loss_scale = 65536
rank: 1, epoch: 0, iter: 15, job: train, loss: 7.11946, top1: 0.00000, 
rank: 0, epoch: 0, iter: 15, job: train, loss: 7.11946, top1: 0.00000, 
 cclog: loss_scale = 32768
rank: 1, epoch: 0, iter: 16, job: train, loss: 7.06458, top1: 0.00000, 
rank: 0, epoch: 0, iter: 16, job: train, loss: 7.06458, top1: 0.00000, 
 cclog: loss_scale = 16384
rank: 1, epoch: 0, iter: 17, job: train, loss: 7.08910, top1: 0.00000, 
rank: 0, epoch: 0, iter: 17, job: train, loss: 7.08910, top1: 0.00000, 
 cclog: loss_scale = 16384
rank: 1, epoch: 0, iter: 18, job: train, loss: 7.08879, top1: 0.00000, rank: 0, epoch: 0, iter: 18, job: train, loss: 7.08879, top1: 0.00000,

@@ -236,7 +236,12 @@ void PlanUtil::GenMemBlockAndChunkWithVariableOpNames4Plan(
       .op_conf();
   if (!op_conf.has_variable_conf()) { return false; }
   const std::string& var_name = op_conf.name();
-  if (variable_op_names.find(var_name) == variable_op_names.end()) { return false; }
+  if (variable_op_names.find(var_name) == variable_op_names.end()) {
Contributor:

This is where variable ops that have no corresponding eager tensor can be detected, right?

@chengtbf (Contributor, Author):

Yes, which is why I emit a warning here. In theory there should not be any at the moment: every Variable op should find its corresponding eager tensor.
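
A rough self-contained C++ sketch of the check being discussed (the function name and warning text are illustrative, not the actual OneFlow code):

```cpp
#include <iostream>
#include <set>
#include <string>

// variable_op_names: the set of variable ops that already have an eager tensor.
// Returns false and warns when a variable op has no corresponding eager tensor;
// this path is expected to be unreachable today.
bool HasEagerTensor(const std::set<std::string>& variable_op_names,
                    const std::string& var_name) {
  if (variable_op_names.find(var_name) == variable_op_names.end()) {
    std::cerr << "WARNING: variable op `" << var_name
              << "` has no corresponding eager tensor." << std::endl;
    return false;
  }
  return true;
}
```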

} else if (var_conf.initializer().has_constant_int_conf()) {
value = var_conf.initializer().constant_int_conf().value();
} else {
OF_UNIMPLEMENTED();
Contributor:

This branch is unreachable.

@chengtbf (Contributor, Author):

I expect we may need to support more initializers later (e.g. non-constant ones), so I wrote out all the if branches here up front.
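
To illustrate that branch structure, a minimal self-contained C++ sketch (the InitializerConf struct is a hypothetical stand-in for the real proto message; only the branching mirrors the diff):

```cpp
#include <cstdlib>
#include <iostream>

// Hypothetical stand-in for the initializer conf: exactly one kind is set.
struct InitializerConf {
  enum class Kind { kConstant, kConstantInt, kOther } kind;
  double constant_value = 0.0;
  long long constant_int_value = 0;
};

double InitialValueOf(const InitializerConf& conf) {
  if (conf.kind == InitializerConf::Kind::kConstant) {
    return conf.constant_value;
  } else if (conf.kind == InitializerConf::Kind::kConstantInt) {
    return static_cast<double>(conf.constant_int_value);
  } else {
    // Mirrors OF_UNIMPLEMENTED(): currently unreachable, but kept so that a
    // future non-constant initializer fails loudly instead of silently.
    std::cerr << "unimplemented initializer kind" << std::endl;
    std::abort();
  }
}
```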

@strint (Contributor) left a comment:

lgtm

@github-actions (Contributor)

Speed stats:
GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 127.9ms (= 6396.1ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 140.8ms (= 7040.6ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.10 (= 140.8ms / 127.9ms)

OneFlow resnet50 time: 74.6ms (= 3729.4ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 84.3ms (= 4215.6ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.13 (= 84.3ms / 74.6ms)

OneFlow resnet50 time: 49.5ms (= 2474.1ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 56.0ms (= 2801.8ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.13 (= 56.0ms / 49.5ms)

OneFlow resnet50 time: 42.0ms (= 2101.7ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 46.4ms (= 2319.8ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.10 (= 46.4ms / 42.0ms)

OneFlow resnet50 time: 38.0ms (= 1902.1ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 42.5ms (= 2123.8ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.12 (= 42.5ms / 38.0ms)

OneFlow resnet50 time: 142.8ms (= 7138.9ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 149.0ms (= 7450.1ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.04 (= 149.0ms / 142.8ms)

OneFlow resnet50 time: 90.7ms (= 4536.9ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 93.2ms (= 4660.8ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.03 (= 93.2ms / 90.7ms)

OneFlow resnet50 time: 68.0ms (= 3402.2ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 65.3ms (= 3266.7ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 0.96 (= 65.3ms / 68.0ms)

OneFlow resnet50 time: 58.5ms (= 2927.5ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 52.4ms (= 2622.0ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 0.90 (= 52.4ms / 58.5ms)

OneFlow resnet50 time: 58.7ms (= 2933.9ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 47.9ms (= 2394.3ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 0.82 (= 47.9ms / 58.7ms)

@oneflow-ci-bot oneflow-ci-bot merged commit e010b7d into master Aug 29, 2021
@oneflow-ci-bot oneflow-ci-bot deleted the dev_cc_graph_jobpass_var branch August 29, 2021 15:05
@oneflow-ci-bot oneflow-ci-bot removed their request for review August 29, 2021 15:05