
[Feat.] NNGraph new eager tensor for new variable created in JobPass #6091

Merged
merged 3 commits into master on Aug 29, 2021

Conversation

@chengtbf (Contributor) commented on Aug 28, 2021

Important feature and bug fix:

NNGraph now supports, for each Variable Conf created inside a JobPass:

  • constructing a new EagerTensor,
  • initializing it,
  • storing that tensor in the SessionContext,
  • and binding it to the corresponding Regst after the Runtime starts.

Verified locally that AMP training works correctly on a single GPU and on multiple GPUs (2 GPUs).

This PR fixes the issue where the DynamicLossScale initial value in AMP training was 0 (it should be 2^30).

Previously, no tensors were created or initialized for the variables introduced in these JobPasses. It just so happened that the initializers of almost all such variables (e.g. BN's model.bn1.weight-momentum) are constant 0, so actual training was not affected. This PR fixes that bug.
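
For reference, a minimal self-contained C++ sketch of this flow (every type and name below, e.g. VariableConf, SessionCtx, Regst and the example variable name, is a hypothetical stand-in, not the actual OneFlow implementation):

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct VariableConf {            // stand-in for a variable op conf created by a JobPass
  std::string name;
  int64_t elem_cnt = 1;
  double constant_value = 0.0;   // e.g. 2^30 for the dynamic loss scale
};

struct EagerTensor { std::vector<double> data; };

struct SessionCtx {              // stand-in for SessionContext: owns the tensors
  std::map<std::string, std::shared_ptr<EagerTensor>> var_name2tensor;
};

struct Regst { const double* mem_ptr = nullptr; };  // stand-in for a Regst

// 1) + 2) construct a new eager tensor and initialize it from the conf
std::shared_ptr<EagerTensor> NewAndInitTensor(const VariableConf& conf) {
  auto tensor = std::make_shared<EagerTensor>();
  tensor->data.assign(conf.elem_cnt, conf.constant_value);
  return tensor;
}

// 3) store the tensor in the session so it outlives graph compilation
void StoreInSession(SessionCtx* session, const std::string& var_name,
                    std::shared_ptr<EagerTensor> tensor) {
  session->var_name2tensor[var_name] = std::move(tensor);
}

// 4) after the runtime starts, bind the tensor's memory to the variable's Regst
void BindToRegst(const SessionCtx& session, const std::string& var_name,
                 Regst* regst) {
  regst->mem_ptr = session.var_name2tensor.at(var_name)->data.data();
}

int main() {
  // hypothetical example variable name, initialized to 2^30
  VariableConf loss_scale{"dynamic_loss_scale", 1, 1073741824.0};
  SessionCtx session;
  StoreInSession(&session, loss_scale.name, NewAndInitTensor(loss_scale));
  Regst regst;
  BindToRegst(session, loss_scale.name, &regst);
  std::cout << "initial loss scale = " << *regst.mem_ptr << std::endl;
  return 0;
}
```

Compiling and running this prints `initial loss scale = 1.07374e+09`, matching the expected 2^30 seen in the local test log below.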

TODO:

  • nn.Graph should support save/load.
  • nn.Graph should also record the tensors of these created variables (e.g. loss scale, good step counter) in its own state dict, and support saving and loading those variables.

@github-actions (Contributor)

CI failed, removing label automerge

@github-actions (Contributor)

Speed stats:
GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 128.1ms (= 6404.0ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 143.4ms (= 7170.2ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.12 (= 143.4ms / 128.1ms)

OneFlow resnet50 time: 74.7ms (= 3737.4ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 82.7ms (= 4135.7ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.11 (= 82.7ms / 74.7ms)

OneFlow resnet50 time: 48.4ms (= 2421.3ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 55.0ms (= 2748.7ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.14 (= 55.0ms / 48.4ms)

OneFlow resnet50 time: 42.7ms (= 2132.9ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 47.7ms (= 2383.4ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.12 (= 47.7ms / 42.7ms)

OneFlow resnet50 time: 40.5ms (= 2026.8ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 43.3ms (= 2163.1ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.07 (= 43.3ms / 40.5ms)

OneFlow resnet50 time: 144.3ms (= 7216.3ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 150.0ms (= 7497.9ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.04 (= 150.0ms / 144.3ms)

OneFlow resnet50 time: 94.0ms (= 4701.2ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 92.6ms (= 4628.5ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 0.98 (= 92.6ms / 94.0ms)

OneFlow resnet50 time: 67.0ms (= 3348.4ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 62.4ms (= 3117.8ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 0.93 (= 62.4ms / 67.0ms)

OneFlow resnet50 time: 62.7ms (= 3133.5ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 47.8ms (= 2392.2ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 0.76 (= 47.8ms / 62.7ms)

OneFlow resnet50 time: 55.5ms (= 2776.7ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 49.1ms (= 2452.8ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 0.88 (= 49.1ms / 55.5ms)

@oneflow-ci-bot oneflow-ci-bot removed their request for review August 28, 2021 19:40
@github-actions (Contributor)

Speed stats:
GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 128.0ms (= 6400.5ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 140.4ms (= 7019.3ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.10 (= 140.4ms / 128.0ms)

OneFlow resnet50 time: 74.5ms (= 3723.9ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 85.6ms (= 4278.8ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.15 (= 85.6ms / 74.5ms)

OneFlow resnet50 time: 47.6ms (= 2378.0ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 58.4ms (= 2918.7ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.23 (= 58.4ms / 47.6ms)

OneFlow resnet50 time: 45.4ms (= 2267.6ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 50.9ms (= 2545.1ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.12 (= 50.9ms / 45.4ms)

OneFlow resnet50 time: 42.3ms (= 2113.7ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 44.3ms (= 2215.2ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.05 (= 44.3ms / 42.3ms)

OneFlow resnet50 time: 142.8ms (= 7139.0ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 150.3ms (= 7514.0ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.05 (= 150.3ms / 142.8ms)

OneFlow resnet50 time: 94.7ms (= 4733.4ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 94.6ms (= 4732.0ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.00 (= 94.6ms / 94.7ms)

OneFlow resnet50 time: 67.4ms (= 3367.5ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 62.3ms (= 3116.7ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 0.93 (= 62.3ms / 67.4ms)

OneFlow resnet50 time: 60.9ms (= 3046.1ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 51.4ms (= 2569.6ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 0.84 (= 51.4ms / 60.9ms)

OneFlow resnet50 time: 60.8ms (= 3038.8ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 50.8ms (= 2540.0ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 0.84 (= 50.8ms / 60.8ms)

@chengtbf (Contributor, Author) commented:

Local test log:

-------------------- end of arguments ---------------------
***** Model Init *****
***** Model Init Finish, time escapled: 0.12776 s *****
 cclog: loss_scale = 1.07374e+09
rank: 1, epoch: 0, iter: 1, job: train, loss: 7.06198, top1: 0.00000, 
rank: 0, epoch: 0, iter: 1, job: train, loss: 7.06198, top1: 0.00000, 
 cclog: loss_scale = 5.36871e+08
rank: 1, epoch: 0, iter: 2, job: train, loss: 6.98842, top1: 0.00000, 
rank: 0, epoch: 0, iter: 2, job: train, loss: 6.98842, top1: 0.00000, 
 cclog: loss_scale = 2.68435e+08
rank: 1, epoch: 0, iter: 3, job: train, loss: 6.98012, top1: 0.00000, 
rank: 0, epoch: 0, iter: 3, job: train, loss: 6.98012, top1: 0.00000, 
 cclog: loss_scale = 1.34218e+08
rank: 1, epoch: 0, iter: 4, job: train, loss: 7.07354, top1: 0.00000, 
rank: 0, epoch: 0, iter: 4, job: train, loss: 7.07354, top1: 0.00000, 
 cclog: loss_scale = 6.71089e+07
rank: 1, epoch: 0, iter: 5, job: train, loss: 7.06426, top1: 0.00000, 
rank: 0, epoch: 0, iter: 5, job: train, loss: 7.06426, top1: 0.00000, 
 cclog: loss_scale = 3.35544e+07
rank: 1, epoch: 0, iter: 6, job: train, loss: 7.06677, top1: 0.00000, 
rank: 0, epoch: 0, iter: 6, job: train, loss: 7.06677, top1: 0.00000, 
 cclog: loss_scale = 1.67772e+07
rank: 1, epoch: 0, iter: 7, job: train, loss: 7.08927, top1: 0.00000, 
rank: 0, epoch: 0, iter: 7, job: train, loss: 7.08927, top1: 0.00000, 
 cclog: loss_scale = 8.38861e+06
rank: 1, epoch: 0, iter: 8, job: train, loss: 7.03656, top1: 0.00000, 
rank: 0, epoch: 0, iter: 8, job: train, loss: 7.03656, top1: 0.00000, 
 cclog: loss_scale = 4.1943e+06
rank: 1, epoch: 0, iter: 9, job: train, loss: 7.13277, top1: 0.00000, 
rank: 0, epoch: 0, iter: 9, job: train, loss: 7.13277, top1: 0.00000, 
 cclog: loss_scale = 2.09715e+06
rank: 1, epoch: 0, iter: 10, job: train, loss: 7.06356, top1: 0.00000, 
rank: 0, epoch: 0, iter: 10, job: train, loss: 7.06356, top1: 0.00000, 
 cclog: loss_scale = 1.04858e+06
rank: 1, epoch: 0, iter: 11, job: train, loss: 7.08941, top1: 0.00000, 
rank: 0, epoch: 0, iter: 11, job: train, loss: 7.08941, top1: 0.00000, 
 cclog: loss_scale = 524288
rank: 1, epoch: 0, iter: 12, job: train, loss: 7.00824, top1: 0.00000, 
rank: 0, epoch: 0, iter: 12, job: train, loss: 7.00824, top1: 0.00000, 
 cclog: loss_scale = 262144
rank: 1, epoch: 0, iter: 13, job: train, loss: 7.03979, top1: 0.00000, 
rank: 0, epoch: 0, iter: 13, job: train, loss: 7.03979, top1: 0.00000, 
 cclog: loss_scale = 131072
rank: 1, epoch: 0, iter: 14, job: train, loss: 7.01718, top1: 0.00000, 
rank: 0, epoch: 0, iter: 14, job: train, loss: 7.01718, top1: 0.00000, 
 cclog: loss_scale = 65536
rank: 1, epoch: 0, iter: 15, job: train, loss: 7.11946, top1: 0.00000, 
rank: 0, epoch: 0, iter: 15, job: train, loss: 7.11946, top1: 0.00000, 
 cclog: loss_scale = 32768
rank: 1, epoch: 0, iter: 16, job: train, loss: 7.06458, top1: 0.00000, 
rank: 0, epoch: 0, iter: 16, job: train, loss: 7.06458, top1: 0.00000, 
 cclog: loss_scale = 16384
rank: 1, epoch: 0, iter: 17, job: train, loss: 7.08910, top1: 0.00000, 
rank: 0, epoch: 0, iter: 17, job: train, loss: 7.08910, top1: 0.00000, 
 cclog: loss_scale = 16384
rank: 1, epoch: 0, iter: 18, job: train, loss: 7.08879, top1: 0.00000, rank: 0, epoch: 0, iter: 18, job: train, loss: 7.08879, top1: 0.00000,

@@ -236,7 +236,12 @@ void PlanUtil::GenMemBlockAndChunkWithVariableOpNames4Plan(
       .op_conf();
   if (!op_conf.has_variable_conf()) { return false; }
   const std::string& var_name = op_conf.name();
-  if (variable_op_names.find(var_name) == variable_op_names.end()) { return false; }
+  if (variable_op_names.find(var_name) == variable_op_names.end()) {
Contributor:

This is where variable ops that have no corresponding eager tensor can be detected, right?

@chengtbf (Contributor, Author):

Yes, which is why I emit a warning here. In theory there should not be any at the moment: every Variable op should find its corresponding eager tensor.
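
A rough self-contained C++ sketch of the check being discussed (the function name and warning text are illustrative, not the actual OneFlow code):

```cpp
#include <iostream>
#include <set>
#include <string>

// variable_op_names: the set of variable ops that already have an eager tensor.
// Returns false and warns when a variable op has no corresponding eager tensor;
// this path is expected to be unreachable today.
bool HasEagerTensor(const std::set<std::string>& variable_op_names,
                    const std::string& var_name) {
  if (variable_op_names.find(var_name) == variable_op_names.end()) {
    std::cerr << "WARNING: variable op `" << var_name
              << "` has no corresponding eager tensor." << std::endl;
    return false;
  }
  return true;
}
```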

} else if (var_conf.initializer().has_constant_int_conf()) {
value = var_conf.initializer().constant_int_conf().value();
} else {
OF_UNIMPLEMENTED();
Contributor:

This branch is unreachable.

@chengtbf (Contributor, Author):

I expect we may need to support more initializers later (e.g. non-constant ones), so I wrote out all the if branches here up front.
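
To illustrate that branch structure, a minimal self-contained C++ sketch (the InitializerConf struct is a hypothetical stand-in for the real proto message; only the branching mirrors the diff):

```cpp
#include <cstdlib>
#include <iostream>

// Hypothetical stand-in for the initializer conf: exactly one kind is set.
struct InitializerConf {
  enum class Kind { kConstant, kConstantInt, kOther } kind;
  double constant_value = 0.0;
  long long constant_int_value = 0;
};

double InitialValueOf(const InitializerConf& conf) {
  if (conf.kind == InitializerConf::Kind::kConstant) {
    return conf.constant_value;
  } else if (conf.kind == InitializerConf::Kind::kConstantInt) {
    return static_cast<double>(conf.constant_int_value);
  } else {
    // Mirrors OF_UNIMPLEMENTED(): currently unreachable, but kept so that a
    // future non-constant initializer fails loudly instead of silently.
    std::cerr << "unimplemented initializer kind" << std::endl;
    std::abort();
  }
}
```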

@strint (Contributor) left a comment:

lgtm

@github-actions (Contributor)

Speed stats:
GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 127.9ms (= 6396.1ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 140.8ms (= 7040.6ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.10 (= 140.8ms / 127.9ms)

OneFlow resnet50 time: 74.6ms (= 3729.4ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 84.3ms (= 4215.6ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.13 (= 84.3ms / 74.6ms)

OneFlow resnet50 time: 49.5ms (= 2474.1ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 56.0ms (= 2801.8ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.13 (= 56.0ms / 49.5ms)

OneFlow resnet50 time: 42.0ms (= 2101.7ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 46.4ms (= 2319.8ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.10 (= 46.4ms / 42.0ms)

OneFlow resnet50 time: 38.0ms (= 1902.1ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 42.5ms (= 2123.8ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.12 (= 42.5ms / 38.0ms)

OneFlow resnet50 time: 142.8ms (= 7138.9ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 149.0ms (= 7450.1ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.04 (= 149.0ms / 142.8ms)

OneFlow resnet50 time: 90.7ms (= 4536.9ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 93.2ms (= 4660.8ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.03 (= 93.2ms / 90.7ms)

OneFlow resnet50 time: 68.0ms (= 3402.2ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 65.3ms (= 3266.7ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 0.96 (= 65.3ms / 68.0ms)

OneFlow resnet50 time: 58.5ms (= 2927.5ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 52.4ms (= 2622.0ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 0.90 (= 52.4ms / 58.5ms)

OneFlow resnet50 time: 58.7ms (= 2933.9ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 47.9ms (= 2394.3ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 0.82 (= 47.9ms / 58.7ms)

@oneflow-ci-bot oneflow-ci-bot merged commit e010b7d into master Aug 29, 2021
@oneflow-ci-bot oneflow-ci-bot deleted the dev_cc_graph_jobpass_var branch August 29, 2021 15:05
@oneflow-ci-bot oneflow-ci-bot removed their request for review August 29, 2021 15:05