Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use attr nd_sbp to check consistent #6222

Merged
merged 9 commits into from
Sep 10, 2021
Merged

Conversation

leaves-zwx
Copy link
Contributor

  • 使用 data reader op 的属性 nd_sbp 来判断 op 是否是 consistent,如果 nd_sbp 为空,说明 op 不为 consistent,此时我们认为 data reader op 工作在 DDP 下
  • 导出 amp_white_identity function

@leaves-zwx leaves-zwx requested a review from chengtbf September 9, 2021 15:23
@leaves-zwx leaves-zwx self-assigned this Sep 9, 2021
@oneflow-ci-bot oneflow-ci-bot removed their request for review September 9, 2021 15:46
@github-actions
Copy link
Contributor

github-actions bot commented Sep 9, 2021

Speed stats:
GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 127.7ms (= 6384.1ms / 50, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 141.0ms (= 7049.2ms / 50, input_shape=[16, 3, 224, 224])
Relative speed: 1.10 (= 141.0ms / 127.7ms)

OneFlow resnet50 time: 74.4ms (= 3717.9ms / 50, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 84.4ms (= 4218.7ms / 50, input_shape=[8, 3, 224, 224])
Relative speed: 1.13 (= 84.4ms / 74.4ms)

OneFlow resnet50 time: 48.5ms (= 2422.9ms / 50, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 60.3ms (= 3015.5ms / 50, input_shape=[4, 3, 224, 224])
Relative speed: 1.24 (= 60.3ms / 48.5ms)

OneFlow resnet50 time: 43.6ms (= 2178.0ms / 50, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 48.7ms (= 2432.8ms / 50, input_shape=[2, 3, 224, 224])
Relative speed: 1.12 (= 48.7ms / 43.6ms)

OneFlow resnet50 time: 38.9ms (= 1946.6ms / 50, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 38.7ms (= 1934.8ms / 50, input_shape=[1, 3, 224, 224])
Relative speed: 0.99 (= 38.7ms / 38.9ms)

if (IsMirroredParallelContext(ctx->parallel_ctx())) {
// NOTE(zwx): OFRecordDataset is not consistent since attr nd_sbp is empty,
// we assume that it works in DDP
auto nd_sbp_str_vec = ctx->Attr<std::vector<std::string>>("nd_sbp");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

反正都是从 KernelInitContext 里取,是不是还是可以抽成一个单独的函数?IsMirroredKernelContext 之类的?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

可以

// NOTE(zwx): OFRecordDataset is not consistent since attr nd_sbp is empty,
// we assume that it works in DDP
bool is_local = false;
if (ctx->op_type_name() == "OFRecordReader") {
Copy link
Contributor Author

@leaves-zwx leaves-zwx Sep 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里对 OFRecordReader 做了特判,因为 OFRecordDataset 除了被 OFRecordDataReader 使用还被 OFRecordImageClassificationDataReader 使用。

前者需要兼容 DDP 的用法,所以这里判断了 is local;而后者目前还未被移动到 multi-client 下,且其没有 nd_sbp attr,而且看起来可能是个将要被废弃的接口,所以暂时不对其做修改(添加 nd_sbp attr),而是选择在这里做特判,如果 OFRecordImageClassificationDataReader 将来被删除或移植到 multi-client 下,这里的特判可以去掉。

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里差一点就分析到问题的本质了。不应该加这个特判;而是有两种解决方案:

  1. 判断是否为 Multi-Client
  2. 修改 Single-Client 下的 reader op 的 python 脚本,全部加上 nd_sbp(直接加默认值 S(0) 就行 )的默认配置。

我倾向于选择方案 2,并删掉此处的特判。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OFRecordImageClassificationDataReader 没这个 attr nd_sbp,不是没传

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

好的

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里的问题在于没有判断 IsMultiClient,如果是 single client,那么 is_local 应该总为 false,提交了一个修复 pr #6288

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

你可以先这么修。

更合适的方案应该是修改 Single-Client 下的 op 使之支持这个 nd_sbp 的约束吧。

同时还需要检查 parallel num。在 is local 下, parallel num 一定是 1; 如果 parallel num > 1,那么一定是 Consistent 模式;但是如果 parallel num = 1 ,则不一定是 local 模式。所以 parallel num 是判断的必要不充分条件?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

嗯,不过 local 模式下,本来 parallel num 就是人工设置的一个假的,主要是为了兼容旧的写法,新的方式下 local 应该避免访问 parallel_ctx

@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot September 10, 2021 05:37
@github-actions
Copy link
Contributor

Speed stats:
GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 128.3ms (= 6415.9ms / 50, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 142.7ms (= 7133.5ms / 50, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.11 (= 142.7ms / 128.3ms)

OneFlow resnet50 time: 74.4ms (= 3718.4ms / 50, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 85.0ms (= 4252.4ms / 50, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.14 (= 85.0ms / 74.4ms)

OneFlow resnet50 time: 48.9ms (= 2446.2ms / 50, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 59.3ms (= 2962.9ms / 50, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.21 (= 59.3ms / 48.9ms)

OneFlow resnet50 time: 42.1ms (= 2106.4ms / 50, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 49.3ms (= 2463.7ms / 50, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.17 (= 49.3ms / 42.1ms)

OneFlow resnet50 time: 43.4ms (= 2170.6ms / 50, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 43.0ms (= 2151.6ms / 50, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 0.99 (= 43.0ms / 43.4ms)

OneFlow resnet50 time: 153.4ms (= 7670.9ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 166.8ms (= 8338.4ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.09 (= 166.8ms / 153.4ms)

OneFlow resnet50 time: 99.3ms (= 4963.8ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 99.8ms (= 4992.3ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.01 (= 99.8ms / 99.3ms)

OneFlow resnet50 time: 77.9ms (= 3893.2ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 75.5ms (= 3774.0ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.97 (= 75.5ms / 77.9ms)

OneFlow resnet50 time: 79.7ms (= 3986.1ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 74.6ms (= 3727.8ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.94 (= 74.6ms / 79.7ms)

OneFlow resnet50 time: 77.2ms (= 3859.9ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 64.8ms (= 3237.9ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.84 (= 64.8ms / 77.2ms)

@github-actions
Copy link
Contributor

CI failed, removing label automerge

@oneflow-ci-bot oneflow-ci-bot removed their request for review September 10, 2021 06:40
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot September 10, 2021 11:38
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot September 10, 2021 12:30
@github-actions
Copy link
Contributor

Speed stats:
GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 128.2ms (= 6411.7ms / 50, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 138.6ms (= 6931.8ms / 50, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.08 (= 138.6ms / 128.2ms)

OneFlow resnet50 time: 74.3ms (= 3714.7ms / 50, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 83.4ms (= 4169.3ms / 50, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.12 (= 83.4ms / 74.3ms)

OneFlow resnet50 time: 51.0ms (= 2550.2ms / 50, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 66.8ms (= 3339.2ms / 50, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.31 (= 66.8ms / 51.0ms)

OneFlow resnet50 time: 43.2ms (= 2160.3ms / 50, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 51.4ms (= 2570.4ms / 50, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.19 (= 51.4ms / 43.2ms)

OneFlow resnet50 time: 51.9ms (= 2596.0ms / 50, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 46.7ms (= 2336.1ms / 50, input_shape=[1, 3, 224, 224])
❌ Relative speed: 0.90 (= 46.7ms / 51.9ms)

OneFlow resnet50 time: 153.2ms (= 7658.8ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 164.4ms (= 8220.7ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.07 (= 164.4ms / 153.2ms)

OneFlow resnet50 time: 96.0ms (= 4797.5ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 105.6ms (= 5281.6ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.10 (= 105.6ms / 96.0ms)

OneFlow resnet50 time: 80.1ms (= 4004.1ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.0ms (= 3899.2ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.97 (= 78.0ms / 80.1ms)

OneFlow resnet50 time: 77.7ms (= 3885.9ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 75.3ms (= 3767.0ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.97 (= 75.3ms / 77.7ms)

OneFlow resnet50 time: 76.0ms (= 3801.4ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 68.1ms (= 3403.5ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.90 (= 68.1ms / 76.0ms)

@oneflow-ci-bot oneflow-ci-bot merged commit e1a1656 into master Sep 10, 2021
@oneflow-ci-bot oneflow-ci-bot deleted the fix_reader_pipeline branch September 10, 2021 13:36
// OFRecordImageClassificationDataReader had supported DDP (add attr nd_sbp)
// or been deprecated.
if (ctx->op_type_name() == "OFRecordReader") {
auto nd_sbp_str_vec = ctx->Attr<std::vector<std::string>>("nd_sbp");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个判断条件仍然是有问题的。在 Single-Client 下,老版本的 reader op 没有配置 nd_sbp 属性(因为 nd_sbp 是后来单独加的)。 但 Single-Client 只支持 Consistent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

是,已在 #6288 提交了修复,去判断是否在 multi-client 下,如果在 single client 下,is_local 总为 false

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants