Fix scatter ops eager global bug and add test #7807

Merged
mergify bot merged 42 commits into master from test-eager_global_scatter_ops
Apr 13, 2022

Conversation

wyg1997
Contributor

@wyg1997 wyg1997 commented Mar 15, 2022

Fix the SBP inference bug in ScatterAdd, ScatterUpdate, and ScatterScalarUpdate, and add related tests.
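
For readers unfamiliar with the ops named above, here is a minimal pure-Python sketch of scatter-add on a 2-D list. It assumes the PyTorch-style convention (`out[index[i][j]][j] += src[i][j]` for `dim=0`). The helper `scatter_add_2d` is hypothetical and is not OneFlow's implementation, and the PR does not describe the bug in detail, so this only illustrates the op's semantics.

```python
def scatter_add_2d(inp, dim, index, src):
    """Hypothetical reference helper: scatter-add on nested lists.

    dim=0: out[index[i][j]][j] += src[i][j]
    dim=1: out[i][index[i][j]] += src[i][j]
    """
    if dim not in (0, 1):
        raise ValueError("this sketch only supports dim 0 or 1")
    out = [row[:] for row in inp]  # copy so the input is left untouched
    for i, index_row in enumerate(index):
        for j, target in enumerate(index_row):
            if dim == 0:
                out[target][j] += src[i][j]
            else:
                out[i][target] += src[i][j]
    return out


inp = [[0, 0], [0, 0], [0, 0]]
index = [[2, 0], [2, 1]]
src = [[1, 2], [3, 4]]
result = scatter_add_2d(inp, 0, index, src)
print(result)
```

Both source rows write into output row 2 here. If the input were split along the scatter dimension across ranks, the rank applying an update might not hold the target row, which is one general reason the SBP signature of these ops needs care. This is an illustration of that constraint, not a description of the specific fix in this PR.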

Contributor

@Yipeng1994 Yipeng1994 left a comment

Very nice.

from oneflow.test_utils.automated_test_util import *


@autotest(n=10, auto_backward=True, check_graph=False)
Contributor

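Speaking of this iteration count, does CI use 10 iterations everywhere now?
I remember Shenghang said last time that we need to speed up CI. Wouldn't 1 or 3 iterations be enough? @hjchen2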

Contributor Author

@wyg1997 wyg1997 Mar 15, 2022

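Because the test contains several random_sbp calls, 1 or 3 iterations do not give enough coverage. This op is fast, so 10 iterations should be just enough to cover the cases without slowing CI down.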
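
The coverage argument above can be made concrete with a small probability sketch. It assumes each test run draws one SBP combination uniformly and independently from `NUM_COMBOS` possibilities. Both the uniform model and the value 4 are assumptions for illustration; the real number of combinations and the sampling behaviour of `random_sbp` are not stated in this thread.

```python
from math import comb


def full_coverage_probability(num_combos: int, num_runs: int) -> float:
    """Probability that num_runs uniform draws hit every one of num_combos
    options at least once (inclusion-exclusion over the missed options)."""
    k, n = num_combos, num_runs
    return sum((-1) ** i * comb(k, i) * ((k - i) / k) ** n for i in range(k + 1))


NUM_COMBOS = 4  # hypothetical number of distinct SBP combinations

for runs in (1, 3, 10):
    probability = full_coverage_probability(NUM_COMBOS, runs)
    print(f"n={runs}: P(all combinations seen) = {probability:.3f}")
```

Under these assumptions, any run count below the number of combinations cannot cover them all, which is consistent with the point that 1 or 3 iterations are too few.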

Contributor

Looping 10 times is fine here, because the caller of this function only iterates over placements, and that iteration count is fairly small.
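
A rough sketch of the cost argument: the total number of test-body executions is the number of placements times `n`. The `repeat` decorator below is a stand-in that only mimics the repetition behaviour of `@autotest(n=10, ...)`, and the placement strings are hypothetical. Neither is OneFlow's actual test utility.

```python
import functools


def repeat(n):
    """Stand-in decorator: run the wrapped function n times per call."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for _ in range(n):
                fn(*args, **kwargs)
        return wrapper
    return decorator


calls = []


@repeat(n=10)
def check_scatter(placement):
    # A real test would build global tensors on this placement and compare results.
    calls.append(placement)


placements = ["cpu:0-1", "cuda:0-1"]  # hypothetical placement list
for placement in placements:
    check_scatter(placement)

total = len(calls)
print(f"{len(placements)} placements x 10 runs = {total} executions")
```

Because the outer loop is short, the product stays small even with `n=10`.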

@github-actions
Contributor

github-actions bot commented Apr 9, 2022

CI failed when running job: cuda-module-distributed-rank-1. PR label automerge has been removed

@wyg1997 wyg1997 requested review from oneflow-ci-bot and removed request for oneflow-ci-bot April 10, 2022 14:33
@github-actions
Contributor

CI failed when running job: cuda-module-distributed-rank-1. PR label automerge has been removed

@wyg1997 wyg1997 requested review from oneflow-ci-bot and removed request for oneflow-ci-bot April 11, 2022 02:05
@github-actions
Contributor

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/7807/

@github-actions
Contributor

CI failed when running job: cuda-module. PR label automerge has been removed

@github-actions
Contributor

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/7807/

@github-actions
Contributor

Speed stats:
GPU Name: GeForce GTX 1080 

✔️ OneFlow resnet50 time: 128.7ms (= 12866.8ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 142.3ms (= 14232.6ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.11 (= 142.3ms / 128.7ms)

OneFlow resnet50 time: 79.1ms (= 7910.2ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 83.7ms (= 8365.9ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.06 (= 83.7ms / 79.1ms)

OneFlow resnet50 time: 52.1ms (= 10417.6ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 58.8ms (= 11752.8ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.13 (= 58.8ms / 52.1ms)

OneFlow resnet50 time: 43.5ms (= 8697.0ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 47.8ms (= 9562.6ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.10 (= 47.8ms / 43.5ms)

OneFlow resnet50 time: 37.2ms (= 7441.4ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 44.0ms (= 8790.4ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.18 (= 44.0ms / 37.2ms)

OneFlow swin dataloader time: 0.249s (= 49.723s / 200, num_workers=1)
PyTorch swin dataloader time: 0.246s (= 49.270s / 200, num_workers=1)
✔️ Relative speed: 0.991 (= 0.246s / 0.249s)

OneFlow swin dataloader time: 0.070s (= 13.931s / 200, num_workers=4)
PyTorch swin dataloader time: 0.071s (= 14.298s / 200, num_workers=4)
✔️ Relative speed: 1.026 (= 0.071s / 0.070s)

OneFlow swin dataloader time: 0.035s (= 7.015s / 200, num_workers=8)
PyTorch swin dataloader time: 0.037s (= 7.491s / 200, num_workers=8)
✔️ Relative speed: 1.068 (= 0.037s / 0.035s)

✔️ OneFlow resnet50 time: 135.9ms (= 13592.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 156.0ms (= 15600.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 156.0ms / 135.9ms)

OneFlow resnet50 time: 88.8ms (= 8882.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 105.6ms (= 10555.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.19 (= 105.6ms / 88.8ms)

OneFlow resnet50 time: 60.3ms (= 12068.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 76.6ms (= 15315.7ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.27 (= 76.6ms / 60.3ms)

OneFlow resnet50 time: 50.6ms (= 10125.6ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 66.6ms (= 13317.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.32 (= 66.6ms / 50.6ms)

OneFlow resnet50 time: 50.3ms (= 10059.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 61.7ms (= 12338.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.23 (= 61.7ms / 50.3ms)

@mergify mergify bot merged commit 3ef2759 into master Apr 13, 2022
@mergify mergify bot deleted the test-eager_global_scatter_ops branch April 13, 2022 17:34
5 participants