Logical slice ops support all nd_sbp #8313

Merged
merged 35 commits into from
May 31, 2022

@wyg1997 wyg1997 commented May 26, 2022

LogicalSliceAssign/LogicalSlice now support all nd_sbp inputs.

LogicalSliceAssign

LogicalSliceAssign supports the following sbp combinations (y = logical_slice_assign(ref, slice, value)):

ref   value   y
B     B       B
P     P       P
S     B       S

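In other words, value must hold the complete data on every rank. The kernel then derives which block of data needs to be copied: Broadcast and PartialSum copy everything, while Split requires the block to be derived.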
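The S + B -> S case can be sketched in plain Python. This is only an illustration under stated assumptions, not the OneFlow kernel: the tensor is 1-D, ref is split evenly along its only dimension, the step is positive, and the names `local_slice_assign` and `split` are hypothetical helpers. Each rank intersects the slice with the index range it owns and copies only the matching elements of the broadcast value.

```python
def local_slice_assign(local_ref, lo, start, stop, step, value):
    """Apply ref[start:stop:step] = value to the shard holding ref[lo:lo+len(local_ref)].

    Hypothetical helper: 1-D only, positive step, value is fully present (Broadcast).
    """
    hi = lo + len(local_ref)
    out = list(local_ref)
    # First index of the slice that falls at or after this shard's lower bound.
    if start >= lo:
        first = start
    else:
        first = start + -(-(lo - start) // step) * step  # ceiling division
    # Copy only the slice elements that live on this rank.
    for i in range(first, min(stop, hi), step):
        out[i - lo] = value[(i - start) // step]
    return out


def split(x, world_size):
    """Evenly split a list across ranks (assumes len(x) is divisible by world_size)."""
    n = len(x) // world_size
    return [x[r * n:(r + 1) * n] for r in range(world_size)]


ref = list(range(8))
value = [100, 101, 102]          # Broadcast: every rank sees all of value
shards = split(ref, 2)           # Split(0) over 2 ranks
new_shards = [
    local_slice_assign(shard, rank * 4, 1, 7, 2, value)
    for rank, shard in enumerate(shards)
]
result = new_shards[0] + new_shards[1]

# Reference result computed on the unsplit tensor.
expected = list(ref)
expected[1:7:2] = value
print(result == expected)
```

Because every rank already has all of value, no communication is needed; the only per-rank work is deriving which part of the slice it owns.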

LogicalSlice

LogicalSlice supports the following sbp combinations (y = logical_slice(x, slice)):

x   y
B   B
S   P
P   P

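B->B and P->P are straightforward: the sliced block is copied in full. S->P works as follows: allocate the full memory for y and initialize every value to 0; each rank then derives the SliceView that its part of x occupies in that memory and copies the corresponding data. Illustration: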

image
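The S->P rule can also be checked with a small pure-Python sketch. It is not the actual kernel: it assumes a 1-D tensor split evenly over two ranks and a positive step, and `local_slice_to_partial` is a hypothetical name. Each rank allocates the full y filled with zeros and writes only the elements it owns, so summing the per-rank results recovers the logical slice.

```python
def local_slice_to_partial(local_x, lo, start, stop, step):
    """Compute this rank's PartialSum contribution to y = x[start:stop:step].

    Hypothetical helper: local_x holds x[lo:lo+len(local_x)]; 1-D, positive step.
    """
    hi = lo + len(local_x)
    indices = range(start, stop, step)
    y = [0] * len(indices)            # full-size y, initialized to 0
    for k, i in enumerate(indices):
        if lo <= i < hi:              # element i lives on this rank
            y[k] = local_x[i - lo]
    return y


x = [10, 11, 12, 13, 14, 15, 16, 17]
shards = [x[0:4], x[4:8]]             # Split(0) over 2 ranks
partials = [
    local_slice_to_partial(shard, rank * 4, 1, 7, 2)
    for rank, shard in enumerate(shards)
]
# Element-wise sum over ranks is what PartialSum means logically.
summed = [sum(vals) for vals in zip(*partials)]
print(partials, summed, summed == x[1:7:2])
```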

Differences from SliceUpdate/SliceOp

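  1. SliceUpdate and SliceOp do not cover every sbp; they can only Split on a FullSlice dimension. For example, with x[:2, :] = 1 the input x can only be Split(1). But SliceUpdate is an inplace operation, so an input with any other sbp cannot be boxed and the op cannot run.
  2. On a FullSlice dimension, SliceUpdate does support ref(Split) + value(Split) = y(Split), which uses less GPU memory than LogicalSliceUpdate.
  3. So the two differ in where they apply: LogicalSlice is for cases where the input cannot be boxed (an inplace op, or an upstream op that set nd_sbp_constraints); in every other case the ordinary SliceOp is enough. Tensor.setitem can only use LogicalSliceUpdate.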
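To make the FullSlice restriction in item 1 concrete, here is a minimal sketch that lists which dimensions of an indexing expression are full slices, that is, the only dimensions on which SliceUpdate/SliceOp could Split. The helper `full_slice_dims` and its (start, stop, step) tuple format are assumptions made for illustration and are not part of OneFlow.

```python
def full_slice_dims(shape, slices):
    """Return the dimensions whose slice covers the whole axis.

    Hypothetical helper: slices is a list of (start, stop, step) tuples,
    one per dimension, with non-negative bounds.
    """
    dims = []
    for dim, (size, (start, stop, step)) in enumerate(zip(shape, slices)):
        if start == 0 and stop >= size and step == 1:
            dims.append(dim)
    return dims


# x[:2, :] = 1 on a tensor of shape (4, 6): dim 0 is partial, dim 1 is full.
print(full_slice_dims((4, 6), [(0, 2, 1), (0, 6, 1)]))
```

For this example only dimension 1 qualifies, which matches the statement that x can only be Split(1).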

TODO

  1. Bind the backward pass for LogicalSlice/LogicalSliceUpdate.
  2. Support Split on FullSlice dimensions inside the LogicalSlice/LogicalSliceUpdate kernels, so that they are fully compatible with Slice/SliceUpdate.

@hjchen2
Contributor

hjchen2 commented May 28, 2022

If slice update is not a full slice and value is also restricted to broadcast, wouldn't logical slice update and slice update then be the same, except that the kernel has to run the derivation once more?

@wyg1997
Contributor Author

wyg1997 commented May 28, 2022

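> If slice update is not a full slice and value is also restricted to broadcast, wouldn't logical slice update and slice update then be the same, except that the kernel has to run the derivation once more?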

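The SliceUpdate kernel does not implement any complex logic: it only guarantees that all the required data is on the local rank and then copies value straight over. It does not support the S+B->S computation, and the kernel only works correctly for a FullSlice, which is why LogicalSliceAssign was added later.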

@github-actions
Contributor

CI failed when running job: cpu-module. PR label automerge has been removed

@wyg1997
Contributor Author

wyg1997 commented May 31, 2022

test_case = <test_consistent_argmin.TestArgmin testMethod=test_argmin> produced a wrong result. I could not reproduce it locally and it is unrelated to this PR, so I am rerunning CI.

@wyg1997 wyg1997 requested review from oneflow-ci-bot and removed request for oneflow-ci-bot May 31, 2022 02:03
@github-actions
Contributor

CI failed when running job: cpu-module. PR label automerge has been removed

@github-actions
Contributor

Speed stats:
GPU Name: NVIDIA GeForce GTX 1080 

❌ OneFlow resnet50 time: 130.5ms (= 13052.8ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 144.1ms (= 14405.7ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.10 (= 144.1ms / 130.5ms)

OneFlow resnet50 time: 78.7ms (= 7873.7ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 89.1ms (= 8907.8ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.13 (= 89.1ms / 78.7ms)

OneFlow resnet50 time: 53.6ms (= 10715.4ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 53.9ms (= 10772.6ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.01 (= 53.9ms / 53.6ms)

OneFlow resnet50 time: 40.7ms (= 8149.0ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 44.2ms (= 8843.8ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.09 (= 44.2ms / 40.7ms)

OneFlow resnet50 time: 37.3ms (= 7467.5ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 35.2ms (= 7045.9ms / 200, input_shape=[1, 3, 224, 224])
❌ Relative speed: 0.94 (= 35.2ms / 37.3ms)




❌ OneFlow resnet50 time: 151.7ms (= 15170.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 176.3ms (= 17631.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 176.3ms / 151.7ms)

OneFlow resnet50 time: 96.9ms (= 9692.4ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 112.3ms (= 11234.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 112.3ms / 96.9ms)

OneFlow resnet50 time: 71.8ms (= 14361.4ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 87.4ms (= 17479.7ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.22 (= 87.4ms / 71.8ms)

OneFlow resnet50 time: 57.9ms (= 11585.3ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.3ms (= 15659.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.35 (= 78.3ms / 57.9ms)

OneFlow resnet50 time: 55.0ms (= 11004.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 70.0ms (= 14000.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.27 (= 70.0ms / 55.0ms)

@wyg1997 wyg1997 requested review from oneflow-ci-bot and removed request for oneflow-ci-bot May 31, 2022 06:19
@wyg1997 wyg1997 force-pushed the feat-logical_slice_ops_support_all_sbp branch from 28e1124 to 44be540 Compare May 31, 2022 08:10
@wyg1997
Contributor Author

wyg1997 commented May 31, 2022

The argmin test is still failing; rerunning.

@wyg1997 wyg1997 requested review from oneflow-ci-bot and removed request for oneflow-ci-bot May 31, 2022 10:19
@github-actions
Contributor

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8313/

@mergify mergify bot merged commit e2821d9 into master May 31, 2022
@mergify mergify bot deleted the feat-logical_slice_ops_support_all_sbp branch May 31, 2022 11:03