source op support s and fixed generator bug #7571

Merged
merged 144 commits into from
May 2, 2022

grybd (Contributor) commented Feb 23, 2022

Purpose of this PR

Make random ops support global tensor consistency.

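  1. To let the randint op and rand op support B/S (broadcast/split) while keeping the global tensor consistent, the design is built on the utility function GetOpKernelRandomSeed(ctx). When the op runs under S, GetOpKernelRandomSeed(ctx) returns a different seed on each rank, so generator->set_current_seed(ctx->Attr<int64_t>("seed") + GetOpKernelRandomSeed(ctx)) gives every rank its own seed. This ensures that uniform-style kernels under S produce local tensors that follow the same distribution but hold different values. When the op runs under B, the kernel on every rank gets the same seed from GetOpKernelRandomSeed(ctx), so generator->set_current_seed(ctx->Attr<int64_t>("seed") + GetOpKernelRandomSeed(ctx)) gives every rank the same seed, which keeps the global tensor consistent.
  2. To let the randperm op and arange op support B/S while keeping the global tensor consistent, the current plan is to have all ranks share one seed and first generate the complete tensor on every rank. Then, based on the inferred physical shape, the utility function GetTensorSliceView4ParallelId(parallel_hierarchy, nd_sbp, logical_shape, parallel_id) is used to obtain the index range within the tensor that corresponds to this rank_id and physical shape, and the data at those positions is copied into this rank's local tensor.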
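The seeding rule in item 1 can be illustrated with a small single-process simulation. This is only a sketch: `op_kernel_random_seed` is a hypothetical stand-in for GetOpKernelRandomSeed(ctx), and the assumption that it returns the rank index under S and 0 under B is mine, not something the PR states. Python's `random.Random` stands in for the OneFlow generator.

```python
import random
from typing import List


def op_kernel_random_seed(sbp: str, rank: int) -> int:
    """Hypothetical stand-in for GetOpKernelRandomSeed(ctx).

    Assumption: the offset depends on the rank under split ("S") and is
    identical on every rank under broadcast ("B").
    """
    if sbp == "S":
        return rank
    if sbp == "B":
        return 0
    raise ValueError(f"unsupported sbp: {sbp}")


def local_uniform(seed_attr: int, sbp: str, rank: int, n: int) -> List[float]:
    """Draw n uniform samples on one rank, seeded as seed attr + per-rank offset."""
    gen = random.Random(seed_attr + op_kernel_random_seed(sbp, rank))
    return [gen.random() for _ in range(n)]


def simulate(seed_attr: int, sbp: str, world_size: int, n: int) -> List[List[float]]:
    """Return the local tensor of every rank for the given SBP."""
    return [local_uniform(seed_attr, sbp, rank, n) for rank in range(world_size)]
```

Under "B" every rank ends up with the same values, so the local tensors agree with each other; under "S" each rank draws from the same distribution with a different seed, so the shards differ.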

The approach above was worked out in a meeting with xiaoyu and yinggang.
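The plan in item 2 (shared seed, generate the complete tensor, then keep only this rank's slice) can be sketched as follows. The helper `slice_range_for_rank` is a hypothetical 1-D stand-in for GetTensorSliceView4ParallelId; it assumes a single hierarchy axis and a length that divides evenly across ranks, which is a simplification and not the real signature.

```python
import random
from typing import List, Tuple


def slice_range_for_rank(logical_len: int, world_size: int, rank: int) -> Tuple[int, int]:
    """Hypothetical 1-D stand-in for GetTensorSliceView4ParallelId.

    Assumption: the logical length divides evenly across the ranks.
    """
    if logical_len % world_size != 0:
        raise ValueError(f"cannot split length {logical_len} across {world_size} ranks")
    chunk = logical_len // world_size
    return rank * chunk, (rank + 1) * chunk


def local_randperm(n: int, seed: int, sbp: str, world_size: int, rank: int) -> List[int]:
    """Build the full permutation from the shared seed, then slice it for this rank."""
    gen = random.Random(seed)  # same seed on every rank
    full = list(range(n))
    gen.shuffle(full)
    if sbp == "B":
        return full  # broadcast: every rank keeps the whole tensor
    start, end = slice_range_for_rank(n, world_size, rank)
    return full[start:end]  # split: copy only this rank's index range
```

Because every rank builds the same full permutation, concatenating the "S" slices in rank order reproduces the "B" result. The cost is that each rank materializes the complete tensor before discarding most of it.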

fixed: https://github.com/Oneflow-Inc/OneTeam/issues/1167

grybd (Contributor Author) commented Feb 23, 2022

The test cases do not pass yet; I will take another look tomorrow. Please help review when you have time. @strint @wyg1997

grybd (Contributor Author) commented Feb 24, 2022

Running ./test_consistent_arange.py fails with the following error:
File "/home/fengdaochao/fdz/oneflow/oneflow/core/job/job_build_and_infer_ctx.cpp", line 630, in AddAndInferOp
CheckOpBlobSplitability(op, parallel_desc.parallel_num())
File "/home/fengdaochao/fdz/oneflow/oneflow/core/job/job_build_and_infer_ctx.cpp", line 335, in CheckOpBlobSplitability
Check failed: (current_shape.At(axis) % parallel_hierarchy->At(i)) == (0) (1 vs 0) op_name: arange-0 blob_name: out_0 cannot split blob by parallel_hierarchy: 2
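The failed check above says that the size of the split axis must be divisible by the corresponding entry of the parallel hierarchy. A minimal Python sketch of that rule, assuming the check walks the hierarchy axes in order and shrinks the shape after each split; the function name and argument layout here are illustrative, not OneFlow's API:

```python
from typing import Optional, Sequence


def check_blob_splitability(
    shape: Sequence[int],
    split_axes: Sequence[Optional[int]],
    hierarchy: Sequence[int],
) -> None:
    """Raise ValueError if the shape cannot be split evenly.

    split_axes[i] is the tensor axis split along hierarchy axis i,
    or None when that hierarchy axis uses broadcast.
    """
    if len(split_axes) != len(hierarchy):
        raise ValueError("split_axes and hierarchy must have the same length")
    current = list(shape)
    for i, axis in enumerate(split_axes):
        if axis is None:
            continue  # broadcast leaves the shape unchanged
        if current[axis] % hierarchy[i] != 0:
            raise ValueError(
                f"cannot split axis {axis} of size {current[axis]} "
                f"by parallel_hierarchy: {hierarchy[i]}"
            )
        current[axis] //= hierarchy[i]  # later splits see the reduced size
```

For example, `check_blob_splitability((5,), (0,), (2,))` raises because 5 % 2 is 1, which matches the "(1 vs 0)" in the log, while `check_blob_splitability((4,), (0,), (2,))` passes. The actual length used by the failing test is not shown in the log.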

Running ./test_consistent_randperm.py fails with the following error:
F20220224 15:16:32.818838 986938 user_kernel.cpp:171] Check failed: parallel_desc_.hierarchy()->NumAxes() == 1 (2 vs. 1)
*** Check failure stack trace: ***
F20220224 15:16:32.819939 986955 user_kernel.cpp:171] Check failed: parallel_desc_.hierarchy()->NumAxes() == 1 (2 vs. 1)
*** Check failure stack trace: ***
@ 0x7f1cd9c17f97 google::LogMessage::Fail()
@ 0x7f1266515f97 google::LogMessage::Fail()
@ 0x7f1cd9c17ed8 google::LogMessage::SendToLog()
@ 0x7f1266515ed8 google::LogMessage::SendToLog()
@ 0x7f1cd9c177bb google::LogMessage::Flush()
@ 0x7f12665157bb google::LogMessage::Flush()
@ 0x7f1cd9c1b31c google::LogMessageFatal::~LogMessageFatal()
@ 0x7f126651931c google::LogMessageFatal::~LogMessageFatal()
@ 0x7f1cd3066957 oneflow::UserKernelInitAndCacheContext::SbpParallel4ArgNameAndIndex()
@ 0x7f125f964957 oneflow::UserKernelInitAndCacheContext::SbpParallel4ArgNameAndIndex()
@ 0x7f1cd5c83fd5 oneflow::GetOpKernelSeed()
@ 0x7f1cd5c85e81 oneflow::CpuRandPermKernel::CreateOpKernelState()
@ 0x7f1262581fd5 oneflow::GetOpKernelSeed()
@ 0x7f1262583e81 oneflow::CpuRandPermKernel::CreateOpKernelState()
@ 0x7f1cd3060655 oneflow::UserKernel::CreateOpKernelState()
@ 0x7f1cd3060ad8 oneflow::UserKernel::VirtualKernelInit()
@ 0x7f125f95e655 oneflow::UserKernel::CreateOpKernelState()
@ 0x7f1cd2fea86d oneflow::Kernel::Init()
@ 0x7f125f95ead8 oneflow::UserKernel::VirtualKernelInit()
@ 0x7f1cd2feacef oneflow::ConstructKernel()
@ 0x7f1cd30a6e95 oneflow::Actor::Init()
@ 0x7f125f8e886d oneflow::Kernel::Init()
@ 0x7f1cd30bdfa9 oneflow::NewActor()
@ 0x7f125f8e8cef oneflow::ConstructKernel()
@ 0x7f1cd44799c9 oneflow::Thread::ConstructActor()
@ 0x7f1cd4479411 oneflow::Thread::PollMsgChannel()
@ 0x7f125f9a4e95 oneflow::Actor::Init()
@ 0x7f1cd4478ae7 _ZZN7oneflow6ThreadC4ERKNS_8StreamIdEENKUlvE_clEv
@ 0x7f1cd447a4e3 ZSt13__invoke_implIvZN7oneflow6ThreadC4ERKNS0_8StreamIdEEUlvE_JEET_St14__invoke_otherOT0_DpOT1
@ 0x7f1cd447a484 ZSt8__invokeIZN7oneflow6ThreadC4ERKNS0_8StreamIdEEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS7_DpOS8
@ 0x7f125f9bbfa9 oneflow::NewActor()
@ 0x7f1cd447a422 _ZNSt6thread8_InvokerISt5tupleIJZN7oneflow6ThreadC4ERKNS2_8StreamIdEEUlvE_EEE9_M_invokeIJLm0EEEEvSt12_Index_tupleIJXspT_EEE
@ 0x7f1cd447a3e3 _ZNSt6thread8_InvokerISt5tupleIJZN7oneflow6ThreadC4ERKNS2_8StreamIdEEUlvE_EEEclEv
@ 0x7f1cd447a3b8 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7oneflow6ThreadC4ERKNS3_8StreamIdEEUlvE_EEEEE6_M_runEv
@ 0x7f1c7f0d8de4 (unknown)
@ 0x7f1d2c76e609 start_thread
@ 0x7f1d2c8aa293 clone
@ 0x7f1260d779c9 oneflow::Thread::ConstructActor()
@ 0x7f1260d77411 oneflow::Thread::PollMsgChannel()
@ 0x7f1260d76ae7 _ZZN7oneflow6ThreadC4ERKNS_8StreamIdEENKUlvE_clEv
@ 0x7f1260d784e3 ZSt13__invoke_implIvZN7oneflow6ThreadC4ERKNS0_8StreamIdEEUlvE_JEET_St14__invoke_otherOT0_DpOT1
@ 0x7f1260d78484 ZSt8__invokeIZN7oneflow6ThreadC4ERKNS0_8StreamIdEEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS7_DpOS8
@ 0x7f1260d78422 _ZNSt6thread8_InvokerISt5tupleIJZN7oneflow6ThreadC4ERKNS2_8StreamIdEEUlvE_EEE9_M_invokeIJLm0EEEEvSt12_Index_tupleIJXspT_EEE
@ 0x7f1260d783e3 _ZNSt6thread8_InvokerISt5tupleIJZN7oneflow6ThreadC4ERKNS2_8StreamIdEEUlvE_EEEclEv
@ 0x7f1260d783b8 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7oneflow6ThreadC4ERKNS3_8StreamIdEEUlvE_EEEEE6_M_runEv
@ 0x7f120b9d6de4 (unknown)
@ 0x7f12b906c609 start_thread
@ 0x7f12b91a8293 clone
Killing subprocess 986513
Killing subprocess 986514

Additional notes

torch.arange documentation: https://pytorch.org/docs/stable/generated/torch.arange.html
torch.randperm documentation: https://pytorch.org/docs/stable/generated/torch.randperm.html?highlight=randperm

grybd (Contributor Author) commented Feb 24, 2022

After merging the fix-all_Sbp4ArgNameAndIndex_bug branch, running test_consistent_randperm.py still fails with the following error:
F20220224 23:03:03.118664 1706926 user_kernel.cpp:171] Check failed: parallel_desc_.hierarchy()->NumAxes() == 1 (2 vs. 1)
*** Check failure stack trace: ***
F20220224 23:03:03.120785 1706914 user_kernel.cpp:171] Check failed: parallel_desc_.hierarchy()->NumAxes() == 1 (2 vs. 1)

Running test_consistent_arange.py fails with the following error:
File "/home/fengdaochao/fdz/oneflow/oneflow/core/job/job_build_and_infer_ctx.cpp", line 335, in CheckOpBlobSplitability
Check failed: (current_shape.At(axis) % parallel_hierarchy->At(i)) == (0) (1 vs 0) op_name: arange-0 blob_name: out_0 cannot split blob by parallel_hierarchy: 2

@strint @wyg1997 The problem seems fairly clear, but I do not yet know how to fix it. Please take a look when you have time.

wyg1997 (Contributor) commented Feb 24, 2022

After merging the fix-all_Sbp4ArgNameAndIndex_bug branch, running test_consistent_randperm.py still fails with the following error: F20220224 23:03:03.118664 1706926 user_kernel.cpp:171] Check failed: parallel_desc_.hierarchy()->NumAxes() == 1 (2 vs. 1) *** Check failure stack trace: *** F20220224 23:03:03.120785 1706914 user_kernel.cpp:171] Check failed: parallel_desc_.hierarchy()->NumAxes() == 1 (2 vs. 1)

Running test_consistent_arange.py fails with the following error: File "/home/fengdaochao/fdz/oneflow/oneflow/core/job/job_build_and_infer_ctx.cpp", line 335, in CheckOpBlobSplitability Check failed: (current_shape.At(axis) % parallel_hierarchy->At(i)) == (0) (1 vs 0) op_name: arange-0 blob_name: out_0 cannot split blob by parallel_hierarchy: 2

@strint @wyg1997 The problem seems fairly clear, but I do not yet know how to fix it. Please take a look when you have time.

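The second error looks like it is caused by the blob not being splittable. The first seems to come from another call to the SbpParallel4ArgNameAndIndex interface somewhere; once you locate that call, it should not be hard to fix.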
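The first failure is a check that the hierarchy has exactly one axis, hit while the test runs on a 2-axis hierarchy. As an illustration of what a seed rule that avoids the 1-D accessor might look like, the sketch below derives a per-rank seed offset from a rank's coordinates in an N-d hierarchy. This is an assumption for illustration only, not the fix that was merged; `rank_to_coords` and `seed_offset` are hypothetical names.

```python
from typing import List, Sequence


def rank_to_coords(rank: int, hierarchy: Sequence[int]) -> List[int]:
    """Convert a flat rank into row-major coordinates in the hierarchy."""
    coords = []
    for dim in reversed(hierarchy):
        coords.append(rank % dim)
        rank //= dim
    return coords[::-1]


def seed_offset(rank: int, hierarchy: Sequence[int], nd_sbp: Sequence[str]) -> int:
    """Hypothetical per-rank seed offset for an N-d hierarchy.

    Only coordinates along hierarchy axes marked "S" contribute, so ranks
    that differ only along "B" axes share the same offset.
    """
    if len(nd_sbp) != len(hierarchy):
        raise ValueError("nd_sbp needs one entry per hierarchy axis")
    offset = 0
    for coord, dim, sbp in zip(rank_to_coords(rank, hierarchy), hierarchy, nd_sbp):
        if sbp == "S":
            offset = offset * dim + coord
    return offset
```

With a (2, 2) hierarchy and nd_sbp ("S", "B"), ranks 0 and 1 share one offset and ranks 2 and 3 share another, so broadcast peers would generate identical data while split peers would not.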

@grybd grybd requested a review from chengtbf as a code owner March 1, 2022 04:02
@grybd grybd requested review from oneflow-ci-bot and removed request for oneflow-ci-bot April 30, 2022 01:13
github-actions (Contributor)

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/7571/

github-actions (Contributor)

CI failed when running job: cpu-module. PR label automerge has been removed

@grybd grybd added the automerge label May 2, 2022
github-actions bot (Contributor) commented May 2, 2022

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/7571/

github-actions bot (Contributor) commented May 2, 2022

Speed stats:
GPU Name: NVIDIA GeForce GTX 1080 

❌ OneFlow resnet50 time: 129.5ms (= 12948.8ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 142.8ms (= 14280.6ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.10 (= 142.8ms / 129.5ms)

OneFlow resnet50 time: 78.4ms (= 7842.6ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 84.7ms (= 8467.5ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.08 (= 84.7ms / 78.4ms)

OneFlow resnet50 time: 52.7ms (= 10537.4ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 61.8ms (= 12357.3ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.17 (= 61.8ms / 52.7ms)

OneFlow resnet50 time: 40.4ms (= 8075.3ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 44.8ms (= 8965.6ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.11 (= 44.8ms / 40.4ms)

OneFlow resnet50 time: 39.9ms (= 7973.6ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 36.0ms (= 7201.4ms / 200, input_shape=[1, 3, 224, 224])
❌ Relative speed: 0.90 (= 36.0ms / 39.9ms)

OneFlow swin dataloader time: 0.253s (= 50.674s / 200, num_workers=1)
PyTorch swin dataloader time: 0.152s (= 30.311s / 200, num_workers=1)
Relative speed: 0.598 (= 0.152s / 0.253s)

OneFlow swin dataloader time: 0.066s (= 13.210s / 200, num_workers=4)
PyTorch swin dataloader time: 0.043s (= 8.575s / 200, num_workers=4)
Relative speed: 0.649 (= 0.043s / 0.066s)

OneFlow swin dataloader time: 0.036s (= 7.299s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.395s / 200, num_workers=8)
Relative speed: 0.602 (= 0.022s / 0.036s)

❌ OneFlow resnet50 time: 145.4ms (= 14543.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 169.7ms (= 16973.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.17 (= 169.7ms / 145.4ms)

OneFlow resnet50 time: 97.7ms (= 9765.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 111.1ms (= 11114.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.14 (= 111.1ms / 97.7ms)

OneFlow resnet50 time: 80.4ms (= 16079.3ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 88.5ms (= 17691.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.10 (= 88.5ms / 80.4ms)

OneFlow resnet50 time: 63.7ms (= 12741.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 86.7ms (= 17340.3ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.36 (= 86.7ms / 63.7ms)

OneFlow resnet50 time: 56.3ms (= 11263.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 68.7ms (= 13732.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.22 (= 68.7ms / 56.3ms)

github-actions bot (Contributor) commented May 2, 2022

CI failed when running job: cuda-benchmark. PR label automerge has been removed

@github-actions github-actions bot removed the automerge label May 2, 2022
github-actions bot (Contributor) commented May 2, 2022

CI failed when running job: cuda-module. PR label automerge has been removed

grybd (Contributor Author) commented May 2, 2022

FAILED python/oneflow/test/modules/test_one_embedding_ftrl.py::TestOptimizers::test_ftrl
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
==== 1 failed, 854 passed, 180 skipped, 996 warnings in 3232.69s (0:53:52) =====

grybd (Contributor Author) commented May 2, 2022

It runs without problems locally.

@grybd grybd added the automerge label May 2, 2022
@grybd grybd requested review from oneflow-ci-bot and removed request for oneflow-ci-bot May 2, 2022 13:58
github-actions bot (Contributor) commented May 2, 2022

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/7571/

github-actions bot (Contributor) commented May 2, 2022

CI failed when running job: cuda-benchmark. PR label automerge has been removed

@github-actions github-actions bot removed the automerge label May 2, 2022
@grybd grybd added the automerge label May 2, 2022
@grybd grybd requested review from oneflow-ci-bot and removed request for oneflow-ci-bot May 2, 2022 14:43
@mergify mergify bot merged commit eb628c7 into master May 2, 2022
@mergify mergify bot deleted the source_op_support_S branch May 2, 2022 16:46