nn.Graph call and launch impl #5580

Merged
merged 28 commits into from
Jul 24, 2021

@chengtbf chengtbf commented Jul 23, 2021

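Main features:

  • nn.Graph.__call__ invokes compile and launch
  • nn.Graph.__launch__ invokes RunLazyNNGraph
  • nn.Graph.__compile__ registers the input/output/var op names, then starts compile and the runtime
    - [ ] Minimal runnable nn.Graph unit test: debugging of the Relu graph is handed over to @leaves-zwx to finish as follow-up work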
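
The call flow listed above can be sketched roughly as follows. This is a toy illustration only, not OneFlow's implementation: `ToyGraph`, `_compile`, `_launch`, `_run_lazy_nn_graph`, and the generated op names are all hypothetical stand-ins, the "compile only on the first call" behavior is an assumption, and the placeholder for `RunLazyNNGraph` simply runs a ReLU in pure Python.

```python
class ToyGraph:
    """Hypothetical stand-in for nn.Graph; illustrates the call flow only."""

    def __init__(self):
        self._is_compiled = False
        self.input_op_names = []
        self.output_op_names = []
        self.var_op_names = []

    def build(self, x):
        # User-defined computation; here a ReLU over a list of floats.
        return [max(v, 0.0) for v in x]

    def _compile(self, *args):
        # Register input/output/variable op names (made-up names),
        # then mark the graph as ready to run.
        self.input_op_names = [f"input_{i}" for i in range(len(args))]
        self.output_op_names = ["output_0"]
        self.var_op_names = []
        self._is_compiled = True

    def _run_lazy_nn_graph(self, *args):
        # Placeholder for the RunLazyNNGraph call named in the PR.
        return self.build(*args)

    def _launch(self, *args):
        return self._run_lazy_nn_graph(*args)

    def __call__(self, *args):
        # Assumption: compile on the first call, launch on every call.
        if not self._is_compiled:
            self._compile(*args)
        return self._launch(*args)


graph = ToyGraph()
result = graph([-1.0, 0.5, 2.0])
```

Calling `graph(...)` a second time skips `_compile` and goes straight to `_launch`, which is the split between one-time setup and per-call execution that the list describes.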

Fixes a number of bugs:

  • Bug when passing a vector of strings through pybind11
  • Bug when passing a TensorTuple argument to RunLazyNNGraph
  • MultiClientSessionContext omitted the Global ProfilerConf
  • NNGraph did not restore OpAttribute when compiling the Plan (adds PlanUtil::PopulateOpAttibute)
  • Wrong lbn set in MultiClientAddCallbackNotifier
  • CheckOpGraph did not handle the Multi-Client case
  • CallBackNotify input dtype problem; tick dtype changed uint8 -> int8
  • WaitAndSendIds Kernel crashed with a segmentation fault at runtime
    - [ ] After startup the run hangs (two threads at 100% CPU, busy-waiting)

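New features:

  • AutoRegistrationFactory gives a more informative error when looking up a creator fails (reports the template type and the key type)
  • convert_to_tensor_tuple supports an empty list
  • The Kernel registration mechanism (macro) now provides fault tolerance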
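
The idea behind the more informative factory error can be sketched as below. This is a generic Python illustration under stated assumptions, not the C++ `AutoRegistrationFactory` itself: `ToyFactory`, its methods, and the message wording are hypothetical, and the only point carried over from the PR is that a failed lookup should name both the factory's base type and the key type.

```python
class ToyFactory:
    """Hypothetical creator registry; not OneFlow's AutoRegistrationFactory."""

    def __init__(self, base_name):
        self._base_name = base_name
        self._creators = {}

    def register(self, key, creator):
        self._creators[key] = creator

    def create(self, key):
        creator = self._creators.get(key)
        if creator is None:
            # Name the factory's base type and the key's type so the
            # failing lookup can be identified from the message alone.
            raise KeyError(
                f"no creator registered in factory<{self._base_name}> "
                f"for key {key!r} (key type: {type(key).__name__})"
            )
        return creator()


factory = ToyFactory("Kernel")
factory.register("relu", dict)

try:
    factory.create(42)
    error_message = ""
except KeyError as exc:
    error_message = str(exc)
```

With only a bare "key not found" message, a lookup that fails because the key has the wrong type is hard to tell apart from one that fails because the creator was never registered; including both type names removes that ambiguity.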

@chengtbf chengtbf marked this pull request as ready for review July 23, 2021 11:20
@chengtbf
Contributor Author

Debugging the cause of the hang:

Log:

PhysicalRun::RunLazyJob instruction launched.
InstructionsBuilder::RunLazyJob
InstructionsBuilder::RunLazyJob:: new instruction.
InstructionsBuilder::RunLazyJob:: emplace back to instruction list.
cclog: WaitAndSendIdsKernel::GetBuffer name = ReluGraph_0.
LazyJobStreamType::Compute instruction proto: instr_type_name: "RunLazyJob"
.
RunLazyJobInstructionType::Compute begin.
RunLazyJobInstructionType::WaitUntilQueueEmptyIfFrontNNGraphNotEquals.
RunLazyJobInstructionType::GetAllBuffer and Send job_instance.
RunLazyJobInstructionType::EnqueueNNGraph.
LazyJobStreamType::Compute instructio compute.
cclog: WaitAndSendIdsKernel::Recevice job_instance down.
cclog: WaitAndSendIdsKernel::GetBuffer name = ReluGraph_0.

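Preliminary analysis:

Kernel

  • CallbackNotifier was never triggered;
  • WaitAndSendIds has already been triggered, and the kernel's second invocation is already waiting;

Instructions (from outermost to innermost)

  • InstructionsBuilder::RunLazyJob ran to completion
  • RunLazyJobInstructionType also ran to completion
  • One Compute call of LazyJobStreamType also ran to completion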
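
The kernel-side observations above can be cross-checked against the log with a small script. The sketch below copies a subset of the log lines verbatim and counts mentions per component; `count_mentions` is a hypothetical helper, and the reading of "zero mentions" as "never triggered" assumes the callback kernel would have printed a log line had it run, which the log itself does not confirm.

```python
# A subset of the log lines quoted above, copied verbatim.
LOG_LINES = [
    "cclog: WaitAndSendIdsKernel::GetBuffer name = ReluGraph_0.",
    "RunLazyJobInstructionType::Compute begin.",
    "RunLazyJobInstructionType::EnqueueNNGraph.",
    "cclog: WaitAndSendIdsKernel::Recevice job_instance down.",
    "cclog: WaitAndSendIdsKernel::GetBuffer name = ReluGraph_0.",
]


def count_mentions(lines, component):
    """Count the log lines that mention the given component name."""
    return sum(1 for line in lines if component in line)


# GetBuffer appears twice: once before the job instance is received and
# once after, i.e. the kernel is already waiting for a second job.
get_buffer_calls = count_mentions(LOG_LINES, "WaitAndSendIdsKernel::GetBuffer")

# One job instance was received in between.
received = count_mentions(LOG_LINES, "job_instance down")

# No line mentions a callback kernel at all.
callback_mentions = count_mentions(LOG_LINES, "Callback")
```

Two `GetBuffer` calls around a single received job instance, with no callback line, matches the description: the front of the pipeline advanced while the end of it never reported back.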

@chengtbf chengtbf requested a review from oneflow-ci-bot July 23, 2021 12:56
@chengtbf chengtbf added automerge and removed WIP work in progress labels Jul 23, 2021
@github-actions
Contributor

CI failed, removing label automerge

@oneflow-ci-bot oneflow-ci-bot removed their request for review July 23, 2021 18:56
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot July 24, 2021 03:33
@github-actions
Contributor

CI failed, removing label automerge

@oneflow-ci-bot oneflow-ci-bot removed their request for review July 24, 2021 04:51
@github-actions
Contributor

CI failed, removing label automerge

@oneflow-ci-bot oneflow-ci-bot removed their request for review July 24, 2021 08:42
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot July 24, 2021 15:11
@github-actions
Contributor

Speed stats:
GPU Name: GeForce GTX 1080 

PyTorch resnet50 time: 143.1ms (= 7155.0ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 126.2ms (= 6310.9ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
Relative speed: 1.13 (= 143.1ms / 126.2ms)

PyTorch resnet50 time: 86.3ms (= 4313.1ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 74.3ms (= 3713.1ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
Relative speed: 1.16 (= 86.3ms / 74.3ms)

PyTorch resnet50 time: 62.6ms (= 3132.4ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 55.9ms (= 2796.4ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
Relative speed: 1.12 (= 62.6ms / 55.9ms)

PyTorch resnet50 time: 55.2ms (= 2759.7ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 48.1ms (= 2402.7ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
Relative speed: 1.15 (= 55.2ms / 48.1ms)

PyTorch resnet50 time: 50.3ms (= 2515.6ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 68.6ms (= 3430.3ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
Relative speed: 0.73 (= 50.3ms / 68.6ms)
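
The per-iteration times and relative speeds reported above follow directly from the total times. The short sketch below reproduces that arithmetic from the totals in the report; the dictionary layout and variable names are illustrative choices, not part of the CI tooling.

```python
ITERATIONS = 50

# batch size -> (PyTorch total ms, OneFlow total ms), taken from the report.
TOTALS_MS = {
    16: (7155.0, 6310.9),
    8: (4313.1, 3713.1),
    4: (3132.4, 2796.4),
    2: (2759.7, 2402.7),
    1: (2515.6, 3430.3),
}

# Mean time per iteration, rounded to one decimal as in the report.
per_iter_ms = {
    batch: (round(pt / ITERATIONS, 1), round(of / ITERATIONS, 1))
    for batch, (pt, of) in TOTALS_MS.items()
}

# Relative speed = PyTorch time / OneFlow time; above 1 means OneFlow is faster.
relative_speed = {
    batch: round(pt / of, 2) for batch, (pt, of) in TOTALS_MS.items()
}
```

By this measure OneFlow is faster at batch sizes 2 through 16 (ratios between 1.12 and 1.16) and slower at batch size 1 (ratio 0.73).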

@oneflow-ci-bot oneflow-ci-bot merged commit 3dcb993 into master Jul 24, 2021
@oneflow-ci-bot oneflow-ci-bot deleted the dev_cc_nn_graph_run branch July 24, 2021 16:17